nachetecanon / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager
Apache License 2.0

Support Compressed Content Feeds #153

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
To improve feed throughput of content+metadata feeds, we have decided to compress the document content in the feed. We have decided to use ZLIB to compress the raw content, then Base64 encode the result.

This is done using a new CompressedFilterInputStream that sits between the content InputStream and the Base64FilterInputStream in DocPusher.
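For illustration only, here is a minimal sketch of that chaining using stock JDK classes; DeflaterOutputStream and java.util.Base64 stand in for the CM's CompressedFilterInputStream and Base64FilterInputStream, and this is not the actual DocPusher code:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.zip.DeflaterOutputStream;

    public class CompressThenEncodeSketch {
        public static void main(String[] args) throws IOException {
            byte[] rawContent = "sample document body".getBytes(StandardCharsets.UTF_8);
            ByteArrayOutputStream feedBuffer = new ByteArrayOutputStream();

            // raw content -> ZLIB compression -> Base64 encoding -> feed buffer
            OutputStream base64 = Base64.getEncoder().wrap(feedBuffer);
            try (OutputStream zlibThenBase64 = new DeflaterOutputStream(base64)) {
                zlibThenBase64.write(rawContent);
            }  // closing finishes the deflate stream and flushes the Base64 padding

            // The result contains no XML special characters, so it can go into the
            // feed's content element without escaping.
            System.out.println(feedBuffer.toString("UTF-8"));
        }
    }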

The use of compression is subject to appropriate support in the GSA, so the FeedConnection interface was enhanced to provide the supported content encodings (among other enhancements).

Original issue reported on code.google.com by Brett.Mi...@gmail.com on 26 May 2009 at 4:39

GoogleCodeExporter commented 8 years ago
================
From Brett:

Marty,

We almost certainly want to compress, then Base64 encode the
compressed data.  Three reasons:
  1)  I have a gut (unproven) feeling that gzip might do a better 
       job compressing the raw data vs the encoded data.
  2) That is the order that the iHarder Base64 encoder uses.
  3) The resulting Base64 encoding has no XML special characters
       that would need to be escaped in the feed.

I prefer gzip over zip or rar for this purpose because gzip does 
not do archiving, so uncompressing would not result in the 
possibility of producing multiple files.

As far as sending multiple documents in the feed goes, it is my
understanding that the status return simply acknowledges the
receipt of the feed XML file.  The feed file is then queued for 
future processing.  Any failures processing the feed data are
not communicated back to the CM due to the asynchronous
feed model.

From John:

I'm sure there are compression experts running around. Here's a quick test using the Perl MIME::Decoder::Base64 module:

Word file: 11028 KB
Base64-encoded, then gzipped: 4097 KB in 1.7 seconds
gzipped, then Base64-encoded: 4126 KB in 0.8 seconds

google-connectors log file: 49874 KB
Base64-encoded, then gzipped: 15870 KB in 10.9 seconds
gzipped, then Base64-encoded: 7538 KB in 3.5 seconds

I used a different Base64 encoder, so the timing is not directly relevant. It makes sense, though, because the gzip-first approach touches less data, unless the compression increases the size by more than a nominal amount.

If it's Base64-encoded then gzipped, we have to gzip the entire XML envelope of the feed, whereas if it's gzipped then Base64-encoded, that's just a variation on the content element as you mentioned.

I don't know what technology the feedergate uses, but we should check whether it supports Content-Encoding: gzip using Apache/mod_deflate or similar. That may be even easier than supporting gzip in the parser, although it may not perform as well.

I agree with Brett on the feed status. All we get back is that the server accepted the feed, which isn't different for multiple documents compared to a single document. I'm OK with that.

It would be much trickier to break the Base64 into lines just for the teedFeedFile. It would be doable, but it would mean extra scanning of the stream. Can you run that by the GSA team, too, whether the feed parser could handle the line breaks? Also, whether it handles arbitrary breaks or would require 72- or 76-character lines? The goal is just to not have 16 MB lines in a file, so it might be faster and easier to break lines when the I/O loop finishes encoding a 2K chunk or whatever size it is.

John L
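For illustration of the line breaking John describes, the JDK's MIME-style encoder wraps at 76 characters; whether the GSA feed parser tolerates such breaks is exactly the open question above, and this sketch is not part of the CM code:

    import java.util.Arrays;
    import java.util.Base64;

    public class LineBreakSketch {
        public static void main(String[] args) {
            byte[] compressedContent = new byte[64 * 1024];  // stand-in for ZLIB output

            // MIME-style Base64 inserts a line separator after every 76 encoded
            // characters, so the teed feed file never contains one huge line.
            Base64.Encoder mime = Base64.getMimeEncoder(76, "\n".getBytes());
            String wrapped = mime.encodeToString(compressedContent);

            int longest = Arrays.stream(wrapped.split("\n"))
                    .mapToInt(String::length).max().orElse(0);
            System.out.println("longest line: " + longest);  // 76
        }
    }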

Original comment by Brett.Mi...@gmail.com on 26 May 2009 at 4:41

GoogleCodeExporter commented 8 years ago

On Wed, May 20, 2009 at 11:02 PM, Mohit Oberoi wrote:
Hi,

John and Marty mentioned supporting compressed content feeds, and I saw the related comments in the connector bug as well. I looked at some code and ran some tests, and it would be easy to implement compression if we do this:

Content is compressed using zlib (http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/Deflater.html) and the base-64 encoded. The rest of the feed file remains unchanged. Does this make sense or do we have more drastic/long-term changes in mind? A simple test I ran using a 12 MB MS Word file resulted in ~14 MB when just base-64 encoded and ~2 MB when compressed+base-64 encoded, so the benefits are good.

thanks,
-mohit
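Mohit's size comparison can be roughly reproduced with the JDK alone; a sketch, where the file path is a placeholder and the exact numbers depend on the content:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Base64;
    import java.util.zip.DeflaterOutputStream;

    public class FeedSizeSketch {
        public static void main(String[] args) throws IOException {
            // Any large document stands in for the 12 MB Word file.
            byte[] raw = Files.readAllBytes(Paths.get(args[0]));

            // Base64 only: roughly a 4/3 size increase over the raw bytes.
            int base64Only = Base64.getEncoder().encode(raw).length;

            // ZLIB-compress first, then Base64 encode the compressed bytes.
            ByteArrayOutputStream zlib = new ByteArrayOutputStream();
            try (DeflaterOutputStream deflate = new DeflaterOutputStream(zlib)) {
                deflate.write(raw);
            }
            int zlibThenBase64 = Base64.getEncoder().encode(zlib.toByteArray()).length;

            System.out.printf("raw=%d  base64=%d  zlib+base64=%d%n",
                    raw.length, base64Only, zlibThenBase64);
        }
    }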

From Marty:

Mohit,

I just added you to a related email discussion John and Brett and I had on the subject. It's rather long because it contains a code review, so I've attached the relevant parts below.

At a high level, "yes" this makes sense and is what we had in mind (assuming s/and the base-64/and then base-64/ in your summary - that is, content is compressed and *then* base-64 encoded).

I believe Brett was lobbying for gzip rather than zlib, and I don't know if the resulting compression would be different. I know there are license issues on the GSA, so let us know if that's a problem.

I'll update the bug with this information.

    Marty

From Brett:

One more thing.  If we support this in the connector manager,
we need some way of finding out if the feed host supports
compressed, base64 encoded content.

From Marty:

I agree this will be needed in some form. We can either pass an argument to the feedergate like Mohit's current implementation, or something could be set at the time the GSA is registered with the CM. Since we're moving toward having the Installer perform the registration, that would mean we need to be able to detect the feature using GData.

I also agree, something explicit in the feed DTD would be preferred to sniffing the initial bytes, but I'm not sure how standard that operation is.

From Mohit:

The dtd is exposed in the search frontend (e.g. http://stdioc10/gsafeed.dtd), so if we add a different attribute value for compressed {I like that since it is explicit}, the installer can auto-detect by fetching the page and, if that fails, asking the customer. To keep things consistent, I can make the feedergate port (19900) also expose the dtd.

Will look into the zlib licensing issues a bit later today and get back to you.

thanks,
-mohit

From Mohit:
ok, so this is what I found:

We are allowed to use both zlib and gzip on the GSA. I have tested zlib and gzip end-to-end (created a Java program that compresses and then base-64 encodes the content, and then modified the GSA code to handle both cases and index the doc). From what I can tell, it is very easy to detect gzip compressed content (it has a header), but there is no simple (non-hacky) way of doing this for zlib compression. So I am okay with either of them. One "advantage" of using gzip is that we don't need to introduce another attribute (we can keep using base64binary) and have a different way to detect if the GSA can accept gzip data (a comment in the dtd or endpoint, or a check of the software version), but as we discussed, having a separate attribute, e.g. base64binarycompressed, will be more explicit. GSA code will be easier if we just use gzip and do auto-detection, but the difference is so minor that it doesn't matter at all.

Let me know how we want to proceed.

-mohit

From Marty:
I'll let you and Brett work out the details.  A question and a comment:

Question: What do we do with a document from the Connector that was already a zip file? For example, Qualcomm had those 300 MB zip files containing hundreds of source code files. From what I understand, the GSA won't be able to do anything good with that file. If we're not able to detect it on the CM side and decide to compress, encode and send, it seems this might be a good argument for using an explicit attribute. Otherwise you won't know if it was a single document compressed as part of sending the content or an original (evil) archive containing several documents.

Comment: It seems the gzip format was designed to retain directory info about a file, but we have most of that info as metadata, so I believe the zlib format would be slightly better. It was designed for in-memory and communication-channel apps, and has a more compact header and trailer and uses a faster integrity check than gzip. But again, your call.

Take care,
     Marty

From Brett:

This was one of the primary reasons I favored gzip over zip.  
On the other hand, TraversalContext currently explicitly excludes
compressed archive files for indexing.

However, if you look at the java.util.zip.Deflater(int, boolean) constructor, it says:
  "If 'nowrap' is true then the ZLIB header and checksum fields will not be used in order to support the compression format used in both GZIP and PKZIP."

The fact that it mentions "ZLIB header and checksum fields" means we might be able to identify the ZLIB header. But closer inspection indicates the "header" is extremely difficult to identify. The only thing known for sure is that the low nibble of the first byte must be 8.

You might also rely on the special GZIP header that ZLIB generates in gzip compatible mode. From the zlib.h doc:
"The gzip header will have no file name, no extra data, no comment, no modification time (set to zero), no header crc, and the operating system will be set to 255 (unknown). If a gzip stream is being written, strm->adler is a crc32 instead of an adler32."

At this point, I would agree that using unwrapped ZLIB, plus an explicit attribute to signal compression, would be best.

Brett Johnson
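For illustration of why they settled on an explicit attribute, the header checks being weighed above look roughly like this sketch: gzip has a fixed magic number, while the zlib test is only a heuristic (deflate method nibble plus the RFC 1950 FCHECK rule) that plenty of unrelated data also passes:

    public class CompressionSniffSketch {
        // GZIP streams start with the fixed magic bytes 0x1f 0x8b.
        static boolean looksLikeGzip(byte[] head) {
            return head.length >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
        }

        // ZLIB heuristic: low nibble of the first byte is 8 (deflate) and the
        // two-byte header is a multiple of 31 (FCHECK, RFC 1950).
        static boolean mightBeZlib(byte[] head) {
            if (head.length < 2) {
                return false;
            }
            int cmf = head[0] & 0xff;
            int flg = head[1] & 0xff;
            return (cmf & 0x0f) == 8 && ((cmf << 8) + flg) % 31 == 0;
        }

        public static void main(String[] args) {
            byte[] zlibDefault = {0x78, (byte) 0x9c};  // common zlib default-level header
            byte[] gzipMagic = {0x1f, (byte) 0x8b};
            System.out.println(mightBeZlib(zlibDefault) + " " + looksLikeGzip(gzipMagic));
        }
    }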

From Mohit:
Sounds good. zlib and base64binarycompressed it is.

I will expose the dtd via port 19900 as well (0:19900/dtd). The mere presence of it will be sufficient to know if compression is accepted, or you can check for the attribute.

I will update the bug with this info momentarily.

-mohit 

Original comment by Brett.Mi...@gmail.com on 26 May 2009 at 4:50

GoogleCodeExporter commented 8 years ago

Original comment by Brett.Mi...@gmail.com on 29 May 2009 at 4:32

GoogleCodeExporter commented 8 years ago
Fixed 13 June 2009 in Connector Manager revision r2128
To improve feed throughput of content+metadata feeds, we have decided to compress the document content in the feed. We have decided to use ZLIB to compress the raw content, then Base64 encode the result.

This is done using a new CompressedFilterInputStream that sits between the content InputStream and the Base64FilterInputStream in DocPusher.

The use of compression is subject to appropriate support in the GSA, so the FeedConnection interface was enhanced to provide the supported content encodings (among other enhancements).

Original comment by Brett.Mi...@gmail.com on 13 Jul 2009 at 9:23

GoogleCodeExporter commented 8 years ago
Although the CM side of the work is done, I am leaving this issue open until I can test against a GSA that supports compressed feeds.

Original comment by Brett.Mi...@gmail.com on 13 Jul 2009 at 9:24

GoogleCodeExporter commented 8 years ago
In the latest 6.2 beta, the dtd is returned via

http://gsa:19900/getdtd

and the encodings supported are:

base64binary | base64compressed

Original comment by Brett.Mi...@gmail.com on 9 Sep 2009 at 9:25

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 19 Sep 2009 at 4:03

GoogleCodeExporter commented 8 years ago
Full support for compressed content feeds was added 11 September 2009 in revision r2246.

Adds detection of GSA compressed feed support.

This change enables GsaFeedConnection to determine whether the configured GSA supports compressed content feeds. GSAs that support compressed content respond to a getdtd request on the feed port. The returned DTD includes content encodings of (base64binary|base64compressed).

As it is, any GSA that returns a DTD also supports compressed content feeds, so I merely check for the presence of the dtd. In the future we should extract the supported content encodings from the returned DTD.
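A rough sketch of the kind of probe described here, assuming the getdtd endpoint and the base64compressed token from the comments above; the host argument, timeouts, and string match are illustrative, and this is not the actual GsaFeedConnection code:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class FeedEncodingProbeSketch {
        // A GSA that answers the getdtd request on the feed port supports
        // compressed content feeds; checking the returned DTD for the
        // base64compressed encoding is the stricter test suggested above.
        static boolean supportsCompressedFeeds(String gsaHost) {
            try {
                URL dtdUrl = new URL("http://" + gsaHost + ":19900/getdtd");
                HttpURLConnection conn = (HttpURLConnection) dtdUrl.openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                    return false;
                }
                try (InputStream in = conn.getInputStream()) {
                    String dtd = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    return dtd.contains("base64compressed");
                }
            } catch (IOException e) {
                return false;  // no DTD endpoint: assume base64binary only
            }
        }

        public static void main(String[] args) {
            System.out.println(supportsCompressedFeeds(args[0]));
        }
    }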

Original comment by Brett.Mi...@gmail.com on 6 Oct 2009 at 9:17

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 27 Oct 2009 at 11:05