================
From Brett:
Marty,
We almost certainly want to compress, then Base64 encode the
compressed data. Three reasons:
1) I have a gut (unproven) feeling that gzip might do a better
job compressing the raw data vs the encoded data.
2) That is the order that the iHarder Base64 encoder uses.
3) The resulting Base64 encoding has no XML special characters
that would need to be escaped in the feed.
I prefer gzip over zip or rar for this purpose because gzip does
not do archiving, so uncompressing would not result in the
possibility of producing multiple files.
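Brett's ordering can be sketched in Java. This is an illustration only, not the connector manager's actual code; `GZIPOutputStream` and the modern `java.util.Base64` class stand in for the iHarder encoder he mentions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class GzipThenBase64 {
    // Compress the raw bytes first, then Base64-encode the compressed result.
    static String encode(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        }
        // Base64 output uses only [A-Za-z0-9+/=], so nothing needs XML escaping.
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "raw document content".getBytes(StandardCharsets.UTF_8);
        System.out.println(encode(raw));
    }
}
```

Reason 3 falls out of the encoding alphabet: the result can be dropped into the feed's content element verbatim.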
As far as sending multiple documents in the feed goes, it is my
understanding that the status return simply acknowledges the
receipt of the feed XML file. The feed file is then queued for
future processing. Any failures processing the feed data are
not communicated back to the CM due to the asynchronous
feed model.
From John:
I'm sure there are compression experts running around. Here's a quick test
using the Perl
MIME::Decoder::Base64 module:
Word file: 11028 KB
Base64-encoded, then gzipped: 4097 KB in 1.7 seconds
gzipped, then Base64-encoded: 4126 KB in 0.8 seconds
google-connectors log file: 49874 KB
Base64-encoded, then gzipped: 15870 KB in 10.9 seconds
gzipped, then Base64-encoded: 7538 KB in 3.5 seconds
I used a different Base64 encoder, so the timing is not directly relevant. It
makes sense, though, because
the gzip-first approach touches less data, unless the compression increases the
size by more than a
nominal amount.
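John's size observation is easy to reproduce. Here is a small synthetic Java check; the log-like input is fabricated, not John's actual files, so the exact numbers will differ, but the ordering effect is the same.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class OrderComparison {
    static byte[] gzip(byte[] in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(in);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive input, similar in spirit to a connector log file.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("INFO connector feed line ").append(i).append('\n');
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

        int gzipFirst = Base64.getEncoder().encode(gzip(raw)).length;
        int base64First = gzip(Base64.getEncoder().encode(raw)).length;

        // Base64 obscures byte-level redundancy, so compressing first wins.
        System.out.println("gzip then Base64:  " + gzipFirst + " bytes");
        System.out.println("Base64 then gzip:  " + base64First + " bytes");
    }
}
```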
If it's Base64-encoded then gzipped, we have to gzip the entire XML envelope of
the feed, whereas if it's gzipped then Base64-encoded, that's just a variation
on the content element as you mentioned.
I don't know what technology the feedergate uses, but we should check whether
it supports Content-Encoding: gzip using Apache/mod_deflate or similar. That
may even be easier than supporting gzip in the parser, although it may not
perform as well.
I agree with Brett on the feed status. All we get back is that the server
accepted the feed, which isn't
different for multiple documents compared to a single document. I'm OK with
that.
It would be much trickier to break the Base64 into lines just for the
teedFeedFile. It would be doable, but it would mean extra scanning of the
stream. Can you run that by the GSA team, too, whether the feed parser could
handle the line breaks? Also, whether it handles arbitrary breaks or would
require 72- or 76-character lines? The goal is just to not have 16 MB lines in
a file, so it might be faster and easier to break lines when the I/O loop
finishes encoding a 2K chunk or whatever size it is.
John L
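One way to get the short lines John asks about is MIME-style wrapping, sketched here with the JDK's built-in encoder. This is an illustration only; whether the GSA's feed parser accepts wrapped Base64 is exactly the open question above.

```java
import java.util.Base64;

public class WrappedBase64 {
    // MIME-style Base64: output wrapped at 76 characters with CRLF separators.
    static String encodeWrapped(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[200];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        String wrapped = encodeWrapped(data);
        // No line exceeds 76 characters, so a 16 MB document never produces
        // a single 16 MB line in the teed feed file.
        for (String line : wrapped.split("\r\n")) {
            System.out.println(line.length() + " chars: " + line);
        }
    }
}
```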
Original comment by Brett.Mi...@gmail.com
on 26 May 2009 at 4:41
On Wed, May 20, 2009 at 11:02 PM, Mohit Oberoi wrote:
Hi,
John and Marty mentioned supporting compressed content feeds and I saw the
comments in the connector
bug as well related to this. I looked at some code and ran some tests and it
would be easy to implement
compression if we do this:
content is compressed using zlib
(http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/Deflater.html) and
the base-64 encoded. Rest of the feed file remains unchanged. Does this make
sense or do we have more drastic/long-term changes in mind? A simple test I ran
using a 12 MB MS Word file resulted in ~14 MB when just base-64 encoded and
~2 MB when compressed+base-64 encoded, so the benefits are good.
thanks,
-mohit
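Mohit's experiment can be approximated like this. It is a sketch with synthetic input; `java.util.Base64` stands in for whatever base-64 encoder his test actually used, and the document text is fabricated.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;

public class ZlibThenBase64 {
    // Compress with zlib (Deflater's default wrapped format), then
    // base-64 encode; the rest of the feed XML is left untouched.
    static String encode(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(raw);
        }
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Synthetic stand-in for a large, compressible Word document.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50000; i++) {
            sb.append("Some fairly repetitive document text. ");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

        int base64Only = Base64.getEncoder().encode(raw).length;  // ~4/3 of raw
        int compressed = encode(raw).length();                    // far smaller
        System.out.println(raw.length + " raw bytes, " + base64Only
                + " base-64 only, " + compressed + " compressed + base-64");
    }
}
```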
From Marty:
Mohit,
I just added you to a related email discussion John and Brett and I had on the
subject. It's rather long
because it contains a code review so I've attached the relevant parts below.
At a high level, "yes" this makes sense and is what we had in mind (assuming
s/and the base-64/and then base-64/ in your summary - that is, content is
compressed and *then* base-64 encoded).
I believe Brett was lobbying for gzip rather than zlib and I don't know if the
resulting compression would be
different. I know there are license issues on the GSA so let us know if that's
a problem.
I'll update the bug with this information.
Marty
From Brett:
One more thing. If we support this in the connector manager,
we need some way of finding out if the feed host supports
compressed, base64 encoded content.
From Marty:
I agree this will be needed in some form. We can either pass an argument to
the feedergate like Mohit's
current implementation or something could be set at the time the GSA is
registered with the CM. Since
we're moving toward having the Installer perform the registration, that would
mean we need to be able to detect the feature using GData.
I also agree that something explicit in the feed DTD would be preferable to
sniffing the initial bytes, but I'm not sure how standard that operation is.
From Mohit:
The dtd is exposed in the search frontend (e.g. http://stdioc10/gsafeed.dtd),
so if we add a different attribute value for compressed {I like that since it
is explicit}, the installer can auto-detect by fetching the page and, if that
fails, asking the customer. To keep things consistent, I can make the
feedergate port (19900) also expose the dtd.
Will look into the zlib licensing issues a bit later today and get back to you.
thanks,
-mohit
From Mohit:
ok, so this is what I found:
We are allowed to use both zlib and gzip on the GSA. I have tested zlib and
gzip end-to-end (created a Java
program that compresses and then base-64 encodes the content and then modified
the GSA code to handle
both cases and index the doc). From what I can tell, it is very easy to detect
gzip-compressed content (it has a header), but there is no simple (non-hacky)
way of doing this for zlib compression. So I am okay with either of them. One
"advantage" of using gzip is that we don't need to introduce another attribute
(we can keep using base64binary) and have a different way to detect whether the
GSA can accept gzip data (a comment in the dtd or endpoint, or a software
version check), but as we discussed, having a separate attribute, e.g.
base64binarycompressed, will be more explicit. The GSA code will be easier if
we just use gzip and do auto-detection, but the difference is so minor that it
doesn't matter at all.
Let me know how we want to proceed.
-mohit
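What "easy for gzip, hacky for zlib" means concretely can be sketched as below. The byte-level facts come from the gzip and zlib format specifications; the helper names are my own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

public class CompressionSniff {
    // gzip streams always begin with the two magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(byte[] b) {
        return b.length >= 2 && (b[0] & 0xff) == 0x1f && (b[1] & 0xff) == 0x8b;
    }

    // zlib offers much less: the low nibble of the first byte (the compression
    // method) must be 8, and the first two bytes, read big-endian, must be
    // divisible by 31. Plenty of non-zlib data passes this test, which is why
    // sniffing zlib is considered hacky.
    static boolean mightBeZlib(byte[] b) {
        if (b.length < 2 || (b[0] & 0x0f) != 8) {
            return false;
        }
        int header = ((b[0] & 0xff) << 8) | (b[1] & 0xff);
        return header % 31 == 0;
    }

    static byte[] gzip(byte[] in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(in);
        }
        return buf.toByteArray();
    }

    static byte[] zlib(byte[] in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(in);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello feed".getBytes();
        System.out.println(looksLikeGzip(gzip(data)));  // true
        System.out.println(mightBeZlib(zlib(data)));    // true
    }
}
```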
From Marty:
I'll let you and Brett work out the details. A question and a comment:
Question: What do we do with a document from the Connector that was already a
zip file? For example,
Qualcomm had those 300 MB zip files containing hundreds of source code files.
From what I understand,
the GSA won't be able to do anything good with that file. If we're not able to
detect it on the CM side and
decide to compress, encode and send, seems this might be a good argument for
using explicit attribute.
Otherwise you won't know if it was a single document compressed as part of
sending the content or an
original (evil) archive containing several documents.
Comment: Seems the gzip format was designed to retain directory info about a
file, but we have most of that info as metadata, so I believe the zlib format
would be slightly better. It was designed for in-memory and communication-
channel apps and has a more compact header and trailer and uses a faster
integrity check than gzip. But again, your call.
Take care,
Marty
From Brett:
This was one of the primary reasons I favored gzip over zip.
On the other hand, TraversalContext currently explicitly excludes
compressed archive files from indexing.
However, if you look at java.util.zip.Deflater(int, boolean) constructor,
it says:
" If 'nowrap' is true then the ZLIB header and checksum fields
will not be used in order to support the compression format
used in both GZIP and PKZIP."
The fact that it mentions "ZLIB header and checksum fields"
means we might be able to identify the ZLIB header. But closer
inspection indicates the "header" is extremely difficult to
identify. The only thing known for sure is that the low nibble
of the first byte must be 8.
You might also rely on the special GZIP header that ZLIB
generates in gzip compatible mode. From the zlib.h doc:
"The gzip header will have no
file name, no extra data, no comment, no modification time (set to zero),
no header crc, and the operating system will be set to 255 (unknown). If a
gzip stream is being written, strm->adler is a crc32 instead of an adler32."
At this point, I would agree that using unwrapped ZLIB,
plus an explicit attribute to signal compression, would be best.
Brett Johnson
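The `nowrap` behavior Brett quotes can be seen directly. A small sketch; note that the JDK's `Inflater` javadoc requires one extra dummy input byte when inflating in nowrap mode.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class NowrapDemo {
    // nowrap = true: emit raw deflate data with no ZLIB header or checksum.
    static byte[] deflateRaw(byte[] raw) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(raw);
        def.finish();
        byte[] out = new byte[raw.length + 64];
        int n = def.deflate(out);
        def.end();
        return Arrays.copyOf(out, n);
    }

    static byte[] inflateRaw(byte[] compressed, int originalLength)
            throws DataFormatException {
        Inflater inf = new Inflater(true);
        // In nowrap mode the Inflater javadoc requires one extra dummy byte.
        inf.setInput(Arrays.copyOf(compressed, compressed.length + 1));
        byte[] back = new byte[originalLength];
        inf.inflate(back);
        inf.end();
        return back;
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] raw = "some document content".getBytes();
        byte[] compressed = deflateRaw(raw);
        System.out.println(new String(inflateRaw(compressed, raw.length)));
    }
}
```

Both sides must agree on nowrap: data deflated this way cannot be inflated by a default (wrapped) `Inflater`.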
From Mohit:
Sounds good. zlib and base64binarycompressed it is.
I will expose the dtd via port 19900 as well (0:19900/dtd). The mere presence
of it will be sufficient to know if compression is accepted, or you can check
for the attribute.
I will update the bug with this info momentarily.
-mohit
Original comment by Brett.Mi...@gmail.com
on 26 May 2009 at 4:50
Original comment by Brett.Mi...@gmail.com
on 29 May 2009 at 4:32
Fixed 13 June 2009 in Connector Manager revision r2128
To improve feed throughput of content+metadata feeds, we have decided to
compress the document content
in the feed. We have decided to use ZLIB to compress the raw content, then
Base64 encode the result.
This is done using a new CompressedFilterInputStream that sits between the
content InputStream and the
Base64FilterInputStream in DocPusher.
The use of compression is subject to appropriate support in the GSA, so the
FeedConnection interface was
enhanced to provide the supported content encodings (among other enhancements).
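The filter-stream arrangement can be approximated as below. This is a simplified stand-in for the CM's `CompressedFilterInputStream` and `Base64FilterInputStream`, which are not shown here; `java.util.zip.DeflaterInputStream` produces the same wrapped ZLIB format, and the modern `java.util.Base64` class replaces the streaming encoder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;
import java.util.zip.DeflaterInputStream;
import java.util.zip.InflaterInputStream;

public class FeedEncodingChain {
    // Stand-in for the DocPusher chain: content stream -> zlib compression
    // -> Base64 encoding, read in small chunks.
    static String encodeContent(InputStream content) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream deflated = new DeflaterInputStream(content)) {
            byte[] chunk = new byte[2048];
            int n;
            while ((n = deflated.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    // The GSA side reverses the steps: Base64-decode, then inflate.
    static byte[] decodeContent(String encoded) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream inflated = new InflaterInputStream(
                new ByteArrayInputStream(Base64.getDecoder().decode(encoded)))) {
            byte[] chunk = new byte[2048];
            int n;
            while ((n = inflated.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "document body".getBytes();
        String enc = encodeContent(new ByteArrayInputStream(raw));
        System.out.println(new String(decodeContent(enc)));
    }
}
```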
Original comment by Brett.Mi...@gmail.com
on 13 Jul 2009 at 9:23
Although the CM side of the work is done, I am leaving this issue open until I
can test against a GSA that
supports compressed feeds.
Original comment by Brett.Mi...@gmail.com
on 13 Jul 2009 at 9:24
In the latest 6.2 beta, the dtd is returned via
http://gsa:19900/getdtd
and the encodings supported are:
base64binary | base64compressed
Original comment by Brett.Mi...@gmail.com
on 9 Sep 2009 at 9:25
Original comment by jl1615@gmail.com
on 19 Sep 2009 at 4:03
Full support for compressed content feeds was added 11 September 2009 in
revision r2246
Adds detection of GSA compressed feed support.
This change enables GsaFeedConnection to determine whether
the configured GSA supports compressed content feeds.
GSAs that support compressed content respond to a getdtd
request on the feed port. The returned DTD includes
content encodings of (base64binary|base64compressed).
As it is, any GSA that returns a DTD also supports
compressed content feeds, so I merely check for the
presence of the dtd. In the future we should extract the
supported content encodings from the returned DTD.
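The detection described above could be sketched like this. The ATTLIST fragment in the example is an assumed shape for illustration, not the literal gsafeed.dtd text, and the method name is mine.

```java
public class DtdEncodingCheck {
    // Decide whether a GSA supports compressed feeds from its DTD (or its
    // absence). Per the change description, any GSA that returns a DTD at all
    // currently supports compression; checking the declared encodings is the
    // future-proof variant implemented here.
    static boolean supportsCompression(String dtdOrNull) {
        if (dtdOrNull == null) {
            return false;  // no DTD returned: assume an older GSA
        }
        return dtdOrNull.contains("base64compressed");
    }

    public static void main(String[] args) {
        String dtd = "<!ATTLIST content encoding "
                + "(base64binary|base64compressed) \"base64binary\">";
        System.out.println(supportsCompression(dtd));   // true
        System.out.println(supportsCompression(null));  // false
    }
}
```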
Original comment by Brett.Mi...@gmail.com
on 6 Oct 2009 at 9:17
Original comment by jl1615@gmail.com
on 27 Oct 2009 at 11:05
Original issue reported on code.google.com by
Brett.Mi...@gmail.com
on 26 May 2009 at 4:39