iipc / openwayback

The OpenWayback Development
http://www.netpreserve.org/openwayback
Apache License 2.0

Set a value for compressed record length via WARCRecordToSearchResultAdapter.java #272

Open ldko opened 9 years ago

ldko commented 9 years ago

Gina at LOC is trying to generate CDX files of the format ' CDX N b a m s k r M S V g'. Currently the S field for compressed length is not being set, rather it gets the default value of '-'.

From what I can follow when running the cdx-indexer and trying to get the S field: https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java#L288 is getting null, so it sets the compressed length to -1.

https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/format/CompressedLengthCDXField.java#L40 is getting -1 each time and thus is writing the DEFAULT_VALUE of "-".

In genericResult of WARCRecordToSearchResultAdapter.java there could be a result.setCompressedLength(compressed_record_length); to fix the issue. I am not sure of the best way to get compressed_record_length here.
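One possible way to obtain the value, sketched below, assumes each record is stored as its own gzip member, so a record's compressed length is the distance from its start offset to the start of the next record (or to the end of the file for the last record). The class and method names are hypothetical and not part of OpenWayback, and the `< 0` check only approximates the "-1 is written as '-'" behaviour described above.

```java
/** Hypothetical helper for illustration only; not an OpenWayback class. */
final class CompressedLengthSketch {
    /** Placeholder written when no compressed length is known. */
    static final String DEFAULT_VALUE = "-";

    /**
     * Derives each record's compressed length from the offsets at which the
     * records start. Assumes one gzip member per record and offsets in
     * ascending order; the last record is taken to end at fileLength.
     */
    static long[] compressedLengths(long[] startOffsets, long fileLength) {
        long[] lengths = new long[startOffsets.length];
        for (int i = 0; i < startOffsets.length; i++) {
            long end = (i + 1 < startOffsets.length) ? startOffsets[i + 1] : fileLength;
            lengths[i] = end - startOffsets[i];
        }
        return lengths;
    }

    /** Writes an unset (negative) length as "-", otherwise the number. */
    static String formatField(long compressedLength) {
        return compressedLength < 0 ? DEFAULT_VALUE : Long.toString(compressedLength);
    }

    public static void main(String[] args) {
        // Made-up offsets for a three-record file of 3000 bytes.
        long[] lengths = compressedLengths(new long[] {0L, 512L, 2048L}, 3000L);
        for (long length : lengths) {
            System.out.println(formatField(length));
        }
        System.out.println(formatField(-1L));
    }
}
```

The computed value would then be what gets passed to result.setCompressedLength(...) in the adapter. Whether the indexer has the next record's offset available at that point is something I have not checked.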

gmj2053 commented 9 years ago

We would like the field if at all possible. Thanks

kris-sigur commented 9 years ago

Isn't the entire record (including the WARC header) compressed as a single entity? If so, compressed record length is meaningless unless you are defining a record as including the WARC header. Is that what you want @gmj2053 ?

anjackson commented 9 years ago

BTW, here's what I posted on the mailing list:


It looks like Ilya changed the Indexer at this point in time: https://github.com/iipc/openwayback/commit/b8315edb700e5d320ee053848d49993ff235c609

However, as far as I can tell, the 'S' field has never been populated correctly. I can find almost no references to populating it. The only one I can find during ARC or WARC parsing is this one:

https://github.com/iipc/openwayback/blob/600cd7d6545787bb4505285a059e89147e214874/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/ARCRecordToSearchResultAdapter.java#L98

Do you need this field? Do you want a different CDX format?


Note that the system appears to support this for ARC files, but not for WARCs.

ldko commented 9 years ago

Though, in the code you linked, it looks like it is only set for dns records. I ran the cdx-indexer on a compressed ARC file and the S field value is written as '0' for the one dns record and '-' for all the others.

gmj2053 commented 9 years ago

We had a meeting today (ITS/Webarchiving) and discussed the need for/usefulness of this field in making our content available. Right now, there really isn't a thought on what could be done with the field; it was more of an exploration item for better understanding the web archives. So, we are ok not using this field, but the corollary is that if there aren't any values generated for that field, perhaps it should be suppressed in the out-of-the-box indexer? I've not come across much on configuring the (open)wayback cdx-indexer. Does anyone have a good resource that they can point us to? Thanks for all the help on this.

ldko commented 9 years ago

I didn't find much on cdx-indexer configuration other than what is in the code.

The USAGE as written via IndexWorker.java also seems somewhat out of date:

USAGE:

cdx-indexer [-format FORMAT|-identity] FILE
cdx-indexer [-format FORMAT|-identity] FILE CDXFILE

Create a CDX format index from ARC or WARC file
FILE at CDXFILE or to STDOUT.
With -identity, perform no url canonicalization.
With -format, output CDX in format FORMAT.

which is missing -new-canon-classic and -new-canon-surt flags and could say something of defaults. If it is desired, I can open an issue and update the usage given.

@gmj2053 If you want to generate your CDX files SURT formatted without the S field, you can do so by supplying a different format (here excluding the S): bin/cdx-indexer -new-canon-surt -format ' CDX N b a m s k r M V g' <ARCHIVE-FILE> <CDX-FILE>
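To illustrate how the format string determines which columns exist, here is a minimal sketch that reads the field letters from a CDX header and looks up one field in a data line. It assumes a space delimiter, the class and method names are hypothetical, and the sample line in main is made up rather than real indexer output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; not an OpenWayback class. */
final class CdxFormatSketch {
    /** Returns the field letters of a header such as " CDX N b a m s k r M S V g". */
    static List<String> fieldLetters(String header) {
        String[] tokens = header.trim().split(" ");
        if (tokens.length < 2 || !tokens[0].equals("CDX")) {
            throw new IllegalArgumentException("Not a CDX header: " + header);
        }
        return Arrays.asList(tokens).subList(1, tokens.length);
    }

    /** Looks up the value for one field letter; empty if the format lacks that letter. */
    static Optional<String> fieldValue(String header, String line, String letter) {
        int index = fieldLetters(header).indexOf(letter);
        String[] values = line.split(" ");
        if (index < 0 || index >= values.length) {
            return Optional.empty();
        }
        return Optional.of(values[index]);
    }

    public static void main(String[] args) {
        String withS = " CDX N b a m s k r M S V g";
        String withoutS = " CDX N b a m s k r M V g";
        // Made-up example line in the 11-field layout.
        String line = "org,example)/ 20150101000000 http://example.org/ text/html 200 ABCDEFG - - 1234 5678 example.warc.gz";
        System.out.println(fieldValue(withS, line, "S"));
        System.out.println(fieldLetters(withoutS).contains("S"));
    }
}
```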

kris-sigur commented 9 years ago

It sounds like this feature was never implemented (at least in the Java-based Wayback; perhaps the old, proprietary Wayback used it). Moreover, looking at the CDX legend specification on IA's website, I can't find S. So it seems like it has been deprecated.

I suggest we eliminate this field from CDX generation. For 2.3.0 any CDX generation that includes it would trigger a warning. And in 3.0.0 it would simply be removed.

OpenWayback should still read CDXs that contain this field, ignoring its content as is currently the case.

Withdrawn, see next comment from Ilya.

ikreymer commented 8 years ago

That is not at all correct. IA uses the cdx generator in webarchive-commons, which has been standardized for the 11-field format CDX N b a m s k r M S V g: https://github.com/iipc/webarchive-commons/blob/master/src/main/java/org/archive/extract/RealCDXExtractorOutput.java The CDX legend spec is very old and lacking the S field.

The compressed length field is useful because it sets the size of the record that must be fetched. This can be important when reading from a remote source, for example over HTTP, as it means wayback can make a bounded range request rather than an unbounded one to fetch the WARC/ARC record. (Less important if the files are on local disk). Still, knowing the length as well as the offset of the record can be very useful.
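As a rough sketch of the difference, the helper below builds the HTTP Range header value for both cases. The class and method names are hypothetical and this is not how OpenWayback itself issues requests; the only fact relied on is that HTTP byte ranges are inclusive at both ends.

```java
/** Hypothetical helper for illustration only; not an OpenWayback class. */
final class RangeRequestSketch {
    /** Range covering exactly one record, given its offset (V) and compressed length (S). */
    static String boundedRange(long offset, long compressedLength) {
        if (offset < 0 || compressedLength <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        // The last byte position is inclusive, hence the -1.
        return "bytes=" + offset + "-" + (offset + compressedLength - 1);
    }

    /** Range from the record's offset to the end of the file, used when S is unknown. */
    static String unboundedRange(long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be >= 0");
        }
        return "bytes=" + offset + "-";
    }

    public static void main(String[] args) {
        System.out.println(boundedRange(1024L, 512L));
        System.out.println(unboundedRange(1024L));
    }
}
```

With only the offset, the client has to start reading and stop once the record has been consumed, which may pull far more data than needed from a remote store.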

kngenie commented 8 years ago

Just FYI, Internet Archive no longer uses the CDX generator in webarchive-commons. We're using CDX-Writer instead. It writes out the S field. As Ilya said, the S field is a relatively recent addition, and the specification document has never been updated to incorporate it.

anjackson commented 8 years ago

I've put some rough notes on all this here: http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/ (generated from https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/cdx-format/cdx-2015/index.md).

johnerikhalse commented 8 years ago

Good work Andy.

This should probably be followed up by a revision of the CDX-format:

Anybody want to volunteer to make a draft?