Open ldko opened 9 years ago
We would like the field if at all possible. Thanks
Isn't the entire record (including the WARC header) compressed as a single entity? If so, compressed record length is meaningless unless you are defining a record as including the WARC header. Is that what you want, @gmj2053?
BTW, here's what I posted on the mailing list:
It looks like Ilya changed the Indexer at this point in time: https://github.com/iipc/openwayback/commit/b8315edb700e5d320ee053848d49993ff235c609
However, as far as I can tell, the 'S' field has never been populated correctly. I can find almost no references to populating it. The only one I can find during ARC or WARC parsing is this one:
Do you need this field? Do you want a different CDX format?
Note that the system appears to support this for ARC files, but not for WARCs.
Though, in the code you linked, it looks like it is only set for dns records. I ran the cdx-indexer on a compressed ARC file and the S field value is written as '0' for the one dns record and '-' for all the others.
We had a meeting today (ITS/Webarchiving) and discussed the need for and usefulness of this field in making our content available. Right now there isn't a concrete idea of what could be done with the field; it was more of an exploration item for better understanding the web archives. So we are OK not using this field, but the corollary is that if no values are generated for that field, perhaps it should be suppressed in the out-of-the-box indexer? I haven't come across much on configuring the (open)wayback cdx-indexer; does anyone have a good resource they can point us to? Thanks for all the help on this.
I didn't find much on cdx-indexer configuration other than what is in the code.
The USAGE as written via IndexWorker.java also seems somewhat out of date:
USAGE:
cdx-indexer [-format FORMAT|-identity] FILE
cdx-indexer [-format FORMAT|-identity] FILE CDXFILE
Create a CDX format index from ARC or WARC file
FILE at CDXFILE or to STDOUT.
With -identity, perform no url canonicalization.
With -format, output CDX in format FORMAT.
which is missing the -new-canon-classic and -new-canon-surt flags and could say something about the defaults. If desired, I can open an issue and update the usage text.
@gmj2053 If you want to generate your CDX files SURT formatted without the S field, you can do so by supplying a different format (here excluding the S):
bin/cdx-indexer -new-canon-surt -format ' CDX N b a m s k r M V g' <ARCHIVE-FILE> <CDX-FILE>
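To make the effect of the format string concrete, here is a minimal sketch of how a consumer could map the field letters in a CDX header to the values on a data line. It assumes space-delimited fields; the class name, method name, and sample line are hypothetical and are not part of OpenWayback.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not an OpenWayback class.
final class CdxLineSketch {

    /** Maps each field letter in a CDX header to the matching value in a data line. */
    static Map<String, String> parse(String header, String line) {
        // A header looks like " CDX N b a m s k r M V g": the "CDX" marker, then field letters.
        String[] letters = header.trim().split(" ");
        String[] values = line.trim().split(" ");
        if (!letters[0].equals("CDX")) {
            throw new IllegalArgumentException("not a CDX header: " + header);
        }
        if (values.length != letters.length - 1) {
            throw new IllegalArgumentException(
                "expected " + (letters.length - 1) + " fields, got " + values.length);
        }
        // LinkedHashMap keeps the fields in header order.
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            fields.put(letters[i + 1], values[i]);
        }
        return fields;
    }
}
```

With the 10-field format above, a made-up line such as `org,example)/ 20150101000000 http://example.org/ text/html 200 ABCDEF - - 1234 example.warc.gz` yields an offset under `V` and a file name under `g`, with no `S` key at all.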
It sounds like this feature was never implemented (at least in the Java-based Wayback; perhaps the old, proprietary Wayback used this). Moreover, looking at the CDX legend specification on IA's website, I can't find S. So it seems like it has been deprecated.
I suggest we eliminate this field from CDX generation. For 2.3.0 any CDX generation that includes it would trigger a warning. And in 3.0.0 it would simply be removed.
OpenWayback should still read CDXs that contain this field, ignoring its content as is currently the case.
Withdrawn, see next comment from Ilya.
That is not at all correct. IA uses the cdx generator in webarchive-commons, which has been standardized for the 11-field format CDX N b a m s k r M S V g:
https://github.com/iipc/webarchive-commons/blob/master/src/main/java/org/archive/extract/RealCDXExtractorOutput.java
The CDX legend spec is very old and lacking the S field.
The compressed length field is useful because it sets the size of the record that must be fetched. This can be important when reading from a remote source, for example over HTTP, as it means wayback can make a bounded range request rather than an unbounded one to fetch the WARC/ARC record. (It is less important if the files are on local disk.) Still, knowing the length as well as the offset of the record can be very useful.
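As a rough illustration of the bounded versus unbounded request, the sketch below builds an HTTP Range header value from a record's offset (V) and compressed length (S). The class and method names are hypothetical, and this is not how OpenWayback itself issues requests. HTTP byte ranges are inclusive at both ends, hence the minus one.

```java
// Hypothetical helper for illustration only; not an OpenWayback class.
final class RecordRange {

    /**
     * Returns a Range header value for one record.
     * If the compressed length is unknown (zero or negative, e.g. the -1
     * used for an unset S field), falls back to an open-ended range.
     */
    static String rangeHeader(long offset, long compressedLength) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be non-negative");
        }
        if (compressedLength <= 0) {
            // Unbounded: the server streams from the offset to end of file.
            return "bytes=" + offset + "-";
        }
        // Bounded: last byte position is inclusive.
        return "bytes=" + offset + "-" + (offset + compressedLength - 1);
    }
}
```

For a record at offset 1234 with compressed length 500, this gives `bytes=1234-1733`; without a length, only `bytes=1234-` is possible and the client has to stop reading on its own.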
Just FYI, Internet Archive no longer uses the cdx generator in webarchive-commons. We're using CDX-Writer instead. It writes out the S field.
As Ilya said, the S field is a relatively recent addition, and the specification document has never been updated to incorporate it.
I've put some rough notes on all this here: http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/ (generated from https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/cdx-format/cdx-2015/index.md).
Good work Andy.
This should probably be followed up by a revision of the CDX format, covering:
- 'c' old style checksum and 'k' new style checksum
- the fl, filter, and collapse parameters, e.g. urlkey is the long name for 'N'
Anybody want to volunteer to make a draft?
Gina at LOC is trying to generate CDX files of the format ' CDX N b a m s k r M S V g'. Currently the S field for compressed length is not being set, rather it gets the default value of '-'.
As far as I can follow what happens when running the cdx-indexer to get the S field: https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResult.java#L288 is getting null, so it is setting the compressed length to -1.
https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/format/CompressedLengthCDXField.java#L40 is getting -1 each time and thus is writing the DEFAULT_VALUE of "-".
In genericResult of WARCRecordToSearchResultAdapter.java there could be a
result.setCompressedLength(compressed_record_length);
to fix the issue. I am not sure of the best way to get compressed_record_length here.
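One possible way to obtain such a value, offered only as a sketch: if each record is compressed as its own gzip member and records sit back to back in the file, a record's compressed length is the distance from its offset to the next record's offset (or to end of file for the last record). The class and method names below are hypothetical, the offsets are assumed to be sorted in ascending order, and nothing here reflects how WARCRecordToSearchResultAdapter is actually structured. The second method mirrors the behavior described above, where an unset length of -1 is written as "-".

```java
// Hypothetical helper for illustration only; not an OpenWayback class.
final class CompressedLengthSketch {

    /**
     * Derives per-record compressed lengths from record start offsets.
     * Assumes offsets are ascending and records are contiguous in the file.
     */
    static long[] lengthsFromOffsets(long[] offsets, long fileSize) {
        long[] lengths = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            // The record ends where the next one starts, or at end of file.
            long end = (i + 1 < offsets.length) ? offsets[i + 1] : fileSize;
            lengths[i] = end - offsets[i];
        }
        return lengths;
    }

    /** Renders the S field: "-" when the length is unset (negative). */
    static String toCdxField(long compressedLength) {
        return compressedLength < 0 ? "-" : Long.toString(compressedLength);
    }
}
```

The drawback of this approach is that a length is only known once the next record's offset has been seen, so an indexer would need to buffer one result before writing it out.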