iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0
49 stars 72 forks source link

Store origin-code in ARCRecord header #52

Closed jrwiebe closed 8 years ago

jrwiebe commented 8 years ago

It would be useful for our purposes if the origin-code from the version block of an ARC file were stored and made accessible by a method in ARCRecordMetaData. In my fork this method is called getOrigin().

anjackson commented 8 years ago

Looks good to me, although I don't know if the origin field is always present. Maybe @kris-sigur or @johnerikhalse have more experience of ARC files?

jrwiebe commented 8 years ago

According to the ARC format grammar origin-code is required, but I have no idea if all ARC-generating tools respect this: https://archive.org/web/researcher/ArcFileFormat.php

kris-sigur commented 8 years ago

Backwards compatibility is very important when dealing with ARCs. Historically, there have been all types of "mangled" ARCs as different tools have come and gone. I don't know if that applies to this origin field, but at a minimum the code should gracefully handle its absence so that we do not introduce a regression.

Or to put it another way, any ARC (even technically invalid according to the spec) that could be parsed before this change, should still parse after this change. Which is clearly not the case.

anjackson commented 8 years ago

Having said all that, I think I'd accept this change as long as it checked there was a third element in that list before get()ing it.

kris-sigur commented 8 years ago

@anjackson Yes. That was substantially what I meant. Perhaps also adding a note to the Javadoc for the ORIGIN_FIELD_KEY constant about how this field has been (and still is!) incorrectly hardcoded by this library so the value should be used with care.

jrwiebe commented 8 years ago

I was about to add a check as per @anjackson's suggestion and discovered this is actually already done by getTokenizedHeaderLine.

anjackson commented 8 years ago

Excellent. So, if you add a 1.1.7 section and a note about this change to the CHANGES.md then I think we're good to go.