iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

Fixing bad dates in WARC file #80

Closed by cjer 6 years ago

cjer commented 6 years ago

I'm dealing with a WARC archive that was transformed from ARC files. When I was running Archives Unleashed Toolkit on it I ran into issues with unparseable dates. It is described here https://github.com/archivesunleashed/aut/issues/163.

I have fixed the dates, converting them from their YYYYmmddHHMM format to proper ISO-8601. But now I'm getting "unexpected extra data after record org.archive.io.warc.WARCRecord" errors for all those files. Any suggestions on what else needs to be fixed?
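For illustration, the date conversion described above could be sketched in Python as follows. This is a minimal sketch, not the fix that was actually applied: the function name `fix_warc_date` is hypothetical, and it assumes the bad values are exactly 12 digits (YYYYmmddHHMM, no seconds) and that it is handed only the header block of a record, not the payload.

```python
import re
from datetime import datetime

# Assumption: a bad value is exactly 12 digits (YYYYmmddHHMM) filling the whole line.
BAD_DATE = re.compile(rb"^(WARC-Date: )(\d{12})(\r?\n)", re.MULTILINE)


def fix_warc_date(header_block: bytes) -> bytes:
    """Rewrite 12-digit WARC-Date values as ISO-8601 UTC timestamps."""

    def repl(match):
        # strptime validates the digits; an impossible date raises ValueError.
        parsed = datetime.strptime(match.group(2).decode("ascii"), "%Y%m%d%H%M")
        iso = parsed.strftime("%Y-%m-%dT%H:%M:%SZ").encode("ascii")
        # Keep the original line ending untouched.
        return match.group(1) + iso + match.group(3)

    return BAD_DATE.sub(repl, header_block)
```

Anchoring the pattern to a full header line keeps already-valid dates and longer digit runs unchanged.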

anjackson commented 6 years ago

I know you can't share the resultant files, so this is going to be tricky.

One option is to run them through JWATTools or warcio. In particular, JWAT is somewhat stricter than the webarchive-commons parser, and may produce more debugging information.

My guess would be that a few characters have somehow got lost from the WARC records, and that causes the WARC parser to overshoot the start of the next record. (I'm assuming you are not using block-compressed WARC.GZ files because you'd have trouble editing the dates.)

Is there any chance that, in editing the records, the line endings of the WARC files have been modified?
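As a rough way to check the line-ending question, the sketch below looks for a bare LF (one not preceded by CR) in a record's header block, since WARC header lines are terminated by CRLF. The helper name `has_bare_lf` is hypothetical, and it should only be applied to header bytes, because payloads can legitimately contain bare LFs.

```python
import re

# A line feed that is not preceded by a carriage return.
BARE_LF = re.compile(rb"(?<!\r)\n")


def has_bare_lf(header_block: bytes) -> bool:
    """Return True if the header block contains an LF without a preceding CR."""
    return BARE_LF.search(header_block) is not None
```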

(for my reference, the code that raises this warning is here)

cjer commented 6 years ago

OK, I think you found the direction with the block compression thing. To fix the dates, all I did was this:

# Rewrite 12-digit WARC-Date values (YYYYmmddHHMM) as ISO-8601 in each listed file.
while IFS= read -r f; do
    echo "$f"
    gunzip < "$f" | sed -e '/WARC-Date: \([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)/ s//WARC-Date: \1-\2-\3T\4:\5:00Z/g' | gzip -c > "fixed$f"
done < bad_date_files.txt

Which I then tested with diff and saw all is good, and nothing was changed in the .warc besides the dates.

But, now that I look at the .warc.gz files diff, I see they are completely different. Namely, the fixed file is considerably smaller, which is probably due to a different compression. How was I supposed to compress it?

cjer commented 6 years ago

JWAT output before fix:

Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.
Output Thread stopped.
#
# Job summary
#
GZip files: 0
  +  Arc: 0
  + Warc: 1
 Arc files: 0
Warc files: 0
    Errors: 4
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
      Time: 00:00:08 (8632 ms.)
TotalBytes: 101.9 mb
  AvgBytes: 12.7 mb/s
INVALID_EXPECTED: 2
REQUIRED_INVALID: 2
'WARC-Date' header: 2
'WARC-Date' value: 2

JWAT output after fix:

Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).

Queued: 1 - Processed: 1 - 7.1 mb/s - Estimated: --:--:-- (100.00%).

ThreadPool shut down.
Output Thread stopped.
#
# Job summary
#
GZip files: 0
  +  Arc: 0
  + Warc: 1
 Arc files: 0
Warc files: 0
    Errors: 7921
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
      Time: 00:00:15 (15860 ms.)
TotalBytes: 99.5 mb
  AvgBytes: 6.6 mb/s
INVALID: 7920
UNDESIRED_DATA: 1
Data before WARC version: 3960
Empty lines before WARC version: 3960
Trailing data: 1

anjackson commented 6 years ago

The WARC.GZ format uses multiple concatenated GZip blocks (called 'members' in the spec), one per record, so individual records can be recovered quickly. But most tools (e.g. plain gzip) will just compress a file into one chunk (and transparently concatenate multiple members when uncompressing).
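To see the difference in practice, here is a small sketch that counts the gzip members in a file's bytes using only the Python standard library. The function name `count_gzip_members` is hypothetical, and the sketch reads everything into memory, so it is meant for spot checks rather than large archives.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the concatenated gzip members in a byte string."""
    count = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(31)
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated gzip member")
        count += 1
        # Whatever follows the finished member is the start of the next one.
        data = decompressor.unused_data
    return count
```

A file produced by piping a whole WARC through gzip should report one member, while a record-per-member WARC.GZ should report as many members as it has records, even though both decompress to the same bytes.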

To make a gzip-file-member-per-WARC-record you need a tool that knows about the format. Both warcio recompress and IIRC jwattools compress can be used to correctly create a WARC.GZ from a WARC.
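The idea of one gzip member per record can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not how warcio or jwattools implement it: the function names are hypothetical, every record is assumed to carry a correct Content-Length header, and the separator after each record is normalised to CRLF CRLF.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield each raw WARC record (headers, block, trailing CRLF CRLF) as bytes."""
    while True:
        line = stream.readline()
        # Skip the blank lines that separate records.
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return  # end of input
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        header_lines = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("unexpected end of input inside WARC headers")
            header_lines.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header block
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("record block is shorter than Content-Length")
        yield b"".join(header_lines) + block + b"\r\n\r\n"


def recompress_per_record(src, dst) -> int:
    """Write every record from src to dst as its own gzip member; return the count."""
    count = 0
    for record in iter_warc_records(src):
        dst.write(gzip.compress(record))
        count += 1
    return count
```

Here `src` is any binary stream of uncompressed WARC data, for example the object returned by `gzip.open(path, "rb")`, and `dst` is a file opened with `open(path, "wb")`. For real archives, a purpose-built tool remains the safer choice.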

cjer commented 6 years ago

That makes a lot of sense. Thanks Andy!

cjer commented 6 years ago

jwattools compress did it 👍