Closed: cjer closed this issue 6 years ago
I'm dealing with a WARC archive that was transformed from ARC files. When I ran the Archives Unleashed Toolkit on it, I ran into issues with unparseable dates, as described here: https://github.com/archivesunleashed/aut/issues/163.
I have fixed the dates, converting them from their YYYYmmddHHMM format to proper ISO-8601. But now I'm getting
unexpected extra data after record org.archive.io.warc.WARCRecord
errors for all those files. Any suggestions on what else needs to be fixed?
I know you can't share the resultant files, so this is going to be tricky.
One option is to run them through JWAT-Tools or warcio. In particular, JWAT is somewhat stricter than the webarchive-commons parser, and may produce more debugging information.
My guess would be that a few characters have somehow got lost from the WARC records, and that causes the WARC parser to overshoot the start of the next record. (I'm assuming you are not using block-compressed WARC.GZ files because you'd have trouble editing the dates.)
Is there any chance that, in editing the records, the line endings of the WARC files have been modified?
(for my reference, the code that raises this warning is here)
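For reference, a hedged sketch of how one might run both validators from the command line (myfile.warc.gz is a placeholder; exact flags can vary by version):

# JWAT-Tools: validate the file and print errors (this produces the kind of output shown below)
jwattools test -e myfile.warc.gz

# warcio: indexing will fail loudly on records it cannot parse
warcio index myfile.warc.gz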
OK, I think you've found the right direction with the block-compression thing. To fix the dates, all I did was this:
for f in $(cat bad_date_files.txt);
do
    echo "$f"
    gunzip < "$f" | sed -e '/WARC-Date: \([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)/ s//WARC-Date: \1-\2-\3T\4:\5:00Z/g' | gzip -c > "fixed$f"
done
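(As a quick sanity check of that sed expression against a single, made-up header line:

echo 'WARC-Date: 201801151230' | sed -e '/WARC-Date: \([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)/ s//WARC-Date: \1-\2-\3T\4:\5:00Z/g'
# prints: WARC-Date: 2018-01-15T12:30:00Z
)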
I then tested the output with diff and saw that all is good: nothing was changed in the .warc besides the dates.
But now that I look at a diff of the .warc.gz files, I see they are completely different. Notably, the fixed file is considerably smaller, which is probably due to different compression. How was I supposed to compress it?
JWAT output before fix:
Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.
Output Thread stopped.
#
# Job summary
#
GZip files: 0
+ Arc: 0
+ Warc: 1
Arc files: 0
Warc files: 0
Errors: 4
Warnings: 0
RuntimeErr: 0
Skipped: 0
Time: 00:00:08 (8632 ms.)
TotalBytes: 101.9 mb
AvgBytes: 12.7 mb/s
INVALID_EXPECTED: 2
REQUIRED_INVALID: 2
'WARC-Date' header: 2
'WARC-Date' value: 2
JWAT output after fix:
Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
Queued: 1 - Processed: 1 - 7.1 mb/s - Estimated: --:--:-- (100.00%).
ThreadPool shut down.
Output Thread stopped.
#
# Job summary
#
GZip files: 0
+ Arc: 0
+ Warc: 1
Arc files: 0
Warc files: 0
Errors: 7921
Warnings: 0
RuntimeErr: 0
Skipped: 0
Time: 00:00:15 (15860 ms.)
TotalBytes: 99.5 mb
AvgBytes: 6.6 mb/s
INVALID: 7920
UNDESIRED_DATA: 1
Data before WARC version: 3960
Empty lines before WARC version: 3960
Trailing data: 1
The WARC.GZ format uses multiple concatenated GZip blocks (called 'members' in the spec) so individual records can be recovered quickly. But most tools (e.g. plain gzip) will just compress a file into one member (and transparently concatenate multiple members when uncompressing), as the small demo below shows.
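(A minimal demonstration of the member behaviour, with made-up content, assuming standard gzip/gunzip:

printf 'first record\n' | gzip > members.gz
printf 'second record\n' | gzip >> members.gz
# gunzip transparently concatenates both members on decompression:
gunzip -c members.gz
)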
To make a gzip-member-per-WARC-record file you need a tool that knows about the format. Both warcio recompress and, IIRC, jwattools compress can be used to correctly create a WARC.GZ from a WARC, as sketched below.
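A hedged usage sketch, with placeholder file names (the exact subcommand syntax may differ between tool versions):

# warcio: rewrite a .warc (or a single-member .warc.gz) as a properly
# record-compressed .warc.gz, one gzip member per record
warcio recompress fixedfile.warc fixedfile.warc.gz

# JWAT-Tools equivalent
jwattools compress fixedfile.warc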
That makes a lot of sense. Thanks Andy! jwattools compress did it 👍
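For completeness, re-running the earlier validation on the recompressed file should now come back clean (a hedged expectation, using the same placeholder name as above):

# should now report Errors: 0
jwattools test -e fixedfile.warc.gz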