lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Ingestion bug in copyStream, wrong number of bytes expected #62

Open lintool opened 10 years ago

lintool commented 10 years ago
14/08/10 09:13:18 ERROR ingest.IngestFiles: Error ingesting file: /scratch0/webarchive/congress108/arc.sample/CONGRESS01-20040124072939-193.arc.gz
java.io.IOException: Read 394 but expected 439
        at org.warcbase.ingest.IngestFiles.copyStream(IngestFiles.java:63)
        at org.warcbase.ingest.IngestFiles.ingestArcFile(IngestFiles.java:102)
        at org.warcbase.ingest.IngestFiles.ingestFolder(IngestFiles.java:163)
        at org.warcbase.ingest.IngestFiles.main(IngestFiles.java:220)

lintool commented 10 years ago

The current fix is to catch the exception and move on: https://github.com/lintool/warcbase/commit/a00e413edff46d4655fe621b65d1af89ffda33c4

It may be worth looking into what is actually going on in more detail later.
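
For reference, here is a minimal sketch of the catch-and-continue idea. It is not the code from the linked commit: the class `IngestSketch`, the `Record` type, and `ingestAll` are hypothetical names made up for illustration, and this `copyStream` is only an assumed stand-in that reproduces the same error message when a record's payload is shorter than its declared length. The root cause of the mismatch in the real ARC file is still unknown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IngestSketch {
  // Hypothetical record: a name, the length declared in its header, and the bytes actually present.
  static final class Record {
    final String name;
    final int declaredLength;
    final byte[] payload;

    Record(String name, int declaredLength, byte[] payload) {
      this.name = name;
      this.declaredLength = declaredLength;
      this.payload = payload;
    }
  }

  // Assumed stand-in for copyStream: copies exactly `expected` bytes.
  // It loops because a single read() may return fewer bytes than requested.
  static void copyStream(InputStream in, long expected, OutputStream out) throws IOException {
    byte[] buffer = new byte[4096];
    long total = 0;
    while (total < expected) {
      int want = (int) Math.min(buffer.length, expected - total);
      int n = in.read(buffer, 0, want);
      if (n == -1) {
        break; // stream ended before the declared length was reached
      }
      out.write(buffer, 0, n);
      total += n;
    }
    if (total != expected) {
      throw new IOException("Read " + total + " but expected " + expected);
    }
  }

  // Catch-and-continue: a failing record is logged and skipped, the rest are still processed.
  // Returns the names of the records that failed.
  static List<String> ingestAll(List<Record> records) {
    List<String> failed = new ArrayList<>();
    for (Record r : records) {
      try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyStream(new ByteArrayInputStream(r.payload), r.declaredLength, out);
      } catch (IOException e) {
        System.err.println("Error ingesting " + r.name + ": " + e.getMessage());
        failed.add(r.name);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    List<Record> records = Arrays.asList(
        new Record("ok", 5, new byte[5]),
        new Record("short", 439, new byte[394]), // mirrors the numbers in the log above
        new Record("also-ok", 3, new byte[3]));
    System.out.println("Failed records: " + ingestAll(records));
  }
}
```

Returning the list of failures, rather than only logging them, would make it easier to revisit the problematic files later and work out why the byte counts disagree.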