lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/
161 stars, 47 forks

Error handling for broken ARC/WARC files #234

Closed: ianmilligan1 closed this issue 8 years ago

ianmilligan1 commented 8 years ago

In many collections, we end up with java.lang.NegativeArraySizeException errors. This is probably because warcbase is expecting X bytes in a given ARC or WARC file but then encounters Y, throwing it off ad nauseam.

We should build more robust error handling, perhaps just skipping the broken ARC/WARC and letting us know what file was skipped...
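A minimal sketch of the "skip the broken file and report it" idea, in plain Java. This is not warcbase code: the class `SkipBrokenArchives`, the method `processAll`, and the `readFile` callback are hypothetical names used only for illustration, and the sketch assumes the failure surfaces as an unchecked exception such as NegativeArraySizeException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class SkipBrokenArchives {
    /** Outcome of a run: records read so far, plus the files that were skipped. */
    public static final class Result {
        public final long records;
        public final List<String> skipped;

        Result(long records, List<String> skipped) {
            this.records = records;
            this.skipped = skipped;
        }
    }

    /**
     * Runs readFile over every path. readFile is a stand-in (assumption) for
     * whatever parses one ARC/WARC file and returns its record count.
     */
    public static Result processAll(List<String> paths, ToLongFunction<String> readFile) {
        long total = 0;
        List<String> skipped = new ArrayList<>();
        for (String path : paths) {
            try {
                total += readFile.applyAsLong(path);
            } catch (RuntimeException e) {
                // NegativeArraySizeException is a RuntimeException, so it lands here.
                // Report the file and carry on instead of failing the whole job.
                System.err.println("Skipping broken archive " + path + ": " + e);
                skipped.add(path);
            }
        }
        return new Result(total, skipped);
    }
}
```

Catching per file keeps the rest of the collection usable, and the `skipped` list tells us afterwards which files need a closer look.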

jrwiebe commented 8 years ago

Is this in reference to the errors described in #222? If so, @anjackson's comment offers a way forward.

anjackson commented 8 years ago

You'll probably hit other oddities that mean you need to be able to skip records -- we certainly have!

ianmilligan1 commented 8 years ago

Yes and yes, I think.

lintool commented 8 years ago

I tracked down the issue yesterday - for some ARC records, the body content just isn't present for whatever reason (crawler glitch?). The headers say there are going to be n bytes, but the content doesn't appear, so the parser freaks out. I'm catching the exception so this doesn't croak the entire job.
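For illustration, a rough sketch of guarding a record-body read against a header whose declared length doesn't match what is actually in the stream. This is not the actual fix in warcbase: `RecordBodyReader`, `readBody`, and the `MAX_BODY_BYTES` cap are hypothetical, and the sketch assumes the declared length has already been parsed from the record header.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class RecordBodyReader {
    // Hypothetical upper bound so a corrupt header cannot trigger a huge allocation.
    static final long MAX_BODY_BYTES = 256L * 1024 * 1024;

    /**
     * Reads declaredLength bytes of body from in.
     * Returns null when the record should be skipped: the declared length is
     * negative or implausibly large, or the stream ends before the body does.
     */
    public static byte[] readBody(InputStream in, long declaredLength) throws IOException {
        if (declaredLength < 0 || declaredLength > MAX_BODY_BYTES) {
            // new byte[negative] is what throws NegativeArraySizeException.
            return null;
        }
        byte[] body = new byte[(int) declaredLength];
        try {
            // readFully throws EOFException if fewer bytes are available than requested.
            new DataInputStream(in).readFully(body);
        } catch (EOFException e) {
            return null; // header promised n bytes, but the content isn't there
        }
        return body;
    }
}
```

Returning null lets the caller skip just that record and log it, rather than letting one bad record take down the job.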

ianmilligan1 commented 8 years ago

We're still running into the problem on a collection of WARC files collected by University of Alberta.

Gist error dump can be found here.

Maybe worth pushing the ARC changes into the WARC handler too?

lintool commented 8 years ago

Ah yes - I would expect the same issue w/ WARCs as well. I'm in the middle of doing the sub-artifact conversion, which involves a lot of code movement. Let me work on this after everything has stabilized?

ianmilligan1 commented 8 years ago

Sounds good!

lintool commented 8 years ago

@ianmilligan1 Can you somehow give me access to the collection of WARCs that are breaking? I.e., either scp to trantor (stage on camalon) or copy to rho? Or give me access to whatever machine it's on. Will make it much easier to debug. I have a pretty good idea what's causing the issue, but I need to reproduce the error to decide on the best fix.

ianmilligan1 commented 8 years ago

Doing so – transfer is a bit slow but should be done by morning, I think.

lintool commented 8 years ago

Fixed. commit 16db93460896da80dbc4c07f27b0294bb618e291