lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Memory Issues on Large WARC Files #254

Open ianmilligan1 opened 7 years ago

ianmilligan1 commented 7 years ago

I've been tinkering around with @dportabella's #246 issue, as we also have some very large WARCs in a collection (e.g. some of 7 GB, others of 40, 50, or 60 GB). We do run into Java heap space issues with large WARC files.

Most of our development has focused on standard-size Archive-It files, i.e. ~1 GB, but it looks like there are lots of larger ones out there.

Is there any tweak we can make to loadArchives to better parse large WARC files?