lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Detect WARC or ARC format when loading Records #195

Closed (bitzl closed 8 years ago)

bitzl commented 8 years ago

Most of our harvested data is stored in WARC files, but some older data is still stored as ARC. When we try to run an analysis using

RecordLoader.loadWarc("/path/to/*/arcs", sc)

it will fail when it reaches the first ARC file:

[...]
Caused by: java.io.IOException: Failed to find WARC MAGIC: 1 1 InternetArchive
[...]

To support analysis of mixed data, please add a new method RecordLoader.loadArcOrWarc(source, sc) that behaves like RecordLoader.loadArc(source, sc) and RecordLoader.loadWarc(source, sc), but decides for each file whether to treat it as an ARC or a WARC file.

To distinguish ARC and WARC files one could use the file extensions (.arc and .arc.gz for ARC, .warc and .warc.gz for WARC).
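
A minimal sketch of the extension check described above, in plain Scala. The object name `ArchiveFormat` and its `detect` method are hypothetical and are not part of warcbase. The sketch only classifies a path by its file name and does not touch `RecordLoader`.

```scala
// Hypothetical helper (not part of warcbase): classify an archive file
// as ARC or WARC purely from its file extension.
object ArchiveFormat {
  sealed trait Format
  case object Arc extends Format
  case object Warc extends Format
  case object Unknown extends Format

  def detect(path: String): Format = {
    // Compare case-insensitively so that "FOO.WARC.GZ" is also recognized.
    val name = path.toLowerCase
    if (name.endsWith(".warc") || name.endsWith(".warc.gz")) Warc
    else if (name.endsWith(".arc") || name.endsWith(".arc.gz")) Arc
    else Unknown
  }
}
```

A loader could call `ArchiveFormat.detect` on each file name and dispatch to the ARC or WARC reader accordingly. Files with an unrecognized extension map to `Unknown`, so the caller can decide whether to skip them or fail.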

jrwiebe commented 8 years ago

We haven't forgotten about this, @bitzl. I hope to tackle it in the not-too-distant future.

bitzl commented 8 years ago

Thanks @jrwiebe :-)

jrwiebe commented 8 years ago

There you go!