helgeho / ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
MIT License

saveAsWarc with same warcPaths as the source #12

Closed dportabella closed 6 years ago

dportabella commented 6 years ago

This code produces a .cdx and .warc file.

```scala
ArchiveSpark.load(sc, WarcCdxHdfsSpec(cdxPath = "/data/example.cdx.gz", warcPath = "/data"))
  .filter(r => r.surtUrl.startsWith("com,example"))
  .saveAsWarc("/tmp/filtered_warc", WarcMeta(), generateCdx = true)
```

If the CDX file points to:

./example_warc1.warc.gz
./example_warc2.warc.gz
./example_warc3.warc.gz

would it be possible to modify the saveAsWarc function to generate these 3 archives (with the filtered records)?

/tmp/filtered_warc/example_warc1.warc
/tmp/filtered_warc/example_warc2.warc
/tmp/filtered_warc/example_warc3.warc

/tmp/filtered_warc/example_warc1.warc would contain the filtered records from ./example_warc1.warc.gz, and so on.
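The requested behaviour amounts to bucketing the filtered records by the WARC file their CDX entry points to, and writing each bucket to its own output file. A minimal sketch of that grouping step, using plain Scala collections as a stand-in for the RDD (the field name `sourceFile` is hypothetical; ArchiveSpark's actual record API may expose the originating WARC differently):

```scala
// Stand-in for an ArchiveSpark record: the SURT URL plus the WARC file
// the CDX entry points to (field name is an assumption for illustration).
case class Record(surtUrl: String, sourceFile: String)

val records = Seq(
  Record("com,example)/a", "./example_warc1.warc.gz"),
  Record("com,example)/b", "./example_warc2.warc.gz"),
  Record("org,other)/c",   "./example_warc1.warc.gz")
)

// Keep only the filtered records, then bucket them by source WARC file;
// each bucket would then be written to its own output WARC.
val filtered = records.filter(_.surtUrl.startsWith("com,example"))
val bySource: Map[String, Seq[Record]] = filtered.groupBy(_.sourceFile)
```

On an actual RDD the same idea would need a partitioner or a per-key write, since `groupBy` across a cluster shuffles data.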

helgeho commented 6 years ago

This depends on the CDX data: if you have one CDX file per WARC file and you do not shuffle the data around, it will work exactly like this. (I recommend adding a .gz extension to the destination directory; that makes sure the output is compressed in WARC.gz format.)

dportabella commented 6 years ago

great, thx!

dportabella commented 6 years ago

I see, so there will be a correspondence between input and output files. However, saveAsWarc names the output files as WarcMeta.name plus an index (WarcMeta.filename = WarcMeta.name + index), so the input names are lost. I am thinking of CommonCrawl archives, with input names such as:

crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00799.warc.gz
crawl-data/CC-MAIN-2018-05/segments/1516084886397.2/warc/CC-MAIN-20180116090056-20180116110056-00000.warc.gz

Would it be possible to preserve file and path names?
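Preserving the names would mean deriving each output path from the source WARC path rather than from WarcMeta.name plus an index. A sketch of that derivation as a pure string helper (`preservedOutputPath` is hypothetical; wiring it into saveAsWarc would require a change to ArchiveSpark itself):

```scala
// Hypothetical helper: build the output path for a filtered WARC by
// keeping the source file's full relative path under the output directory,
// instead of the WarcMeta.name + index naming scheme.
def preservedOutputPath(outputDir: String, sourceWarc: String): String =
  outputDir.stripSuffix("/") + "/" + sourceWarc.stripPrefix("./")

val out = preservedOutputPath(
  "/tmp/filtered_warc",
  "crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00799.warc.gz")
// → "/tmp/filtered_warc/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00799.warc.gz"
```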