Closed dportabella closed 6 years ago
This depends on the CDX data. If you have one CDX file per WARC file and you do not shuffle the data around, it will be exactly like this. (I recommend adding a .gz extension to the destination directory; that makes sure the output is compressed in WARC.gz format.)
great, thx!
I see, there will be a correspondence between input and output files. However, it will save the WARC files with WarcMeta.filename = WarcMeta.name + index, and so the input name will be lost. I am thinking about CommonCrawl archives, with input names such as:
crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00799.warc.gz
crawl-data/CC-MAIN-2018-05/segments/1516084886397.2/warc/CC-MAIN-20180116090056-20180116110056-00000.warc.gz
Would it be possible to preserve file and path names?
This code produces a .cdx and .warc file.
If the CDX file points to:
would it be possible to modify the saveAsWarc function to generate these 3 archives (with the filtered records)?
/tmp/filtered_warc/example_warc1.warc would contain the filtered records from ./example_warc1.warc.gz, and so on.
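To make the requested naming concrete, here is a minimal sketch of the input-to-output path mapping I have in mind. It is plain Python for illustration only, not part of the library: the function name filtered_output_path and its compress parameter are hypothetical, and it assumes POSIX-style paths like the ones above.

```python
from pathlib import PurePosixPath


def filtered_output_path(input_path: str, dest_dir: str, compress: bool = False) -> str:
    """Map an input WARC path to a path under dest_dir, keeping the
    relative directory structure and the original base name.

    Hypothetical helper for illustration; not an existing library function.
    """
    src = PurePosixPath(input_path)
    # Strip a leading "/" so absolute inputs still land under dest_dir.
    relative = src.relative_to(src.anchor) if src.is_absolute() else src
    name = relative.name
    # Drop the ".gz" suffix, then re-add it only if compression is requested.
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    if compress:
        name += ".gz"
    return str(PurePosixPath(dest_dir) / relative.parent / name)


print(filtered_output_path("./example_warc1.warc.gz", "/tmp/filtered_warc"))
# /tmp/filtered_warc/example_warc1.warc
```

With compress=True, a CommonCrawl input such as crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00799.warc.gz would keep its full relative path and name under the destination directory.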