ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Update CDX Indexing to handle PyWB-style indexes, OPTIONS/HEAD/POST with parameters. #34

Open anjackson opened 3 years ago

anjackson commented 3 years ago

To resolve some complex playback issues (Twitter, HuffPo) we need to be able to play back POST requests.

This requires some coordination with Ilya as he's been changing how he does it.

Once the indexing scheme is stable, we need to use a version of OutbackCDX that supports it, and re-index the CDX data (at least the last couple of years).
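For reference, a minimal sketch of the general idea behind PyWB-style indexing of non-GET requests: the method and any form-encoded body parameters are folded into the URL that gets indexed and looked up. The `__wb_method` parameter name follows the convention PyWB has used, but the function name and the exact handling here are illustrative assumptions, not the final scheme (which, as noted above, is still in flux).

```python
from urllib.parse import parse_qsl, urlencode


def post_to_lookup_url(url, method, body=b"",
                       content_type="application/x-www-form-urlencoded"):
    """Fold the request method and form parameters into the lookup URL.

    Hypothetical helper: a simplified illustration, not PyWB's own code.
    """
    method = method.upper()
    if method == "GET":
        # Plain GET requests are indexed under their URL as-is.
        return url
    params = [("__wb_method", method)]
    if body and content_type.startswith("application/x-www-form-urlencoded"):
        # Form-encoded bodies become ordinary query parameters.
        params += parse_qsl(body.decode("utf-8", errors="replace"),
                            keep_blank_values=True)
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)


if __name__ == "__main__":
    print(post_to_lookup_url("https://example.org/api", "POST", b"a=1&b=two"))
```

The same transformation has to be applied at index time and at lookup time, which is why both the indexer and OutbackCDX need to agree on the scheme.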


Updating the Java stack is quite involved: https://github.com/ukwa/webarchive-discovery/issues/244

Might be time to switch to Python for this MR Job: use the PyWB indexer to generate the CDX records and POST them to OutbackCDX.

Also need OutbackCDX 0.8.0 to handle the lookups properly.
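A rough sketch of the "POST them to OutbackCDX" step, using only the standard library. OutbackCDX accepts CDX lines POSTed to a collection URL; the endpoint `http://localhost:8080/myindex`, the function names and the timeout below are placeholders for illustration, and the CDX lines are assumed to come from whatever indexer is used upstream.

```python
import urllib.request


def build_cdx_post(endpoint, cdx_lines):
    """Build a POST request carrying one CDX record per line."""
    body = "".join(line.rstrip("\n") + "\n" for line in cdx_lines)
    return urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def post_cdx(endpoint, cdx_lines, timeout=60):
    """Send the records to the index and return the HTTP status code."""
    request = build_cdx_post(endpoint, cdx_lines)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder endpoint: needs a running OutbackCDX instance to succeed.
    with open("records.cdx", encoding="utf-8") as cdx_file:
        print(post_cdx("http://localhost:8080/myindex", cdx_file))
```

Batching the lines from each WARC into a single POST keeps the number of requests down when this runs inside a MapReduce task.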

Some other examples of similar code:

MrJob

Using mapper_raw means MrJob arranges for a copy of each WARC to be placed where we can get to it. This breaks data locality, but streaming through large files is not performant because they get read into memory. A FileInputFormat that could reliably split block GZip files would be the only workable fix, but TBH this is pretty fast as it is.
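To illustrate what "block GZip" means for splitting, here is a small standard-library sketch that walks a local file made of concatenated gzip members and reports where each member starts. It assumes the WARCs are written with one gzip member per record (the usual .warc.gz convention); the function name is made up for this example.

```python
import gzip
import io
import zlib


def iter_gzip_members(fileobj, chunk_size=64 * 1024):
    """Yield (offset, compressed_length, payload) for each gzip member."""
    offset = 0  # position in the compressed file
    buf = b""   # compressed bytes read but not yet consumed
    while True:
        if not buf:
            buf = fileobj.read(chunk_size)
            if not buf:
                return  # clean end of file
        decomp = zlib.decompressobj(wbits=31)  # 31 = expect gzip framing
        start = offset
        payload = bytearray()
        while not decomp.eof:
            if not buf:
                buf = fileobj.read(chunk_size)
                if not buf:
                    raise EOFError("truncated gzip member at offset %d" % start)
            payload += decomp.decompress(buf)
            # Bytes left in unused_data belong to the next member.
            offset += len(buf) - len(decomp.unused_data)
            buf = decomp.unused_data
        yield start, offset - start, bytes(payload)


if __name__ == "__main__":
    data = gzip.compress(b"record one") + gzip.compress(b"record two")
    for start, length, payload in iter_gzip_members(io.BytesIO(data)):
        print(start, length, payload)
```

A member boundary can only be found by decompressing from a known start, which is why an input format cannot simply split such a file at an arbitrary byte offset.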

Could also experiment with a WARC processor using https://pypi.org/project/boilerpy3/ and e.g. spaCy.

See https://github.com/ukwa/ukwa-hadoop-tasks/tree/master/warc_indexing

anjackson commented 2 years ago

On POST request handling, see https://github.com/webrecorder/replayweb.page/issues/69

anjackson commented 2 years ago

Going to have to defer this as it's still unclear what to do. PyWB-based indexing does work, but these specific issues remain unresolved.