Support multiple HDFS backends

anjackson commented 3 years ago

Allow multiple HDFS services, each identified by a different ID prefix in TrackDB, e.g.

    "id":"hdfs://hdfs:54310/1_data/npld/webrecorder/bl-your_stories/warcs/www.bl.uk-20150814093821.warc.gz",

We add a configuration map that converts the HDFS URL into the appropriate WebHDFS one:

    {
        "hdfs://hdfs:54310": "http://hdfs.api.wa.bl.uk/webhdfs/v1",
        "hdfs://h3nn:8020": "http://hdfs.h3.api.wa.bl.uk/webhdfs/v1"
    }

...as a JSON-encoded environment variable:

    WEBHDFS_MAP='{"hdfs://hdfs:54310": "http://hdfs.api.wa.bl.uk/webhdfs/v1", "hdfs://h3nn:8020": "http://hdfs.h3.api.wa.bl.uk/webhdfs/v1"}'
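A minimal sketch of how the server might read that variable at startup. The helper name `load_webhdfs_map` is hypothetical and not part of the existing code. Note that `json.loads` only accepts double-quoted strings, so the value has to be strict JSON.

```python
import json
import os


def load_webhdfs_map(env=os.environ):
    """Parse WEBHDFS_MAP into a dict of HDFS prefix -> WebHDFS base URL.

    Hypothetical helper: the name and the trailing-slash handling are
    assumptions made for illustration.
    """
    raw = env.get("WEBHDFS_MAP", "{}")
    mapping = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(mapping, dict):
        raise ValueError("WEBHDFS_MAP must be a JSON object")
    # Normalise trailing slashes so prefixes can be joined predictably later.
    return {key.rstrip("/"): value.rstrip("/") for key, value in mapping.items()}
```

Passing the environment in as an argument keeps the function easy to test without touching the real process environment.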

Currently, the code that looks up files in TrackDB only returns the path prefixed with hdfs:

https://github.com/ukwa/ukwa-warc-server/blob/d8f514cb53dfc369b8e8411d8a212fd362e7a1ee/warcserver/file_finder.py#L62-L77

So this should be changed to use the id rather than the path.

The code that works out where to get the WARC data only supports the one hard-coded hdfs: prefix:

https://github.com/ukwa/ukwa-warc-server/blob/d8f514cb53dfc369b8e8411d8a212fd362e7a1ee/warc_server.py#L44-L45

This will need to be changed to pass the ID. The from_webhdfs function can then check the ID URL against WEBHDFS_MAP, replace the hdfs://... prefix with the corresponding WebHDFS http://... prefix, and make the call. If the prefix isn't known, it can raise an error.
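The prefix lookup described above could be sketched roughly as follows. The function name `to_webhdfs_url` is hypothetical, and the map is passed in as a plain dict. This is not the actual `from_webhdfs` implementation, only an illustration of the substitution step.

```python
def to_webhdfs_url(file_id, webhdfs_map):
    """Swap a known hdfs://host:port prefix for its WebHDFS base URL.

    Hypothetical helper: raises ValueError when no configured prefix matches.
    """
    for hdfs_prefix, webhdfs_prefix in webhdfs_map.items():
        hdfs_prefix = hdfs_prefix.rstrip("/")
        # Require a '/' after the prefix so that 'hdfs://hdfs:54310' does not
        # accidentally match an ID on a different port such as ':543100'.
        if file_id.startswith(hdfs_prefix + "/"):
            return webhdfs_prefix.rstrip("/") + file_id[len(hdfs_prefix):]
    raise ValueError(f"No WebHDFS endpoint configured for {file_id!r}")


# Example using the map from this issue and a shortened, made-up path.
example_map = {
    "hdfs://hdfs:54310": "http://hdfs.api.wa.bl.uk/webhdfs/v1",
    "hdfs://h3nn:8020": "http://hdfs.h3.api.wa.bl.uk/webhdfs/v1",
}
print(to_webhdfs_url("hdfs://h3nn:8020/warcs/example.warc.gz", example_map))
```

A WebHDFS read would then append `?op=OPEN`, optionally with `offset` and `length`, to the resulting URL.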

Of course, this needs TrackDB to get updated properly from two HDFS services, using the correct prefixes.

anjackson commented 3 years ago

Having thought about it, I think it's much easier to get the process that populates TrackDB to put the right WebHDFS endpoint information directly into the TrackDB record. This client software then just has to pick it up and use it.

anjackson commented 3 years ago

Implemented in 7c60763b96f105b501a3f7b273307e49da9622af.

ukwa / ukwa-warc-server

Support multiple HDFS backends #8