Closed: anjackson closed this issue 3 years ago
Having thought about it, I think it's much easier to get the process that populates TrackDB to put the right WebHDFS endpoint information directly into the TrackDB record. This client software then just has to pick it up and use it.
Implemented in 7c60763b96f105b501a3f7b273307e49da9622af.
Allow multiple HDFS services, each using different IDs in TrackDB. e.g.
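For instance, IDs for the same file held on two different clusters might look like this (the cluster names and paths are hypothetical placeholders, not actual TrackDB records):

```
hdfs://hadoop1/heritrix/output/example.warc.gz
hdfs://hadoop2/heritrix/output/example.warc.gz
```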
We add a configuration map that converts each HDFS URL prefix into the appropriate WebHDFS one, passed in as a JSON-encoded environment variable:
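A sketch of how such a map could be loaded (the variable name `WEBHDFS_MAP`, cluster names, and endpoints here are illustrative assumptions, not the repository's actual configuration):

```python
import json
import os

# Hypothetical default: map each HDFS URL prefix to its WebHDFS endpoint.
# The cluster hostnames and ports are placeholders.
default_map = json.dumps({
    "hdfs://hadoop1": "http://hadoop1.example.org:14000/webhdfs/v1",
    "hdfs://hadoop2": "http://hadoop2.example.org:14000/webhdfs/v1",
})

# Read the map from a JSON-encoded environment variable, falling back
# to the default above if the variable is not set:
WEBHDFS_MAP = json.loads(os.environ.get("WEBHDFS_MAP", default_map))
```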
Currently, the code that looks up files in TrackDB only returns the path prefixed with `hdfs:`:

https://github.com/ukwa/ukwa-warc-server/blob/d8f514cb53dfc369b8e8411d8a212fd362e7a1ee/warcserver/file_finder.py#L62-L77
So this should be changed to use the `id` rather than the path.

The code that works out where to get the WARC data only supports the one hard-coded `hdfs:` prefix:

https://github.com/ukwa/ukwa-warc-server/blob/d8f514cb53dfc369b8e8411d8a212fd362e7a1ee/warc_server.py#L44-L45
This will need to be changed to pass the ID. The `from_webhdfs` function can then check the ID's URL prefix against `WEBHDFS_MAP`, substitute the `hdfs://...` prefix with the corresponding WebHDFS `http://...` prefix, and make the call. If the prefix isn't known, it can raise an error.

Of course, this needs TrackDB to get updated properly from two HDFS services, using the correct prefixes.
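The prefix check and substitution described above could be sketched roughly like this (the map contents, helper name, and `?op=OPEN` WebHDFS URL format are illustrative assumptions, not the repository's actual code):

```python
# Hypothetical prefix map; in practice this would be loaded from the
# JSON-encoded WEBHDFS_MAP environment variable. Names are placeholders.
WEBHDFS_MAP = {
    "hdfs://hadoop1": "http://hadoop1.example.org:14000/webhdfs/v1",
    "hdfs://hadoop2": "http://hadoop2.example.org:14000/webhdfs/v1",
}

def to_webhdfs_url(file_id):
    """Translate an hdfs://... file ID into a WebHDFS read URL.

    Raises ValueError if the ID's prefix is not in WEBHDFS_MAP.
    """
    for hdfs_prefix, webhdfs_prefix in WEBHDFS_MAP.items():
        if file_id.startswith(hdfs_prefix):
            # Swap the hdfs:// prefix for the WebHDFS http:// endpoint:
            path = file_id[len(hdfs_prefix):]
            return "%s%s?op=OPEN" % (webhdfs_prefix, path)
    raise ValueError("No WebHDFS endpoint known for %s" % file_id)
```

Raising on an unknown prefix (rather than falling back to a default) surfaces misconfigured or missing TrackDB prefixes immediately, instead of silently fetching from the wrong cluster.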