inab / WfExS-backend

Workflow Execution Service Backend
Apache License 2.0

Add metadata related to fetched URIs #1

Closed jmfernandez closed 3 years ago

jmfernandez commented 3 years ago

Right now WfExS does not keep a correspondence between URLs and downloaded files, because the filenames are hashes generated from the URL. However, there are several scenarios where additional upstream metadata is available, and future cases where a single URL corresponds to a collection of files. As an example of the latter, an ENCODE Experiment id or an EGA dataset id corresponds to more than one file, each possibly with its own download URL.

So there should be an intermediate metadata layer where these correspondences and the upstream metadata are kept. After this change, the name of each cached file should be the sha256 of its content, and each URI should translate to a JSON file named after the hash of the URI, containing the correspondences to the cached files and their origins.
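
To make the proposed layout concrete, here is a minimal sketch of it, not the actual WfExS implementation. The function names (`sha256_file`, `register_uri`), the JSON field names, the `.json` suffix, and the choice of sha256 for hashing the URI itself are all assumptions made for illustration. The proposal above fixes sha256 only for file contents.

```python
from __future__ import annotations

import hashlib
import json
import shutil
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's content, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_uri(
    cache_dir: Path,
    uri: str,
    fetched: list[tuple[str, Path]],
    upstream_metadata: dict | None = None,
) -> Path:
    """Store fetched files under their content hash and write the URI metadata.

    `fetched` is a list of (origin URL, temporary local path) pairs, so a
    single URI can map to several files. Hypothetical helper, for illustration.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    for origin_url, tmp_path in fetched:
        content_hash = sha256_file(tmp_path)
        cached_path = cache_dir / content_hash
        if cached_path.exists():
            # Identical content is already cached; drop the duplicate.
            tmp_path.unlink()
        else:
            shutil.move(str(tmp_path), str(cached_path))
        entries.append({"origin": origin_url, "sha256": content_hash})

    # Assumption: the URI is hashed with sha256 too, and the file gets a
    # ".json" suffix so it cannot collide with a content-named file.
    uri_hash = hashlib.sha256(uri.encode("utf-8")).hexdigest()
    meta_path = cache_dir / f"{uri_hash}.json"
    with meta_path.open("w", encoding="utf-8") as fh:
        json.dump(
            {
                "uri": uri,
                "files": entries,
                "upstream_metadata": upstream_metadata or {},
            },
            fh,
            indent=2,
        )
    return meta_path
```

Resolving a URI then means hashing it, reading the matching JSON file, and following each `sha256` entry to the cached file. Identical content fetched from different URIs is stored only once.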

Last but not least, upstream metadata should be gathered and preserved in the execution provenance.

jmfernandez commented 3 years ago

Fixed in commits 8c42d54600cd182cf9a15ff76eb247b7b0e46243 to a9352d8ffef5f88537f8b2774dc0593f91014a98, and tested in faea4ccd9d9c13279aaeed644f90803956c12977.