While working on streamlining PIS, particularly the target step, I've come across a discrepancy in the standard directory structure for the data used by this step.
The data fetched from buckets by PIS for this step are: Essentiality, subcellularLocations, hpa, hallmarks, TEPs, ChemicalProbes, TargetSafety, Tractability.
PIS selects and downloads the file with the latest creation date in a path. Usually, the files all sit directly in a single directory; ChemicalProbes is a good example.
However, for Tractability, the data is split into subfolders named after the releases.
Discussing this with @ireneisdoomed yesterday, she said this is due to the data being uploaded by a third party. Also, if there is no data in the current release subfolder, we should fall back to a previous one.
Given this structure is only happening in this folder, it would be good to assess if it can be flattened, removing those subfolders. This would save having to add custom logic into PIS for the retrieval of tractability data.
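To make the difference concrete, here is a rough sketch of the two retrieval paths. It is not the actual PIS code: the `Blob` class, the function names, and the assumption that release subfolders are named like `21.04` are all hypothetical, and the bucket listing is modelled as a plain list so the logic can be read in isolation.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for one entry of a bucket listing."""
    name: str          # full object path inside the bucket
    created: datetime  # creation timestamp reported by the bucket


def latest_in_dir(blobs: Iterable[Blob], directory: str) -> Optional[Blob]:
    """Flat layout: newest file sitting directly under `directory`."""
    prefix = directory.rstrip("/") + "/"
    candidates = []
    for blob in blobs:
        if not blob.name.startswith(prefix):
            continue
        rest = blob.name[len(prefix):]
        # Skip folder placeholders (empty rest) and anything nested deeper.
        if rest and "/" not in rest:
            candidates.append(blob)
    return max(candidates, key=lambda b: b.created, default=None)


def release_key(release: str) -> Tuple[int, ...]:
    """Assumption: releases are named '<year>.<month>', e.g. '21.04'."""
    return tuple(int(part) for part in release.split("."))


def latest_with_fallback(
    blobs: Iterable[Blob], directory: str, release: str
) -> Optional[Blob]:
    """Per-release subfolders: try `release`, then fall back to older ones."""
    blobs = list(blobs)
    prefix = directory.rstrip("/") + "/"
    releases = set()
    for blob in blobs:
        if blob.name.startswith(prefix):
            first = blob.name[len(prefix):].split("/", 1)[0]
            # Only keep subfolders that look like a release name.
            if re.fullmatch(r"\d+\.\d+", first):
                releases.add(first)
    wanted = release_key(release)
    eligible = sorted(
        (r for r in releases if release_key(r) <= wanted),
        key=release_key,
        reverse=True,
    )
    for candidate in eligible:
        found = latest_in_dir(blobs, prefix + candidate)
        if found is not None:
            return found
    return None
```

With a flat folder, `latest_in_dir` alone would cover Tractability just like the other resources, and `release_key` and `latest_with_fallback` would not be needed.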
The files themselves already have the release number as part of their name, so no information would be lost. Besides, the manifest @mbdebian put in place in PIS is a great idea as a source of truth as to which files have gone into a particular release.
Let me know what you think!