Open junjun-zhang opened 6 years ago
@baminou can you please mirror all of the EGA file transfer git repositories hosted internally under http://142.1.177.124/jt-hub to the public GitHub repo under https://github.com/icgc-dcc/?
There should be 5 or 6 of them.
The goal is to add a new fileCopy entry for files already transferred from EGA to Collaboratory. We first identify the SONG Analysis in Collab, which can be done by Analysis ID. The ID takes the form 'EGAZ000000000' or 'EGZR000000000'. We then need two fields specific to the EGA fileCopy entry: 'repoDataSetIds' and 'repoFileId'. Below we describe how the values for these fields can be found.
The git repos for ega transfer jobs have been mirrored from our internal server to GitHub:
The files containing the needed information are:
1. Job JSON file with EGA Dataset ID, pattern for file path/name: ega-file-transfer-to-collab-*-jtracker/blob/master/ega-file-transfer-to-collab.*.jtracker/job_state.completed/job.*/job.*.json. Fields: 'bundle_id' (use as 'repoDataBundleId', same as SONG's Analysis ID), 'ega_dataset_id' (use as 'repoDataSetIds').
2. Data file uploaded to Collab, pattern for file path/name: ega-file-transfer-to-collab-*-jtracker/blob/master/ega-file-transfer-to-collab.*.jtracker/job_state.completed/job.*/task_state.completed/worker.*/task.upload.*/task.upload.{ega_file_id}.json. Use the value {ega_file_id} in the file name as 'repoFileId'.
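The two lookups above could be sketched roughly as follows. This is a minimal illustration, not the actual tooling: the function names are hypothetical, it assumes 'bundle_id' and 'ega_dataset_id' sit at the top level of the job JSON, and it assumes 'repoDataSetIds' is a list (guessed from the plural name). The IDs in the usage note are made-up placeholders.

```python
import json
import re
from pathlib import PurePosixPath

# Matches the task file name 'task.upload.{ega_file_id}.json'.
TASK_FILE_RE = re.compile(r"^task\.upload\.(?P<ega_file_id>.+)\.json$")


def repo_file_id_from_task_path(path):
    """Extract {ega_file_id} from the last component of a task.upload path."""
    name = PurePosixPath(path).name
    match = TASK_FILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a task.upload file: {path}")
    return match.group("ega_file_id")


def file_copy_fields(job_json_text, task_path):
    """Collect the EGA fileCopy fields from a job JSON document and a task path.

    Assumption: 'bundle_id' and 'ega_dataset_id' are top-level keys of the
    job JSON, and 'repoDataSetIds' is expected to be a list.
    """
    job = json.loads(job_json_text)
    return {
        "repoDataBundleId": job["bundle_id"],
        "repoDataSetIds": [job["ega_dataset_id"]],
        "repoFileId": repo_file_id_from_task_path(task_path),
    }
```

For example, with placeholder IDs, `file_copy_fields('{"bundle_id": "EGAZ00000000001", "ega_dataset_id": "EGAD00000000001"}', "worker.w1/task.upload.EGAF00000000001/task.upload.EGAF00000000001.json")` would yield the three fields keyed by 'repoDataBundleId', 'repoDataSetIds' and 'repoFileId'.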
Example files:
Just to give two examples here:
repoDataBundleId, repoFileId, repoDataSetIds
EGA indexing is to be investigated in the future; do not work on this until we know more about those specs, OR until we have more EGA data to transfer to Collaboratory.
Here is one such file: https://dcc.icgc.org/repositories/files/FI743257. It originated from EGA and we transferred it to Collaboratory, but the file page only shows that the file exists in Collab, not in EGA.
We need a way to let the portal repo indexer know that an additional copy of the file exists in EGA as well. This could be as easy as detecting whether 'dataBundleId' starts with 'EGA'; if so, a copy of the file must exist in EGA. We may also need additional EGA-specific information for the file copy, such as 'repoFileId'. In this case we need the 'EGAFxxxxx' ID to be populated, so we will need a way to pass it to the indexer.
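The prefix check described above might look something like the sketch below. The function names and the shape of the returned entry are hypothetical and are not the indexer's actual API. The 'repoFileId' is taken as an optional argument because how it reaches the indexer is still open.

```python
def has_ega_copy(data_bundle_id):
    """Return True if the bundle ID suggests the file also exists in EGA.

    Assumption: any dataBundleId starting with 'EGA' came from EGA.
    """
    return isinstance(data_bundle_id, str) and data_bundle_id.startswith("EGA")


def ega_file_copy(data_bundle_id, repo_file_id=None):
    """Build a hypothetical extra fileCopy entry for EGA, or None if not EGA."""
    if not has_ega_copy(data_bundle_id):
        return None
    return {
        "repoDataBundleId": data_bundle_id,
        # The EGAFxxxxx ID; stays None until there is a way to pass it in.
        "repoFileId": repo_file_id,
    }
```

One caveat with a plain 'EGA' prefix test: it would not match the 'EGZR000000000' ID form mentioned earlier in this thread, so that form would need separate handling if it is spelled as intended.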