icgc-dcc / dcc-portal

Data portal for exploring and accessing data
https://dcc.icgc.org/

For data files transferred from EGA to Collab, add EGA repo to 'File copy' section of the portal file index #460

Open junjun-zhang opened 6 years ago

junjun-zhang commented 6 years ago

Here is one such file: https://dcc.icgc.org/repositories/files/FI743257. It originated in EGA and we transferred it to Collaboratory, but the file page shows the file as existing only in Collab, not in EGA.

We need a way to let the portal repo indexer know that an additional copy of the file exists in EGA as well. This could be as easy as detecting whether dataBundleId starts with EGA; if so, a copy of the file must exist in EGA.

We may also need additional EGA-specific information for the file copy, such as repoFileId. In this case the EGAFxxxxx ID needs to be populated, so we will need a way to pass it to the indexer.
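A minimal sketch of the prefix check proposed above, in Python. The function name `has_ega_copy` is hypothetical and this is not the indexer's actual code; it only assumes that dataBundleId is available as a string.

```python
def has_ega_copy(data_bundle_id):
    """Return True if the bundle ID suggests the file also exists in EGA.

    Assumption: a dataBundleId that starts with "EGA" means the file
    originated from EGA, as proposed in this issue.
    """
    if not data_bundle_id:
        # Missing or empty IDs cannot indicate an EGA copy.
        return False
    return data_bundle_id.startswith("EGA")
```

The check is deliberately strict about the "EGA" prefix; any ID form that does not start with it would need to be handled separately.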

junjun-zhang commented 6 years ago

@baminou can you please mirror all of the EGA file transfer git repositories hosted internally under http://142.1.177.124/jt-hub to the public GitHub organization at https://github.com/icgc-dcc/?

There should be 5 or 6 of them.

junjun-zhang commented 6 years ago

The goal is to add a new fileCopy entry for files already transferred from EGA to Collaboratory. We first identify the SONG Analysis in Collab; this can be done by Analysis ID, which takes the form EGAZ000000000 or EGZR000000000. We then need two fields specific to the EGA fileCopy entry: repoDataSetIds and repoFileId. The details below explain where the values for these fields can be found.

The git repos for the EGA transfer jobs have been mirrored from our internal server to GitHub:

The files containing the needed information are:

1. The job JSON file, which holds the EGA Dataset ID. File path/name pattern: ega-file-transfer-to-collab-*-jtracker/blob/master/ega-file-transfer-to-collab.*.jtracker/job_state.completed/job.*/job.*.json. Fields: 'bundle_id' (use as 'repoDataBundleId', same as SONG's Analysis ID) and 'ega_dataset_id' (use as 'repoDataSetIds').
2. The task file for the data file uploaded to Collab. File path/name pattern: ega-file-transfer-to-collab-*-jtracker/blob/master/ega-file-transfer-to-collab.*.jtracker/job_state.completed/job.*/task_state.completed/worker.*/task.upload.*/task.upload.{ega_file_id}.json. Use the {ega_file_id} value in the file name as 'repoFileId'.
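The two lookups above could be combined roughly as follows. This is only a sketch under assumptions: the helper names are hypothetical, the job JSON is assumed to be already parsed into a dict, and the shape of the returned fileCopy entry (just the three fields named in this comment) is a guess rather than the indexer's real schema.

```python
import re

# Matches the upload task file name pattern: task.upload.{ega_file_id}.json
UPLOAD_TASK_RE = re.compile(r"^task\.upload\.(?P<ega_file_id>.+)\.json$")


def ega_file_id_from_task_name(file_name):
    """Extract {ega_file_id} from an upload task file name, or None."""
    match = UPLOAD_TASK_RE.match(file_name)
    return match.group("ega_file_id") if match else None


def build_ega_file_copy(job, task_file_name):
    """Build the EGA-specific fileCopy fields (hypothetical entry shape).

    job: dict parsed from a job.*.json file
    task_file_name: base name of a task.upload.{ega_file_id}.json file
    """
    ega_file_id = ega_file_id_from_task_name(task_file_name)
    if ega_file_id is None:
        raise ValueError("not an upload task file: %s" % task_file_name)

    dataset = job["ega_dataset_id"]
    # Assumption: repoDataSetIds is a list; wrap a single ID if needed.
    dataset_ids = dataset if isinstance(dataset, list) else [dataset]

    return {
        "repoDataBundleId": job["bundle_id"],
        "repoDataSetIds": dataset_ids,
        "repoFileId": ega_file_id,
    }
```

The file name is matched on its base name only, so the caller is expected to strip the directory part of the path first (for example with `os.path.basename`).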

Example files:

junjun-zhang commented 6 years ago

Just to give two examples here:

rosibaj commented 5 years ago

EGA indexing is to be investigated in the future. Do not work on this until we know more about those specs, or until we have more EGA data to transfer to Collaboratory.