nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts

New export menu item for cavatica scenario #334

Closed karlcz closed 2 years ago

karlcz commented 2 years ago

A simplified, browser-based scenario has been specified:

  1. User exports a flat manifest file to client system from portal UI
  2. User imports manifest file to Cavatica via its web UI

For this we need a new export menu option on the personal collection (and file?) table:

~If Cavatica requires any special default values rather than blank fields for metadata we do not always have (i.e. nullable C2M2 fields), we might also need a transformer module in the export service to rewrite the NULL/blank values in the response from the catalog.~

RLC-DCPPC commented 2 years ago

mapping from C2M2 to NCPI minimal metadata format

karlcz commented 2 years ago

A proof-of-concept query produces output like the following, based on structures we already have in the portal:

{
  "study_registration":"https://data.4dnucleome.org/",
  "study_id":"4DN",
  "participant_id":"40bf9373-5631-468a-bca2-7a63f564982f",
  "specimen_id":"2f7c8bc5-841c-4e3c-992a-96eb5aab8799",
  "experimental_strategy":{
    "nid": 2,
    "name": "imaging assay",
    "description": "An assay that produces a picture of an entity."
  },
  "drs_id":"https://data.4dnucleome.org/4DNFIR11N231",
  "file_format":{
    "nid": 9,
    "name": "TIFF",
    "description": "A versatile bitmap format."
  },
  "file_nid":99
}

We'll need to add the transformation module to the export service, which will:

  1. rewrite the two nested objects { ... } in the experimental_strategy and file_format fields so that each exposes only the value of its name field
  2. drop the file_nid field, which we need for deriva's paginated output but do not want in the Cavatica manifest
  3. add the always-null field "fhir_document_reference": null
  4. transcode from JSON to CSV
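The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the export service's actual transformer: the names `MANIFEST_FIELDS`, `transform_row`, and `rows_to_csv` are hypothetical, the column order is simply taken from the sample object, and the null field is written as an empty CSV cell.

```python
import csv
import io

# Hypothetical column order: the sample object's keys, minus file_nid,
# plus the always-null fhir_document_reference field.
MANIFEST_FIELDS = [
    "study_registration", "study_id", "participant_id", "specimen_id",
    "experimental_strategy", "drs_id", "file_format",
    "fhir_document_reference",
]


def transform_row(row):
    """Flatten one catalog row into a manifest row (illustrative helper)."""
    out = dict(row)
    # Step 1: replace each nested term object with its "name" value.
    for key in ("experimental_strategy", "file_format"):
        term = out.get(key)
        out[key] = term.get("name") if isinstance(term, dict) else term
    # Step 2: drop the key that is only needed for paginated output.
    out.pop("file_nid", None)
    # Step 3: add the always-null field.
    out["fhir_document_reference"] = None
    return out


def rows_to_csv(rows):
    """Step 4: transcode the transformed rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=MANIFEST_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # The csv module writes None as an empty field (an assumption about
        # how null should appear in the manifest).
        writer.writerow(transform_row(row))
    return buf.getvalue()


SAMPLE = {
    "study_registration": "https://data.4dnucleome.org/",
    "study_id": "4DN",
    "participant_id": "40bf9373-5631-468a-bca2-7a63f564982f",
    "specimen_id": "2f7c8bc5-841c-4e3c-992a-96eb5aab8799",
    "experimental_strategy": {"nid": 2, "name": "imaging assay"},
    "drs_id": "https://data.4dnucleome.org/4DNFIR11N231",
    "file_format": {"nid": 9, "name": "TIFF"},
    "file_nid": 99,
}

if __name__ == "__main__":
    print(rows_to_csv([SAMPLE]), end="")
```

Keeping the row transformation separate from the CSV writer means the output format can change without touching the field logic.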

mikedarcy commented 2 years ago

I've written the transform code to handle this. It is still not clear what the exact format for this file needs to be. Is it TSV/CSV, with/without a header row, and do the fields need to be quoted or not? We need this info from the Cavatica folks. Once we have that, the transform code can be parameterized accordingly to match their input format specification.
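To make the open questions concrete, here is a minimal sketch of what such parameterization might look like using Python's standard csv module. It is not the actual transform code; `write_manifest` and its parameter names (`delimiter`, `header`, `quote_all`) are hypothetical.

```python
import csv
import io


def write_manifest(rows, fieldnames, delimiter=",", header=True, quote_all=False):
    """Serialize already-flattened rows; every parameter here is illustrative."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        # QUOTE_ALL quotes every field; QUOTE_MINIMAL quotes only fields that
        # contain the delimiter, a quote character, or a line break.
        quoting=csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    if header:
        writer.writerow(fieldnames)
    for row in rows:
        writer.writerow([row.get(name) for name in fieldnames])
    return buf.getvalue()


if __name__ == "__main__":
    rows = [{"study_id": "4DN", "file_format": "FASTQ"}]
    # TSV with a header row and every field quoted.
    print(write_manifest(rows, ["study_id", "file_format"],
                         delimiter="\t", quote_all=True), end="")
```

With the three choices exposed as arguments, switching between TSV and CSV, or turning the header row and quoting on or off, would not require changes to the transformation logic.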

Here's a sample of the output with TSV and field quoting on.

"study_registration"    "study_id"  "participant_id"    "specimen_id"   "experimental_strategy" "drs_id"    "file_format"   "fhir_document_reference"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIYY2BFE8"  "FASTQ" "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIB2TRTRB"  "FASTQ" "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFILMEJ3VP"  "BAM"   "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIHAA2UQA"  "bigWig"    "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIYRM7B9C"  "bigWig"    "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIKDO2JBN"  "TSV"   "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFI4BW3NJU"  "TSV"   "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIGASMK31"  "FASTQ" "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIGIOYGZE"  "FASTQ" "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIX3KDSEZ"  "BAM"   "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFI4CHO6W2"  "bigWig"    "null"
"https://data.4dnucleome.org"   "4DN"   "68172441-97c4-40cc-b73f-d0f5dbc5cc05"  "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a"  "RNA-seq assay" "https://data.4dnucleome.org/4DNFIHYEMQ9H"  "bigWig"    "null"
"https://data.4dnucleome.org"   "4DN"   "40bf9373-5631-468a-bca2-7a63f564982f"  "2f7c8bc5-841c-4e3c-992a-96eb5aab8799"  "imaging assay" "https://data.4dnucleome.org/4DNFI6ESS2HI"  "TIFF"  "null"
"https://data.4dnucleome.org"   "4DN"   "40bf9373-5631-468a-bca2-7a63f564982f"  "2f7c8bc5-841c-4e3c-992a-96eb5aab8799"  "imaging assay" "https://data.4dnucleome.org/4DNFITBB4TZW"  "TIFF"  "null"
"https://data.4dnucleome.org"   "4DN"   "40bf9373-5631-468a-bca2-7a63f564982f"  "2f7c8bc5-841c-4e3c-992a-96eb5aab8799"  "imaging assay" "https://data.4dnucleome.org/4DNFI5NZRDNB"  "TIFF"  "null"
"https://data.4dnucleome.org"   "4DN"   "40bf9373-5631-468a-bca2-7a63f564982f"  "2f7c8bc5-841c-4e3c-992a-96eb5aab8799"  "imaging assay" "https://data.4dnucleome.org/4DNFIKF54SOC"  "TIFF"  "null"
"https://data.4dnucleome.org"   "4DN"   "40bf9373-5631-468a-bca2-7a63f564982f"  "2f7c8bc5-841c-4e3c-992a-96eb5aab8799"  "imaging assay" "https://data.4dnucleome.org/4DNFIRBGCFUC"  "TIFF"  "null"
"https://data.4dnucleome.org"   "4DN"   "40bf9373-5631-468a-bca2-7a63f564982f"  "2f7c8bc5-841c-4e3c-992a-96eb5aab8799"  "imaging assay" "https://data.4dnucleome.org/4DNFIFAEGTBE"  "TIFF"  "null"

karlcz commented 2 years ago

A prototype of this is now available in the Export menu in app-dev catalog 1 for both the CFDE:file and personal_collection tables. It will likely give timeouts on the file table unless you narrow the search to a small enough set of files.

For now, the menu item is labeled "NCPI file manifest (work in progress)"

karlcz commented 2 years ago

The prototype has been updated to use CSV (comma separators).

karlcz commented 2 years ago

Following revised guidance, the prototype now includes a file_name field and renames the persistent ID field from drs_id to drs_uri. We also received clarification that optional fields may be omitted, so we dropped the fhir_document_reference field, which has no mapping to C2M2.
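A minimal sketch of those field changes, assuming each row already carries a file_name value (how it is obtained from the catalog is not shown here). `REVISED_FIELDS`, `revise_row`, and `to_csv_line` are hypothetical names, not part of the export service.

```python
import csv
import io

# Column order assumed for the revised manifest.
REVISED_FIELDS = [
    "file_name", "drs_uri", "study_registration", "study_id",
    "participant_id", "specimen_id", "experimental_strategy", "file_format",
]


def revise_row(row):
    """Apply the revised guidance to one flattened row (illustrative only)."""
    # Drop the optional field that has no mapping to C2M2.
    out = {k: v for k, v in row.items() if k != "fhir_document_reference"}
    # Rename the persistent ID field.
    if "drs_id" in out:
        out["drs_uri"] = out.pop("drs_id")
    return {name: out.get(name) for name in REVISED_FIELDS}


def to_csv_line(row):
    """Write one row as CSV; values containing commas are quoted by default."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(
        [row[name] for name in REVISED_FIELDS]
    )
    return buf.getvalue()


if __name__ == "__main__":
    old_style = {
        "file_name": "SRS146847_hmrac2.tar.bz2",
        "drs_id": "drs://drs.hmpdacc.org/DCMXexUzVMbF",
        "study_registration": "tag:hmpdacc.org,2022-04-04:",
        "study_id": "HHS",
        "participant_id": "HHS_663835652",
        "specimen_id": "SRS146847",
        "experimental_strategy": "whole metagenome sequencing assay",
        "file_format": "Sequence record format",
        "fhir_document_reference": None,
    }
    print(to_csv_line(revise_row(old_style)), end="")
```

Because the study_registration value contains a comma, the csv writer wraps that one field in double quotes while leaving the others bare.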

Here's a brief sample output

file_name,drs_uri,study_registration,study_id,participant_id,specimen_id,experimental_strategy,file_format
SRS146847_hmrac2.tar.bz2,drs://drs.hmpdacc.org/DCMXexUzVMbF,"tag:hmpdacc.org,2022-04-04:",HHS,HHS_663835652,SRS146847,whole metagenome sequencing assay,Sequence record format
SRS104499_hmrac2.tar.bz2,drs://drs.hmpdacc.org/17yUCNG76ib91,"tag:hmpdacc.org,2022-04-04:",HHS,HHS_763719065,SRS104499,whole metagenome sequencing assay,Sequence record format

karlcz commented 2 years ago

In the latest feedback, we were asked to rename file_name to just name. We also decided to keep calling this export option "NCPI file manifest" for now.

RLC-DCPPC commented 2 years ago

I’ve passed your question on to the KF/SBG/Cavatica team.

On Thu, Apr 7, 2022 at 8:00 PM mikedarcy @.***> wrote:

I've written the transform code to handle this. It is still not clear what the exact format for this file needs to be. Is it TSV/CSV, with/without a header row, and do the fields need to be quoted or not? We need this info from the Cavatica folks. Once we have that, the transform code can be parameterized accordingly to match their input format specification.

ctb commented 2 years ago

(In case you're all wondering what happened here with Bob's latest e-mail, it looks like GitHub finally got around to processing some long-delayed e-mail responses to issues. It happened a bunch with me as well! The e-mail immediately above was actually sent back in April, I bet.)