Closed karlcz closed 2 years ago
A proof-of-concept query produces output like the following, based on existing structures we have in the portal:
{
  "study_registration": "https://data.4dnucleome.org/",
  "study_id": "4DN",
  "participant_id": "40bf9373-5631-468a-bca2-7a63f564982f",
  "specimen_id": "2f7c8bc5-841c-4e3c-992a-96eb5aab8799",
  "experimental_strategy": {
    "nid": 2,
    "name": "imaging assay",
    "description": "An assay that produces a picture of an entity."
  },
  "drs_id": "https://data.4dnucleome.org/4DNFIR11N231",
  "file_format": {
    "nid": 9,
    "name": "TIFF",
    "description": "A versatile bitmap format."
  },
  "file_nid": 99
}
We'll need to add the transformation module to the export service, which will:
- flatten the { ... } objects to just expose the values from their name fields for the experimental_strategy and file_format fields
- drop the file_nid field, which we need for deriva's paginated output but do not want in the Cavatica manifest
- add a "fhir_document_reference": null field

I've written the transform code to handle this. It is still not clear what the exact format for this file needs to be. Is it TSV or CSV, with or without a header row, and do the fields need to be quoted or not? We need this info from the Cavatica folks. Once we have that, the transform code can be parameterized accordingly to match their input format specification.
Here's a sample of the output with TSV and field quoting on.
"study_registration" "study_id" "participant_id" "specimen_id" "experimental_strategy" "drs_id" "file_format" "fhir_document_reference"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIYY2BFE8" "FASTQ" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIB2TRTRB" "FASTQ" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60" "RNA-seq assay" "https://data.4dnucleome.org/4DNFILMEJ3VP" "BAM" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIHAA2UQA" "bigWig" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIYRM7B9C" "bigWig" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIKDO2JBN" "TSV" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "6b33c51b-e5c6-46b3-8a57-bb3187ca1c60" "RNA-seq assay" "https://data.4dnucleome.org/4DNFI4BW3NJU" "TSV" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIGASMK31" "FASTQ" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIGIOYGZE" "FASTQ" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIX3KDSEZ" "BAM" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a" "RNA-seq assay" "https://data.4dnucleome.org/4DNFI4CHO6W2" "bigWig" "null"
"https://data.4dnucleome.org" "4DN" "68172441-97c4-40cc-b73f-d0f5dbc5cc05" "a4bf5b3c-4de4-48e3-b2a2-c4d4879cc70a" "RNA-seq assay" "https://data.4dnucleome.org/4DNFIHYEMQ9H" "bigWig" "null"
"https://data.4dnucleome.org" "4DN" "40bf9373-5631-468a-bca2-7a63f564982f" "2f7c8bc5-841c-4e3c-992a-96eb5aab8799" "imaging assay" "https://data.4dnucleome.org/4DNFI6ESS2HI" "TIFF" "null"
"https://data.4dnucleome.org" "4DN" "40bf9373-5631-468a-bca2-7a63f564982f" "2f7c8bc5-841c-4e3c-992a-96eb5aab8799" "imaging assay" "https://data.4dnucleome.org/4DNFITBB4TZW" "TIFF" "null"
"https://data.4dnucleome.org" "4DN" "40bf9373-5631-468a-bca2-7a63f564982f" "2f7c8bc5-841c-4e3c-992a-96eb5aab8799" "imaging assay" "https://data.4dnucleome.org/4DNFI5NZRDNB" "TIFF" "null"
"https://data.4dnucleome.org" "4DN" "40bf9373-5631-468a-bca2-7a63f564982f" "2f7c8bc5-841c-4e3c-992a-96eb5aab8799" "imaging assay" "https://data.4dnucleome.org/4DNFIKF54SOC" "TIFF" "null"
"https://data.4dnucleome.org" "4DN" "40bf9373-5631-468a-bca2-7a63f564982f" "2f7c8bc5-841c-4e3c-992a-96eb5aab8799" "imaging assay" "https://data.4dnucleome.org/4DNFIRBGCFUC" "TIFF" "null"
"https://data.4dnucleome.org" "4DN" "40bf9373-5631-468a-bca2-7a63f564982f" "2f7c8bc5-841c-4e3c-992a-96eb5aab8799" "imaging assay" "https://data.4dnucleome.org/4DNFIFAEGTBE" "TIFF" "null"
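For readers who want to see the shape of such a transform, here is a minimal sketch in Python. It is not the transform code mentioned above; the function names `transform_record` and `write_manifest`, and the `delimiter`, `header`, and `quote_all` parameters, are hypothetical and only illustrate how the open format questions (TSV vs. CSV, header row, quoting) could be left as options. Rendering a missing value as the literal string "null" is an assumption taken from the sample rows.

```python
import csv
import io

# Column order copied from the sample output in this thread.
COLUMNS = [
    "study_registration", "study_id", "participant_id", "specimen_id",
    "experimental_strategy", "drs_id", "file_format", "fhir_document_reference",
]


def transform_record(rec):
    """Flatten one query result into a manifest row (hypothetical helper)."""
    row = dict(rec)
    # Expose only the "name" of each nested vocabulary term.
    for key in ("experimental_strategy", "file_format"):
        term = row.get(key)
        row[key] = term.get("name") if isinstance(term, dict) else term
    # file_nid is only needed for paginated queries, not for the manifest.
    row.pop("file_nid", None)
    # There is no source for this field, so carry an explicit null.
    row.setdefault("fhir_document_reference", None)
    return row


def write_manifest(records, delimiter="\t", header=True, quote_all=True):
    """Serialize records as delimited text; every format choice is a parameter."""
    out = io.StringIO()
    writer = csv.writer(
        out,
        delimiter=delimiter,
        quoting=csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    if header:
        writer.writerow(COLUMNS)
    for rec in records:
        row = transform_record(rec)
        # Assumption: missing values appear as the string "null", as in the sample.
        writer.writerow(["null" if row.get(c) is None else row[c] for c in COLUMNS])
    return out.getvalue()
```

Calling `write_manifest(records)` with the defaults gives quoted TSV with a header row, like the sample; `write_manifest(records, delimiter=",", header=False, quote_all=False)` would give unquoted CSV without a header, should that turn out to be the required format.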
A prototype of this is now available in the Export menu in app-dev catalog 1 for both the CFDE:file and personal_collection tables. It will likely give timeouts on the file table unless you narrow the search to a small enough set of files.
For now, the menu item is labeled "NCPI file manifest (work in progress)"
The prototype has been updated to use CSV (comma separators).
The prototype has been updated per revised guidance to include a file_name field and to rename the persistent ID field as drs_uri. We also received clarification that optional fields may be omitted, so we drop the fhir_document_reference field, which has no mapping to C2M2.
Here's a brief sample of the output:
file_name,drs_uri,study_registration,study_id,participant_id,specimen_id,experimental_strategy,file_format
SRS146847_hmrac2.tar.bz2,drs://drs.hmpdacc.org/DCMXexUzVMbF,"tag:hmpdacc.org,2022-04-04:",HHS,HHS_663835652,SRS146847,whole metagenome sequencing assay,Sequence record format
SRS104499_hmrac2.tar.bz2,drs://drs.hmpdacc.org/17yUCNG76ib91,"tag:hmpdacc.org,2022-04-04:",HHS,HHS_763719065,SRS104499,whole metagenome sequencing assay,Sequence record format
We were asked to rename file_name to just name in the last feedback. We also decided to keep calling this export option "NCPI file manifest" for now.
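The renames and the dropped field from the last few comments can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the export service's code: `RENAMES`, `revise_row`, and `to_csv` are hypothetical names, and the input rows are assumed to still use the earlier field names (file_name, drs_id). It also shows why only the study_registration value is quoted in the CSV sample: Python's `csv` module quotes a field by default only when it contains the delimiter or another special character.

```python
import csv
import io

# Revised column order from the CSV sample, with file_name renamed to name.
COLUMNS = [
    "name", "drs_uri", "study_registration", "study_id",
    "participant_id", "specimen_id", "experimental_strategy", "file_format",
]

# Hypothetical mapping from the earlier prototype field names to the revised ones.
RENAMES = {"file_name": "name", "drs_id": "drs_uri"}


def revise_row(row):
    """Apply the renames and drop the optional fhir_document_reference field."""
    revised = {RENAMES.get(key, key): value for key, value in row.items()}
    revised.pop("fhir_document_reference", None)
    return revised


def to_csv(rows):
    """Write rows as comma-separated text with a header and minimal quoting."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(revise_row(row))
    return out.getvalue()
```

With minimal quoting, a value such as tag:hmpdacc.org,2022-04-04: is wrapped in double quotes because it contains a comma, while the other fields are written bare.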
I’ve passed your question on to the KF/SBG/Cavatica team.
On Thu, Apr 7, 2022 at 8:00 PM mikedarcy @.***> wrote:
> I've written the transform code to handle this. [...]
(In case you're all wondering what happened here with Bob's latest e-mail, it looks like GitHub finally got around to processing some long-delayed e-mail responses to issues. It happened a bunch with me as well! The e-mail immediately above was actually sent back in April, I bet.)
A simplified, browser-based scenario has been specified:
For this we need a new export menu option on the personal collection (and file?) table:
~If Cavatica requires any special default values rather than blank fields for metadata we do not always have (i.e. nullable C2M2 fields), we might also need a transformer module in the export service to rewrite the NULL/blank values in the response from the catalog.~