microbiomedata / WorkflowPlanning

This is primarily a repo for capturing policies and discussions around the different workflows for NMDC. This is also used for project management related pieces.

GFF3 outputs #23

Open hubin-keio opened 3 years ago

hubin-keio commented 3 years ago

The list of features (genes, etc.) is saved in GFF3, a tab-delimited text file format. Would it be possible for the ETL process to read those GFF3 files directly? We can explain the fields if needed. Having the workflows generate JSON outputs from the GFF3 is doable, but the resulting files will be several times larger than the GFF3 files. Thoughts?
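To make the size concern concrete, here is a minimal Python sketch that turns one GFF3 feature line into a JSON record and compares byte counts. The column names follow the nine fields of the GFF3 specification; the feature line, the `example_tool` source, and the function name are invented for illustration and are not taken from any NMDC workflow. The growth comes mostly from repeating the key names in every record, so the actual ratio depends on the JSON schema chosen.

```python
import json

# The nine columns defined by the GFF3 specification, in order.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def gff3_line_to_record(line):
    """Split one tab-delimited GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # GFF3 coordinates are 1-based integers.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Hypothetical feature line, for illustration only.
example_line = ("contig_1\texample_tool\tCDS\t3\t1205\t.\t+\t0\t"
                "ID=gene_0001;product=hypothetical protein\n")

record = gff3_line_to_record(example_line)
json_line = json.dumps(record) + "\n"

# Compare the encoded size of the same feature in both formats.
gff3_bytes = len(example_line.encode("utf-8"))
json_bytes = len(json_line.encode("utf-8"))
print(f"GFF3: {gff3_bytes} bytes, JSON: {json_bytes} bytes")
```

Writing one JSON object per line, as here, at least keeps the output streamable, but it does not remove the per-record key overhead.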

cmungall commented 3 years ago

I believe we discussed this and agreed that JSON is the lingua franca - cc @dehays

Note that we have a PR open for functional annotation (essentially column 9 of the GFF): https://github.com/microbiomedata/nmdc-metadata/pull/178

More context here: https://github.com/microbiomedata/nmdc-metadata/issues/176
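For readers unfamiliar with column 9, the attributes field mentioned above, the following Python sketch shows how it could be parsed under the GFF3 specification's rules: `tag=value` pairs separated by semicolons, multiple values separated by commas, and reserved characters percent-encoded. The function name and the example tags (`product`, `ec_number`, `note`) are hypothetical and are not taken from the linked PR or from the NMDC schema.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse GFF3 column 9 ("tag=value;tag=value") into a dict of value lists."""
    attributes = {}
    column9 = column9.strip()
    if column9 in ("", "."):
        return attributes  # "." marks an empty field in GFF3
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas first and decode afterwards, so that an encoded
        # comma (%2C) stays inside a single value.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


# Hypothetical attribute string, for illustration only.
example = ("ID=gene_0001;product=hypothetical protein;"
           "ec_number=EC:1.1.1.1,EC:2.7.7.7;note=a%2Cb")
parsed = parse_gff3_attributes(example)
print(parsed)
```

Every value is returned as a list, so single-valued and multi-valued tags have the same shape when serialized to JSON.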

hubin-keio commented 3 years ago

I am under the impression that Aim 3/Kitware knows how to read the GFF3 files and extract the information from them. That seems to be a more efficient way to ingest the data, since generating JSON files is really an extra layer if the GFF3 files can be used directly for ingestion. Perhaps we can briefly touch on this point in the retreat meeting and follow up with Kitware to check which route might work better?

dwinston commented 3 years ago

I think the activity of producing the GFF3 is separable from the activity of generating JSON from GFF3. I also think the activity of authorship is separable from the activity of execution.

Thus, one potential route here is that @hubin-keio et al. author the GFF3->JSON component, but it does not have to run at the same time as the upstream production of GFF3, and a different team (e.g. Aim 3/Kitware) can be responsible for executing it.

hubin-keio commented 3 years ago

@dwinston, I understand that JSON generation can be separated from GFF3 generation. Please see my comment here regarding GFF3->JSON conversion: https://github.com/microbiomedata/nmdc-metadata/issues/184#issuecomment-748351684