Open hubin-keio opened 3 years ago
I believe we discussed this and we agreed JSON is the lingua franca - cc @dehays
Note we have a PR open for functional annotation (essentially column 9 of the GFF): https://github.com/microbiomedata/nmdc-metadata/pull/178
More context here: https://github.com/microbiomedata/nmdc-metadata/issues/176
My impression is that Aim 3/Kitware already knows how to read the GFF3 files and extract the information from them. That seems a more efficient way to ingest the data, since generating JSON files is really an extra layer if the GFF3 files can be used directly for ingestion. Perhaps we can briefly touch on this point in the retreat meeting and follow up with Kitware to check which route might work better?
I think the activity of producing the GFF3 is separable from the activity of generating JSON from GFF3. I also think the activity of authorship is separable from the activity of execution.
Thus, one potential route here is that @hubin-keio et al. can author the GFF3->JSON component, but the time it runs does not have to be at the same time as the upstream production of GFF3, and a different team (e.g. Aim 3/Kitware) can be responsible for executing the GFF3->JSON component.
@dwinston, I understand that JSON generation can be separated from GFF3 generation. Please see my comment here regarding GFF3->JSON conversion: https://github.com/microbiomedata/nmdc-metadata/issues/184#issuecomment-748351684 .
The list of features (genes, etc.) is saved in GFF3 format (a tab-delimited text file). Will it be possible for the ETL process to read those GFF3 files directly? We can explain the fields if needed. Having the workflows generate JSON outputs from the GFF3 is doable, but it will create files several times larger than the GFF3 files. Thoughts?
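To make the trade-off concrete, here is a minimal sketch of what reading a GFF3 feature line directly could look like, and how the same record grows once serialized as JSON. It is only an illustration, not the NMDC ETL code or the workflow's converter: the function name `parse_gff3_line`, the JSON key names, and the sample line are all made up for this example. The only thing taken from the GFF3 format itself is the nine tab-separated columns, with column 9 holding `tag=value` pairs separated by semicolons.

```python
import json
from urllib.parse import unquote

# Names of the first eight GFF3 columns; column 9 (attributes) is handled separately.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one GFF3 line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")

    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Column 9: semicolon-separated tag=value pairs; a tag may carry several
    # comma-separated values, and reserved characters are percent-encoded.
    attributes = {}
    if fields[8] != ".":
        for pair in fields[8].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Hypothetical feature line, invented for illustration only.
SAMPLE = "contig_1\tProdigal v2.6.3\tCDS\t3\t1100\t54.2\t+\t0\tID=gene_0001;product=hypothetical protein"

record = parse_gff3_line(SAMPLE)
as_json = json.dumps(record)
print(as_json)
print(f"GFF3 line: {len(SAMPLE)} characters, JSON record: {len(as_json)} characters")
```

The JSON record repeats every key name for every feature, which is where the size increase comes from. The exact ratio depends on the JSON layout chosen, so this sketch does not say anything about the real file sizes.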