microbiomedata / metaMAGs

Workflow for metagenome assembled genomes generation.

review MAG output files #22

Closed (aclum closed this issue 7 months ago)

aclum commented 8 months ago

This output list is from Neha, from summer 2023. Specifically, I'm not sure whether we are saving metabat-bins.tar.gz, which would be needed to implement the eukaryotic binning.

    output{
        #retaining the sdb is very important for loading into IMG
        File? sdb = "mbin.sdb"
        #flag file to indicate that no bins were generated
        File? nobins = "mbin.nobins"
        #flag file to indicate that pipeline ran through without error
        File? success = "mbin.success"
        #checkm results
        File? checkm = "checkm_qa.out"
        #gtdbtk results
        File? bacsum = "gtdbtk_output/gtdbtk.bac120.summary.tsv"
        File? arcsum = "gtdbtk_output/gtdbtk.ar122.summary.tsv"
        #retaining the metabat-bins folder is important for downstream Euk pipeline
        #NOTE: if the lineage SDB is provided, change below to 'filtered-metabat-bins.tar.gz'
        File? lqbins = "metabat-bins.tar.gz"
        #hq+mq bins folder
        File? hqmqbins = "hqmq-metabat-bins.tar.gz"
        #optional to retain depth file, only for reprocessing
        File? depth = "metabat.depth"
    } 
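
One way to answer the question above is to check a finished run directory against this output list. The following is a minimal sketch, not part of the workflow: the function name `report_outputs` and the idea of passing a run directory are assumptions, and the relative paths are copied from the WDL output block.

```python
from pathlib import Path

# Relative paths copied from the WDL output block above.
EXPECTED_OUTPUTS = {
    "sdb": "mbin.sdb",
    "nobins": "mbin.nobins",
    "success": "mbin.success",
    "checkm": "checkm_qa.out",
    "bacsum": "gtdbtk_output/gtdbtk.bac120.summary.tsv",
    "arcsum": "gtdbtk_output/gtdbtk.ar122.summary.tsv",
    "lqbins": "metabat-bins.tar.gz",
    "hqmqbins": "hqmq-metabat-bins.tar.gz",
    "depth": "metabat.depth",
}


def report_outputs(run_dir):
    """Map each WDL output name to True if its file exists under run_dir.

    `run_dir` is a hypothetical path to one completed MAGs run.
    """
    root = Path(run_dir)
    return {name: (root / rel).is_file() for name, rel in EXPECTED_OUTPUTS.items()}


if __name__ == "__main__":
    import sys

    # Usage: python check_outputs.py /path/to/run_dir
    for name, present in report_outputs(sys.argv[1]).items():
        print(f"{name:10s} {'present' if present else 'MISSING'}")
```

For a run directory that contains only metabat-bins.tar.gz, `report_outputs` would mark `lqbins` as present and every other output as missing.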

aclum commented 7 months ago

When schema changes aren't additive, we also need a migration script so that the data in Mongo matches the schema. Here is a template for this: https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/migrators/migrator_from_A_B_C_to_X_Y_Z.py

@eecavanna and @brynnz22 have written some of these, so they can help. We need a migration script that migrates 'Metagenome Bins Compression File' to 'Metagenome HQMQ Bins Compression File'.
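
As a rough illustration of the per-document change such a script would make, here is a minimal sketch. It does not use the nmdc-schema migration framework, and the field name `data_object_type` is an assumption about where this value is stored; only the two type strings come from this thread.

```python
OLD_TYPE = "Metagenome Bins Compression File"
NEW_TYPE = "Metagenome HQMQ Bins Compression File"


def migrate_data_object(doc, field="data_object_type"):
    """Rename the old file type on one document.

    Returns a new dict when the document carries OLD_TYPE in `field`
    (an assumed field name); any other document is returned unchanged.
    """
    if doc.get(field) == OLD_TYPE:
        return {**doc, field: NEW_TYPE}
    return doc


# Hypothetical documents, for illustration only.
docs = [
    {"id": "example:1", "data_object_type": OLD_TYPE},
    {"id": "example:2", "data_object_type": "CheckM Statistics"},
]
migrated = [migrate_data_object(d) for d in docs]
```

After this runs, the first document carries the HQMQ type and the second is untouched. The function is idempotent, so re-running it on already-migrated data changes nothing.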

cc @hubin-keio

eecavanna commented 7 months ago

Here's a link to documentation about creating a migrator.

I will mock up a migrator specific to the schema changes in https://github.com/microbiomedata/nmdc-schema/pull/1791 now, which @chienchi can use as a reference.

eecavanna commented 7 months ago

I drafted this migrator, which y'all can use as a starting point: https://github.com/microbiomedata/nmdc-schema/pull/1837

There are three TODO items in it:

  1. Update the initial schema version identifier (in the code and in the file name)
  2. Update the final schema version identifier (in the code and in the file name)
  3. Verify that the data_object_set collection is, indeed, the only collection requiring migration for this schema change
    • This is something I don't have much experience with. I have a lot of experience working with the migration framework, but very little experience translating schema changes into migration requirements. I think @brynnz22 is the person on our team who has the most experience with that.
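
For the third item, one possible sanity check is to scan a database dump for the old type string and list the collections where it appears. This is only a sketch under assumptions: `database` is a hypothetical dict mapping collection names to lists of documents (for example, loaded from a JSON export), and neither function is part of the migration framework.

```python
def contains_value(node, target):
    """Recursively check whether `target` appears as a value anywhere in `node`."""
    if isinstance(node, dict):
        return any(contains_value(v, target) for v in node.values())
    if isinstance(node, list):
        return any(contains_value(v, target) for v in node)
    return node == target


def collections_using(database, target):
    """Return the sorted names of collections holding at least one document with `target`."""
    return sorted(
        name
        for name, docs in database.items()
        if any(contains_value(doc, target) for doc in docs)
    )


# Hypothetical dump, for illustration only.
database = {
    "data_object_set": [
        {"id": "example:1", "data_object_type": "Metagenome Bins Compression File"}
    ],
    "workflow_execution_set": [{"id": "example:2", "has_output": ["example:1"]}],
}
affected = collections_using(database, "Metagenome Bins Compression File")
```

If `affected` contains anything other than `data_object_set`, the migrator would likely need to cover those collections as well. This only catches exact string matches, so it complements a review of the schema diff and does not replace one.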

As a reminder, I will be out Friday.

chienchi commented 7 months ago

Thank you, @eecavanna. I have checked and updated the three TODO items. I think this is the only data_object that needs to be migrated. @aclum or @brynnz22, could you help confirm?