microbiomedata / metaMAGs

Workflow for metagenome assembled genomes generation.

add eukcc component to binning workflow #28

Closed: aclum closed this issue 4 months ago

aclum commented 5 months ago

JGI's eukcc database has been copied to /refdata/eukcc_db/eukcc2_db_ver_1.2

IMG taxon OIDs that can be used for testing are listed below. These are large assemblies, but they are known to produce euk bins. Alternatively, an existing metagenome can be spiked with a small eukaryote for testing purposes.

3300067032 3300059473 3300059591 - this project produces just 1 euk bin: Eukaryota; Ascomycota; Dothideomycetes; Pleosporales; Phaeosphaeriaceae; Parastagonospora

chienchi commented 5 months ago

The package task will add eukcc.csv.final to the final LQ_bin.zip file, so the file will be part of this enum. In this case, we don't need to create a new file type enum, right?

example eukcc.csv.final file:

| bin | completeness | contamination | ncbi_lineage_taxIDs | ncbi_lineage |
| --- | --- | --- | --- | --- |
| bins.22.fa | 0.0 | 0.0 | 1-131567-2759-554915-2605435-142796-33680-137627-5789-1115744 | root,cellular organisms,Eukaryota,Amoebozoa,Evosea,Eumycetozoa,Myxogastria,Myxogastromycetidae,Physariida,Physaraceae |
| bins.10.fa | 0.9 | 0.0 | 1-131567-2759-554915-2605435-142796-33083-2058181 | root,cellular organisms,Eukaryota,Amoebozoa,Evosea,Eumycetozoa,Dictyostelia,Acytosteliales |
| bins.19.fa | 0.84 | 0.0 | 1-131567-2759-2698737-33634-4762-4776-4777 | root,cellular organisms,Eukaryota,Sar,Stramenopiles,Oomycota,Peronosporales,Peronosporaceae |

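For anyone who wants to sanity-check a file like the example above, here is a minimal Python sketch of reading it. It assumes the file is tab-separated despite the `.csv` in its name (the commas inside `ncbi_lineage` suggest this, but it is not confirmed in this thread), that taxIDs are joined with `-`, and that lineage names are joined with `,`. The function name `parse_eukcc` is hypothetical and is not part of the workflow.

```python
import csv
import io


def parse_eukcc(handle):
    """Parse a tab-separated EukCC summary (assumed layout) into a list of dicts."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "bin": rec["bin"],
            "completeness": float(rec["completeness"]),
            "contamination": float(rec["contamination"]),
            # Assumption: taxIDs are '-'-joined and names are ','-joined.
            "taxids": [int(t) for t in rec["ncbi_lineage_taxIDs"].split("-")],
            "lineage": rec["ncbi_lineage"].split(","),
        })
    return rows


# One row copied from the example in this comment, written with explicit tabs.
SAMPLE = (
    "bin\tcompleteness\tcontamination\tncbi_lineage_taxIDs\tncbi_lineage\n"
    "bins.19.fa\t0.84\t0.0\t1-131567-2759-2698737-33634-4762-4776-4777\t"
    "root,cellular organisms,Eukaryota,Sar,Stramenopiles,Oomycota,"
    "Peronosporales,Peronosporaceae\n"
)

rows = parse_eukcc(io.StringIO(SAMPLE))
print(rows[0]["bin"], rows[0]["lineage"][-1])
```

Splitting the lineage into parallel lists keeps each taxID aligned with its name, which may be useful if taxonomy is later stored in the schema.
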
aclum commented 4 months ago

That should be okay for now, as long as it is included in the LQ_bin.zip file, similar to how the mbin.sdb file is included in the medium/high quality compressed file. Longer term, we need to think about how to store taxonomy information in the schema, because taxonomy search is a requested feature for the data portal.
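
The packaging expectation discussed in this thread (eukcc.csv.final shipped inside LQ_bin.zip) could be checked with a short Python sketch like the one below. The helper name `zip_has_member` is hypothetical, and the archive here is built in memory purely for illustration. The real layout of LQ_bin.zip, such as whether files sit in a subdirectory, is an assumption, so the check compares base names only.

```python
import io
import zipfile


def zip_has_member(zip_source, member="eukcc.csv.final"):
    """Return True if any entry in the archive has the given base name."""
    with zipfile.ZipFile(zip_source) as zf:
        return any(name.rsplit("/", 1)[-1] == member for name in zf.namelist())


# Build a stand-in archive in memory; contents are illustrative only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("bins.22.fa", ">contig_1\nACGT\n")
    zf.writestr("eukcc.csv.final", "bin\tcompleteness\tcontamination\n")

print(zip_has_member(buf))
```

In practice the same function could be pointed at a path such as `LQ_bin.zip` instead of the in-memory buffer.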