medema-group / bigslice

A highly scalable, user-interactive tool for the large scale analysis of Biosynthetic Gene Clusters data
GNU Affero General Public License v3.0

Cases where the taxonomy information of the species of interest is only partially available #39

Closed jinnjy closed 2 years ago

jinnjy commented 3 years ago

I am preparing the files according to the example input folder: https://github.com/medema-group/bigslice/blob/master/misc/input_folder_template/taxonomy/dataset_1_taxonomy.tsv

I used GTDB-Tk 1.5 as my taxonomy assignment tool and encountered some cases in which GTDB-Tk could not assign the lower ranks (genus and species, and sometimes order and family):

GCA_010156995.1_ASM1015699v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__
GCA_010672345.1_ASM1067234v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Elainellales;f__Elainellaceae;g__;s__
GCA_010672835.1_ASM1067283v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__

Do you have any suggestions for preparing the taxonomy files in these cases? Thank you.

tamuanand commented 3 years ago

My suggestion would be to post-process the above and fill in each empty rank with the name of the last known rank, plus an "_unknown" suffix. For example,

c__Cyanobacteriia;o__Elainellales;f__Elainellaceae;g__;s__

will become

c__Cyanobacteriia;o__Elainellales;f__Elainellaceae;g__Elainellaceae_unknown;s__Elainellaceae_unknown

and for this

c__Cyanobacteriia;o__;f__;g__;s__

it will become

c__Cyanobacteriia;o__Cyanobacteriia_unknown;f__Cyanobacteriia_unknown;g__Cyanobacteriia_unknown;s__Cyanobacteriia_unknown
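
The post-processing described above could be scripted along the following lines. This is only a minimal sketch, not part of BiG-SLiCE or GTDB-Tk: the function name `fill_unknown_ranks` and the `suffix` parameter are made up for illustration, and it assumes GTDB-style lineage strings of the form `d__...;p__...;...;s__...`. It takes "last known" to mean the nearest named rank above the empty one (the class Cyanobacteriia in the second example).

```python
def fill_unknown_ranks(lineage: str, suffix: str = "_unknown") -> str:
    """Fill empty GTDB-style ranks (e.g. 'g__') with the nearest named
    higher rank plus a suffix. Hypothetical helper, for illustration only."""
    last_known = None
    filled = []
    for rank in lineage.strip().split(";"):
        prefix, sep, name = rank.partition("__")
        if not sep:
            # Not a 'x__Name' token; keep it unchanged.
            filled.append(rank)
            continue
        if name:
            # Remember the most recent rank that actually has a name.
            last_known = name
        elif last_known is not None:
            # Empty rank: reuse the last named rank and mark it as unknown.
            name = last_known + suffix
        filled.append(f"{prefix}__{name}")
    return ";".join(filled)


if __name__ == "__main__":
    examples = [
        "d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Elainellales;f__Elainellaceae;g__;s__",
        "d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__",
    ]
    for lineage in examples:
        print(fill_unknown_ranks(lineage))
```

This handles the lineage string only; writing the filled values into the taxonomy .tsv in the layout of the input folder template is left out.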