qiyunlab / binarena

BinaRena: Interactive Visualization and Binning of Metagenomic Contigs
BSD 3-Clause "New" or "Revised" License

Support for Greengenes taxonomy format #44

Open qiyunzhu opened 2 years ago

qiyunzhu commented 2 years ago

The Greengenes-style taxonomic lineage file format is widely used in microbiome research, including by tools such as QIIME 2, GTDB-Tk, and MetaPhlAn. It would be good to let the user append taxonomic annotations of contigs by dragging and dropping a taxonomy file into the BinaRena window after the main data (e.g., the assembly files) are already loaded.

A Greengenes-style file looks like this:

G000712055  k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__Ruminococcus sp. HUN007
G001794515  k__Bacteria; p__Candidatus Yanofskybacteria; c__; o__; f__; g__; s__Candidatus Yanofskybacteria bacterium RIFCSPHIGHO2_02_FULL_46_19
G000257665  k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Micrococcales; f__Microbacteriaceae; g__Candidatus Aquiluna; s__Candidatus Aquiluna sp. IMCC13023
G000429005  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Novosphingobium; s__Novosphingobium acidiphilum
G000166695  k__Bacteria; p__Firmicutes; c__Clostridia; o__Thermoanaerobacterales; f__Thermoanaerobacterales Family III. Incertae Sedis; g__Caldicellulosiruptor; s__Caldicellulosiruptor kristjanssonii
G900100945  k__Bacteria; p__Bacteroidetes; c__Sphingobacteriia; o__Sphingobacteriales; f__Sphingobacteriaceae; g__Mucilaginibacter; s__Mucilaginibacter gossypii
G000717345  k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Streptomycetales; f__Streptomycetaceae; g__Streptomyces; s__Streptomyces lydicus
G001298505  k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Corynebacteriales; f__Corynebacteriaceae; g__Corynebacterium; s__Corynebacterium pseudotuberculosis
G001439295  k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Cellvibrionales; f__Porticoccaceae; g__; s__SAR92 bacterium BACL16 MAG-120322-bin99
G000710465  k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Streptomycetales; f__Streptomycetaceae; g__Kitasatospora; s__Kitasatospora sp. MBT63

The goal is to automatically extract the taxon at each of the seven ranks (kingdom, phylum, class, order, family, genus, and species) and put each rank into its own categorical column.

In some files, a domain appears before or in place of kingdom, and/or a strain appears after species.
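To make the intended behavior concrete, here is a minimal Python sketch of the parsing step (BinaRena itself is not written in Python, so this is only an illustration, not the planned implementation). It assumes the contig ID is the first whitespace-delimited token, that ranks are separated by semicolons, and that each rank carries a one-letter prefix followed by two underscores. The `d__` (domain) and `t__` (strain) prefixes and the names `RANKS`, `parse_lineage`, and `parse_line` are assumptions made for this example.

```python
# Rank prefixes assumed for illustration; "d" (domain) and "t" (strain)
# are optional extras that may or may not appear in a given file.
RANKS = {
    "d": "domain", "k": "kingdom", "p": "phylum", "c": "class",
    "o": "order", "f": "family", "g": "genus", "s": "species",
    "t": "strain",
}


def parse_lineage(lineage):
    """Turn a lineage string into a {rank name: taxon} dict.

    Ranks with an empty taxon (e.g. "g__") are skipped.
    """
    taxa = {}
    for field in lineage.split(";"):
        code, sep, name = field.strip().partition("__")
        name = name.strip()
        if sep and code in RANKS and name:
            taxa[RANKS[code]] = name
    return taxa


def parse_line(line):
    """Split one line into (contig ID, rank dict)."""
    parts = line.strip().split(None, 1)  # split on the first run of whitespace
    contig_id = parts[0]
    lineage = parts[1] if len(parts) > 1 else ""
    return contig_id, parse_lineage(lineage)


if __name__ == "__main__":
    example = ("G001439295  k__Bacteria; p__Proteobacteria; "
               "c__Gammaproteobacteria; o__Cellvibrionales; "
               "f__Porticoccaceae; g__; "
               "s__SAR92 bacterium BACL16 MAG-120322-bin99")
    contig_id, taxa = parse_line(example)
    print(contig_id, taxa)
```

Each key of the returned dict would map to one categorical column. A rank that is empty in the file is left out of the dict, so the corresponding cell can stay blank for that contig.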

@pavia27 can comment on the adoption of this format.

AbhinavChede commented 2 years ago

Hi @qiyunzhu ,

Is the first column in the Greengenes taxonomy format the contig ID?

qiyunzhu commented 2 years ago

@AbhinavChede Yes!