qunfengdong / BLCA


BLCA output table is not always consistently formatted (makes downstream parsing challenging) #22

Closed wolfgangrumpf closed 5 years ago

wolfgangrumpf commented 5 years ago

I'm using BLCA as part of a 16S identification pipeline, and I noticed some discrepancies in the way the table output is formatted. From my output .tab:

177  superkingdom  Bacteria  100.0  phylum  Firmicutes      100.0  class  Clostridia           100.0  order  Clostridiales       100.0  family  Not                  Available  70.5   genus                     Colidextribacter      70.5     species                   Colidextribacter   massiliensis       70.5

40   superkingdom  Bacteria  100.0  phylum  Firmicutes      100.0  class  Bacilli              100.0  order  Lactobacillales     100.0  family  Lactobacillaceae     100.0      genus  Lactobacillus             100.0                 species  Lactobacillus             gasseri            100.0

Note that the family "Not Available" in the first entry takes up two fields. If I parse this with awk, that column is sometimes my family name and sometimes the word "Not", in which case the next column holds "Available" instead of the score, bumping the score one column to the right. As a result, not all rows have the same number of columns, and the information is not easily parsable.
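One way to work around the shifting columns described above is to anchor on the rank keywords rather than on column positions. The following is a minimal sketch, assuming the rows are whitespace-separated exactly as pasted above and that no taxon name is itself one of the rank keywords; `RANKS` and `parse_whitespace_row` are hypothetical names for illustration, not part of BLCA.

```python
# Rank keywords as they appear in the rows pasted above.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def parse_whitespace_row(line):
    """Parse one whitespace-separated row into (read_id, {rank: (name, score)}).

    Assumption: each rank keyword is followed by one or more name tokens
    and then a single numeric confidence score.
    """
    tokens = line.split()
    read_id, tokens = tokens[0], tokens[1:]

    # Positions where a rank keyword opens a new group of tokens.
    starts = [i for i, tok in enumerate(tokens) if tok in RANKS]

    result = {}
    for pos, start in enumerate(starts):
        end = starts[pos + 1] if pos + 1 < len(starts) else len(tokens)
        group = tokens[start:end]
        rank = group[0]
        name = " ".join(group[1:-1])   # rejoin multi-word names
        score = float(group[-1])       # last token of the group is the score
        result[rank] = (name, score)
    return read_id, result


row = (
    "177 superkingdom Bacteria 100.0 phylum Firmicutes 100.0 "
    "class Clostridia 100.0 order Clostridiales 100.0 "
    "family Not Available 70.5 genus Colidextribacter 70.5 "
    "species Colidextribacter massiliensis 70.5"
)
read_id, taxonomy = parse_whitespace_row(row)
print(read_id, taxonomy["family"], taxonomy["species"])
```

Because each name is rebuilt from everything between the rank keyword and the score, "Not Available" and two-word species names end up in a single field regardless of how many tokens they span.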

yingeddi2008 commented 5 years ago

Hi Wolfgangrumpf,

Thanks for incorporating BLCA into your pipeline and for your continued feedback.

However, I am a bit puzzled by your concern. The default output of BLCA should be ";" and ":" delimited, hence the space in "Not Available" should not be a problem. I will update the program so that the space between "Not Available" is removed.

Eddi
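If the output is split on ";" and ":" as described in the reply above, spaces inside a name no longer matter. The sketch below assumes a row layout of a read ID, a tab, and then alternating `rank:name` and score fields separated by semicolons; that exact layout is not shown in this thread, so treat it as an assumption. `parse_delimited_row` is a hypothetical helper name.

```python
def parse_delimited_row(line):
    """Parse an assumed 'id<TAB>rank:name;score;rank:name;score;...' row.

    Returns (read_id, {rank: (name, score)}).
    """
    read_id, _, annotation = line.rstrip("\n").partition("\t")

    # Drop empty fields, e.g. the one produced by a trailing ';'.
    fields = [f for f in annotation.split(";") if f]

    result = {}
    # Fields are assumed to alternate: "rank:name", then its score.
    for label, score in zip(fields[0::2], fields[1::2]):
        rank, _, name = label.partition(":")
        result[rank] = (name, float(score))
    return read_id, result


example = (
    "177\tsuperkingdom:Bacteria;100.0;family:Not Available;70.5;"
    "species:Colidextribacter massiliensis;70.5;"
)
read_id, taxonomy = parse_delimited_row(example)
print(read_id, taxonomy["family"])
```

Splitting on the explicit delimiters keeps every row at the same number of fields, which is the property the awk-based approach was missing.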

wolfgangrumpf commented 5 years ago

Thanks, Eddi! Actually just replacing “Not Available” with “NA” would be awesome, as it would support more downstream analysis (e.g. processing in R) without having to do additional parsing!

Thanks for all your help!

Cheers,

Wolfgang Rumpf, Ph.D.
Bioinformatics Analyst, The Institute for Genomic Medicine at The Abigail Wexner Research Institute, Nationwide Children’s Hospital
Professor, University of Maryland Global Campus

yingeddi2008 commented 5 years ago

Great suggestion! Thanks! I have updated the repo; please pull to get the latest version!