jodyphelan / tbdb

Standard database for the TBProfiler tool
GNU Lesser General Public License v3.0

errors in barcode db? #28

Closed 0xaf1f closed 3 years ago

0xaf1f commented 4 years ago

[apologies -- I first posted this in the tb-profiler repository and it better belongs here]

I was reading https://doi.org/10.1186/s13073-020-00726-5 and, in it, the authors mentioned discovering some errors in the Coll & co. barcode:

Where possible, we compared the resulting 539 mutations with the typing scheme by Coll et al., the most widely used method to stratify WGS data for MTBC [9]. Of the 89 SNPs that were in common, 82 results were in agreement. Subsequent personal communications with Francesc Coll revealed that the seven discrepancies were due to errors in his study and were eventually resolved (Additional file 3: Table S3).

The seven mutations are listed below (found by searching the "Coll et al - natcomm2014" column of the referenced table for "(solved)"):

ref pos gene group_subst lineage_marker Coll et al - natcomm2014
G 9304 Rv0006 G668D (ggc/gAc) NOT 4.7 4.8 4.9 4 4.7,4.8,4.9 (solved)
T 763031 Rv0667 A1075A (gct/gcC) NOT L4 4 (solved)
C 2154724 Rv1908c R463L (cgg/cTg) NOT L4 4 (solved)
A 4243346 Rv3794 Q38Q (caa/caG) 2.2.1 (sub-group) 4 sub-clade (solved)
C 4244220 Rv3794 L330L (ctg/Ttg) L5 L6 "animal strains" 1.2.1 (solved)
G 4326676 Rv3854c S266R (agc/agG) 2.2.1 (sub-group) 2.1.1. sub-clade (solved)
T 4407588 Rv3919c A205A (gca/gcG) NOT L4 4 (solved)

However, I checked for a couple of these positions in https://github.com/jodyphelan/TBProfiler/blob/master/db/tbdb.barcode.bed (and also in https://github.com/jodyphelan/tbdb/blob/master/barcode.bed ) and didn't find them at all
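For anyone who wants to repeat this check, here is a minimal sketch of how the lookup could be scripted. It assumes only that barcode.bed is tab-separated with the standard BED layout in the first three columns (chromosome, 0-based start, exclusive end); the function name `positions_in_bed` and the local file path are hypothetical, and the remaining columns are passed through untouched.

```python
from typing import Dict, Iterable, List

# 1-based genome positions of the seven discrepant SNPs listed above.
DISCREPANT_POSITIONS = [9304, 763031, 2154724, 4243346, 4244220, 4326676, 4407588]


def positions_in_bed(bed_lines: Iterable[str], positions: Iterable[int]) -> Dict[int, List[str]]:
    """Map each 1-based position to the BED lines whose interval covers it.

    Assumes standard BED coordinates: column 2 is a 0-based start and
    column 3 is an exclusive end, so a 1-based position p is covered
    when start < p <= end.
    """
    hits: Dict[int, List[str]] = {p: [] for p in positions}
    for raw in bed_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a usable BED record
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # e.g. a header row with non-numeric coordinates
        for pos in hits:
            if start < pos <= end:
                hits[pos].append(line)
    return hits


if __name__ == "__main__":
    # Hypothetical path: a local copy of barcode.bed from this repository.
    with open("barcode.bed") as handle:
        for pos, lines in positions_in_bed(handle, DISCREPANT_POSITIONS).items():
            print(pos, "found" if lines else "not found")
```

A position reported as "not found" only means that no interval in the file covers it; it says nothing about why the marker is absent.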

jodyphelan commented 4 years ago

Hi @0xaf1f

Thanks for letting me know; I was not aware of that publication. It looks as though all the discrepant mutations are located in drug resistance genes. I filtered the original list from the Coll publication to remove most of the drug resistance genes when creating barcode.bed, so TB-Profiler is not using those mutations for the lineage predictions. Hope that clears it up.

Jody

0xaf1f commented 4 years ago

Yes, the whole subject of that paper is lineage markers in drug-resistance genes, so it's not necessarily the case that anything in there should be discarded.

jodyphelan commented 4 years ago

Right! There is quite a bit of redundancy in the full list so I don't think that filtering out a couple of mutations will impact the predictions for most of the samples. We are actually currently looking at revising the barcode so I will be updating this in the next few weeks.