jodyphelan / tbdb

Standard database for the TBProfiler tool
GNU Lesser General Public License v3.0

barcode.bed does not match https://doi.org/10.1186/s13073-020-00817-3 #33

Closed: jemunro closed this issue 3 years ago

jemunro commented 3 years ago

Hello,

I have noticed that both the publication https://doi.org/10.1186/s13073-020-00817-3 and this repository state that the barcode now used by TB-Profiler is the 421-SNP barcode detailed in that publication. However, the barcode provided in this repository consists of 1048 SNPs. Can you please explain the difference between the two barcodes, and how the 1048-SNP barcode in particular was generated?

Thanks, Jacob

jodyphelan commented 3 years ago

Hi Jacob,

There is indeed a slight difference in the number of SNPs in the barcode. The barcode.bed file is actually a subset of https://github.com/GaryNapier/tb-lineages/blob/main/fst_results_clean_fst_1_for_paper.csv and contains more SNPs per lineage than the 421-SNP list from the paper (up to a maximum of 10 per lineage). Because calling SNPs at single positions is relatively computationally inexpensive, the number of SNPs was increased to reduce the chance of a failed lineage call due to low coverage or a large deletion. Additionally, the barcode contains SNPs for the L5/6 lineage scheme detailed in https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000477.
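To illustrate the idea of keeping up to 10 SNPs per lineage, here is a minimal Python sketch. It is not the script used to build barcode.bed. The function names, the `(position, lineage)` tuple layout, and the rule that a lineage is callable when at least one of its marker positions is covered are all assumptions made for illustration.

```python
from collections import Counter

def cap_snps_per_lineage(rows, max_per_lineage=10):
    """Keep at most max_per_lineage SNPs per lineage, preserving input order.

    rows: iterable of (position, lineage) tuples (hypothetical layout).
    """
    kept = []
    counts = Counter()
    for position, lineage in rows:
        if counts[lineage] < max_per_lineage:
            kept.append((position, lineage))
            counts[lineage] += 1
    return kept

def callable_lineages(barcode, covered_positions):
    """Return the lineages with at least one marker position covered.

    This is an assumed, simplified rule: it only shows why extra markers
    per lineage make a call more robust to low coverage or deletions.
    """
    covered = set(covered_positions)
    return {lineage for position, lineage in barcode if position in covered}

# Made-up candidates: 15 SNPs for lineage1, 1 SNP for lineage2.
candidates = [(i, "lineage1") for i in range(15)] + [(100, "lineage2")]
barcode = cap_snps_per_lineage(candidates, max_per_lineage=10)

# Only position 5 has coverage; lineage2's single marker is lost.
callable_now = callable_lineages(barcode, covered_positions=[5])
```

With these made-up inputs, lineage1 keeps 10 of its 15 candidate SNPs and remains callable from a single covered position, while lineage2, which has only one marker, does not.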

I hope that helps, let me know if you have any more questions.

Jody

jemunro commented 3 years ago

Hi Jody,

Thanks, that is very helpful.