jodyphelan / TBProfiler

Profiling tool for Mycobacterium tuberculosis to detect ressistance and strain type from WGS data
GNU General Public License v3.0
102 stars 42 forks source link

potential issue with lineage typing #289

Open rderelle opened 1 year ago

rderelle commented 1 year ago

Dear Jody,

I came across some strange results for a SRA sample (ERR4797559) which is supposed to be part of the lineage 1.2.2 according to the tb-profiler website and confirmed with a local installation of tb-profiler.

However, looking at the mutation profile of this sample, I found 9 out of the 10 snp barcodes for lineage 1.2.2.2 to be present and I did not understand why this sample wasn't labelled '1.2.2.2'.

I then had a look at the python code and I believe I found the issue (function 'barcode' of the file barcode.py; line 70):

    for l in barcode_support:
        # If stdev of fraction across all barcoding positions > 0.15
        # Only look at positions with >5 reads
        tmp_allelic_dp = [x[1]/(x[0]+x[1]) for x in barcode_support[l] if sum(x)>5]
        if len(tmp_allelic_dp)==0: continue
        if stdev(tmp_allelic_dp)>0.15: continue

In my case, the fractions for the lineage 1.2.2.2 snps (tmp_allelic_dp) are [1.0, 0.0, 1.0, 1.0, 0.9622641509433962, 1.0, 1.0, 1.0, 1.0, 0.9666666666666667] and the linage is being discarded because the standard deviation is above 0.15.

If my findings are correct, I'm wondering wether this threshold makes sense since it discards lineages with 9/10 of barcode snps found. May I ask what is the rational behind this threshold?

Also, I found this problem to be present in about 6% of all SRA samples I have tested (>2k samples).

Thanks and hope this helps! Romain

ps: I could share with you the BAM file generated by tb-profiler for this particular sample.

rderelle commented 1 year ago

There might be a quick fix: tmp_allelic_dp = [x[1]/(x[0]+x[1]) for x in barcode_support[l] if x[1]>5] and then add a threshold for a minimum number of barcodes.