jodyphelan / TBProfiler

Profiling tool for Mycobacterium tuberculosis to detect resistance and strain type from WGS data
GNU General Public License v3.0

Question about tbdb.barcode.bed coordinates #340

Open mariaelf97 opened 3 months ago

mariaelf97 commented 3 months ago

Hello!

It seems that some mutations in the tbdb.barcode.bed file are not based on H37Rv's coordinates. Could you clarify which ones use different reference coordinates?

Thank you

sachalau commented 2 months ago

Hi @mariaelf97, I'm looking at this file at the moment as well for my own needs. Can you clarify which mutations are not based on H37Rv? As I understood the file, column 5 is supposed to be the alternative allele, i.e., I don't think there should be a single instance where this value is equal to the reference allele.

I have a related question for @jodyphelan: what would be the output of the lineage classification if we (supposedly) sequenced the exact strain that was used for the reference genome?

I see in the BED file that some mutations are defined for 4.9 ("H37Rv-like"). However, these are not reference alleles but alternative alleles. So my guess is that the reference genome would be given no lineage classification? Or is there a default value encoded somewhere?

Thanks for your help

sachalau commented 2 months ago

As an update to my question: these 20 entries in the BED file are actually equal to the reference allele.

| | | | | | | | |
| -- | -- | -- | -- | -- | -- | -- | -- |
| Chromosome | 206480 | 206481 | lineage4 | C | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 311612 | 311613 | lineage4.9 | G | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 420007 | 420008 | lineage4.9 | A | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 498530 | 498531 | lineage4 | A | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 541200 | 541201 | lineage4.9 | A | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 546356 | 546357 | lineage4 | A | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 599867 | 599868 | lineage4 | A | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 662910 | 662911 | lineage4 | T | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 903912 | 903913 | lineage4.9 | T | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 931122 | 931123 | lineage4 | T | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 1250339 | 1250340 | lineage4 | A | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 1396921 | 1396922 | lineage4.9 | T | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 1759251 | 1759252 | lineage4.9 | G | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 1907295 | 1907296 | lineage4.9 | G | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 2022867 | 2022868 | lineage4.9 | T | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 2825465 | 2825466 | lineage4 | G | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 2994186 | 2994187 | lineage4 | T | Euro-American | LAM;T;S;X;H | None |
| Chromosome | 3367764 | 3367765 | lineage4.9 | G | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 3823158 | 3823159 | lineage4.9 | A | Euro-American (H37Rv-like) | T1 | None |
| Chromosome | 3830694 | 3830695 | lineage4 | A | Euro-American | LAM;T;S;X;H | None |

So my own concern is answered.
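
For anyone who wants to repeat this kind of check, here is a minimal Python sketch of how column 5 could be compared against the reference base. It rests on several assumptions: the file is tab-separated, columns 2 and 3 follow the standard BED convention (0-based start, exclusive end), and the reference is already loaded as a dict mapping chromosome name to sequence. The names `BarcodeEntry`, `parse_barcode_line` and `entries_matching_reference` are hypothetical helpers, not part of TBProfiler, and the toy reference and lineage labels are made up for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class BarcodeEntry(NamedTuple):
    chrom: str
    start: int    # 0-based start (assumes standard BED coordinates)
    end: int      # exclusive end
    lineage: str
    allele: str   # column 5 of the barcode file


def parse_barcode_line(line: str) -> BarcodeEntry:
    """Parse the first five tab-separated columns of one barcode line."""
    fields = line.rstrip("\n").split("\t")
    return BarcodeEntry(fields[0], int(fields[1]), int(fields[2]), fields[3], fields[4])


def entries_matching_reference(
    entries: Iterable[BarcodeEntry], reference: Dict[str, str]
) -> List[BarcodeEntry]:
    """Return the entries whose column-5 allele equals the reference base."""
    matches = []
    for entry in entries:
        ref_base = reference[entry.chrom][entry.start:entry.end]
        if ref_base.upper() == entry.allele.upper():
            matches.append(entry)
    return matches


# Toy data only: a six-base "genome" and two invented barcode lines.
toy_reference = {"Chromosome": "ACGTAC"}
toy_lines = [
    "Chromosome\t1\t2\tlineageA\tC",  # reference base at index 1 is C
    "Chromosome\t3\t4\tlineageB\tA",  # reference base at index 3 is T
]
toy_entries = [parse_barcode_line(line) for line in toy_lines]
print([e.lineage for e in entries_matching_reference(toy_entries, toy_reference)])
```

With a real genome, the `reference` dict would be filled from the H37Rv FASTA, keyed by whatever chromosome name the BED file uses.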

jodyphelan commented 2 months ago

Hi @mariaelf97 - the positions are actually from the H37Rv reference genome (there are several versions of the same genome but with different chromosome names).

@sachalau - yes, as you noticed, that column is not the reference allele but the allele that is expected for that particular lineage.
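
To make the consequence concrete, here is a simplified Python sketch of lineage matching under that reading of the column. It is an illustration only and not TBProfiler's actual implementation: `lineage_support` is a hypothetical function, positions are taken as 1-based from the third BED column (an assumption), and the `lineageX` row is invented. A sample supports a lineage at a position when its base equals the expected allele, so a sample identical to the reference would match exactly those entries whose expected allele is the reference base.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def lineage_support(
    barcode: Iterable[Tuple[int, str, str]],
    sample_bases: Dict[int, str],
) -> Dict[str, Tuple[int, int]]:
    """Count, per lineage, (positions matching the expected allele, positions covered).

    barcode: (position, lineage, expected_allele) tuples.
    sample_bases: observed base per position; missing positions are skipped.
    """
    counts = defaultdict(lambda: [0, 0])
    for position, lineage, expected in barcode:
        observed = sample_bases.get(position)
        if observed is None:
            continue  # no call at this position
        counts[lineage][1] += 1
        if observed == expected:
            counts[lineage][0] += 1
    return {lineage: (hit, total) for lineage, (hit, total) in counts.items()}


barcode = [
    (206481, "lineage4", "C"),    # row from the list above
    (311613, "lineage4.9", "G"),  # row from the list above
    (100, "lineageX", "A"),       # invented row, for contrast only
]

# A sample identical to the reference at these sites; the base at position
# 100 is made up and simply differs from the invented lineageX allele.
reference_like_sample = {206481: "C", 311613: "G", 100: "G"}

print(lineage_support(barcode, reference_like_sample))
```

In this toy case the reference-like sample supports lineage4 and lineage4.9 and nothing else, which is consistent with those entries being equal to the reference allele.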

mariaelf97 commented 1 month ago

@jodyphelan Thank you for your reply. I understand that since H37Rv is lineage 4.9, the alleles listed for the aforementioned positions are the REF base rather than the ALT. In that case, what would the ALT base be? Does it mean any base other than the reference base? My analyses of about 100 isolates show that the ALT base is fairly consistent across lineages, except in a few cases. I'd appreciate your thoughts on this.