jodyphelan / TBProfiler

Profiling tool for Mycobacterium tuberculosis to detect resistance and strain type from WGS data
GNU General Public License v3.0

Lineage family missing in text report #222

Open pmenzel opened 2 years ago

pmenzel commented 2 years ago

Dear Jody,

I noticed that in v4.2.0 the lineage family is missing in the text report:

Summary
-------
ID: xxx
Date: Sun Jul  3 14:15:29 2022
Strain: lineage2.2.1
Drug-resistance: Pre-XDR-TB
Median Depth: 110

Lineage report
--------------
Lineage       Estimated Fraction
lineage2      1.000
lineage2.2    0.999
lineage2.2.1  1.000

In v4.0.3 it looks like this:

Summary
-------
ID: xxx
Date: Sun Jul  3 14:12:31 2022
Strain: lineage2.2.1
Drug-resistance: Pre-XDR
Median Depth: 110

Lineage report
--------------
Lineage       Estimated Fraction  Family                Spoligotype    Rd
lineage2      1.000               East-Asian            Beijing        RD105
lineage2.2    0.999               East-Asian (Beijing)  Beijing-RD207  RD105;RD207
lineage2.2.1  1.000               East-Asian (Beijing)  Beijing-RD181  RD105;RD207;RD181

The v4.2.0 json file contains the information though:

"lineage": [
    {
      "lin": "lineage2",
      "family": "East-Asian",
      "spoligotype": "Beijing",
      "rd": "RD105",
      "frac": 1
    },
    {
      "lin": "lineage2.2",
      "family": "East-Asian (Beijing)",
      "spoligotype": "Beijing-RD207",
      "rd": "RD105;RD207",
      "frac": 0.9990439770554493
    },
    {
      "lin": "lineage2.2.1",
      "family": "East-Asian (Beijing)",
      "spoligotype": "Beijing-RD181",
      "rd": "RD105;RD207;RD181",
      "frac": 1
    }
  ],
  "main_lin": "lineage2",
  "sublin": "lineage2.2.1",

jodyphelan commented 2 years ago

Hi @pmenzel,

Thanks for letting me know about this. It does indeed look like this information is not included in the latest version. I'm hoping to release a new version soon and could put it back in.

The default spoligotype reported is based on associations of spoligotype with lineage (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4166679/bin/ncomms5812-s2.xlsx). So it just reports the spoligotype(s) that have been found in the predicted lineage.

The new --spoligotype option actually performs the spoligotyping by looking for the spacers in the reads. This produces the binary representation and the octal code. In the next release it is also planned to report the SIT number as seen in SITVIT2. See below for an example. It may be slightly confusing if there are two family and spoligotype reports, so in this case it may be better to add the spoligotype as a separate section.

ID: por5A1
Date: Fri Jul  8 09:59:22 2022
Strain: lineage4.3.4.2
Drug-resistance: MDR-TB
Median Depth: 64

Lineage report
--------------
Lineage         Estimated Fraction
lineage4        1.000
lineage4.3      1.000
lineage4.3.4    1.000
lineage4.3.4.2  0.998

Spoligotype report
------------------
Binary: 1111111111111111111100001111111100001000000
Octal: 777777607760400
Family: LAM4
SIT: 1106
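
For readers unfamiliar with the two codes in the spoligotype report: the octal line is a compact form of the 43-character binary line. A minimal sketch of the conventional conversion follows, assuming the usual scheme in which the first 42 spacers are read as 14 three-bit groups and the 43rd spacer is appended as a single digit. The function name is hypothetical, and this is not necessarily how TBProfiler computes the code internally.

```python
def spoligotype_binary_to_octal(binary: str) -> str:
    """Convert a 43-character spacer presence/absence string to a 15-digit octal code.

    Hypothetical helper for illustration; not part of TBProfiler.
    """
    if len(binary) != 43 or set(binary) - {"0", "1"}:
        raise ValueError("expected a string of exactly 43 characters, each 0 or 1")
    # The first 42 spacers are read as 14 groups of three bits, one octal digit each.
    digits = [str(int(binary[i:i + 3], 2)) for i in range(0, 42, 3)]
    # The 43rd spacer has no partners, so it is appended as a single 0 or 1.
    digits.append(binary[42])
    return "".join(digits)


# Binary string taken from the example report above.
print(spoligotype_binary_to_octal("1111111111111111111100001111111100001000000"))
# prints 777777607760400
```

Applied to the binary string in the example report, this reproduces the octal code shown there.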

pmenzel commented 2 years ago

Just noticed that the .results.csv also does not include the family info.

jodyphelan commented 2 years ago

Thanks for letting me know. For the purpose of reporting, would you rather have the family reported from both methods (SNP scheme and spoligotype), or just from the actual in silico spoligotyping?

pmenzel commented 2 years ago

Hm, I haven't gotten into the spoligotyping topic yet.

For me, it's enough to have the family name (East-Asian (Beijing), etc.) somewhere, no matter how it was done.

In v4.2.0 the json file still contains the family, so I can fetch it from there for generating a report.
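
A minimal sketch of that workaround, assuming only the keys visible in the json excerpt earlier in this thread ("lineage", "lin", "frac", "family", "spoligotype", "rd"). The function name is hypothetical, and the sample below is copied from that excerpt rather than read from a real results file.

```python
import json


def format_lineage_report(results: dict) -> str:
    """Rebuild a v4.0.3-style lineage table from the parsed results json.

    Hypothetical helper for illustration; not part of TBProfiler.
    """
    lines = ["Lineage\tEstimated Fraction\tFamily\tSpoligotype\tRd"]
    for entry in results.get("lineage", []):
        lines.append(
            "\t".join([
                entry["lin"],
                f"{entry['frac']:.3f}",         # three decimals, as in the text report
                entry.get("family", ""),        # tolerate a missing key
                entry.get("spoligotype", ""),
                entry.get("rd", ""),
            ])
        )
    return "\n".join(lines)


# Sample copied from the json excerpt above; a real run would json.load() the results file.
sample = json.loads("""
{
  "lineage": [
    {"lin": "lineage2", "family": "East-Asian", "spoligotype": "Beijing",
     "rd": "RD105", "frac": 1},
    {"lin": "lineage2.2", "family": "East-Asian (Beijing)", "spoligotype": "Beijing-RD207",
     "rd": "RD105;RD207", "frac": 0.9990439770554493},
    {"lin": "lineage2.2.1", "family": "East-Asian (Beijing)", "spoligotype": "Beijing-RD181",
     "rd": "RD105;RD207;RD181", "frac": 1}
  ]
}
""")

print(format_lineage_report(sample))
```

The output is tab-separated, so it can be pasted into a report or loaded as a table.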

pmenzel commented 2 years ago

(I just opened the issue, in case the omission of the family in the text report was due to a bug)

jodyphelan commented 2 years ago

Ok, what I'll do is to add it back into the report in the next version.

jodyphelan commented 2 years ago

Just letting you know that with v4.3.0, the spoligotype information has now been added back into the lineage report. Additionally, with the new spoligotyping function, it will perform spoligotyping and report the family and SIT information from SITVIT2.

Lineage report
--------------
Lineage         Estimated Fraction  Spoligotype      Rd
lineage4        1.000               LAM;T;S;X;H      None
lineage4.3      1.000               mainly-LAM       None
lineage4.3.4    1.000               LAM              RD174
lineage4.3.4.2  0.998               LAM1;LAM4;LAM11  RD174

Spoligotype report
------------------
Binary: 1111111111111111111100001111111100001000000
Octal: 777777607760400
Family: LAM4
SIT: 1106