jodyphelan / TBProfiler

Profiling tool for Mycobacterium tuberculosis to detect resistance and strain type from WGS data
GNU General Public License v3.0
102 stars 42 forks

lineage fields on tb_profiler output? #302

Open robertwhbaldwin opened 9 months ago

robertwhbaldwin commented 9 months ago

Will someone please explain what the "main_lin", "sub_lin" and "lin" fields mean in the output? Is it possible to have a case where the main_lin and sub_lin fields are empty but the lin fields are being reported? Thanks - Robert

jodyphelan commented 9 months ago

Hi @robertwhbaldwin

Are you using the json outputs?

Check out these publications for a description of the lineage system: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00817-3 https://www.nature.com/articles/ncomms5812

In a nutshell, samples can be classified as lineage 1-9. Within each of these you can assign a higher-resolution level by adding extra digits. For example, a sample can be classified as L4 as the main lineage, and within that lineage it can be further classified as sublineage L4.3.4.2. Each level of lineage designation has specific SNPs associated with it, and this is how tb-profiler assigns lineage. In the example below, tb-profiler finds SNPs specific to L4, L4.3, L4.3.4 and L4.3.4.2, so it condenses this down to reporting the main lineage as L4 and the sublineage as L4.3.4.2.

  "lineage": [
    {
      "lin": "lineage4",
      "family": "Euro-American",
      "spoligotype": "LAM;T;S;X;H",
      "rd": "None",
      "frac": 1
    },
    {
      "lin": "lineage4.3",
      "family": "Euro-American (LAM)",
      "spoligotype": "mainly-LAM",
      "rd": "None",
      "frac": 1
    },
    {
      "lin": "lineage4.3.4",
      "family": "Euro-American (LAM)",
      "spoligotype": "LAM",
      "rd": "RD174",
      "frac": 1
    },
    {
      "lin": "lineage4.3.4.2",
      "family": "Euro-American (LAM)",
      "spoligotype": "LAM1;LAM4;LAM11",
      "rd": "RD174",
      "frac": 0.9984802431610942
    }
  ],
  "main_lin": "lineage4",
  "sublin": "lineage4.3.4.2",

It is possible for the main_lin and sublin fields to be empty if the pipeline can't resolve all the lineages it found. For example, if it found SNPs for L4.3.4.2 but not for all of the levels above it (e.g. L4, L4.3, L4.3.4), then it won't be able to resolve the lineages into main_lin and sublin.
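
To make the condensing step concrete, here is a rough Python sketch of the idea. It is not tb-profiler's actual code: the function name `condense_lineages` is hypothetical, and it assumes that "empty" means an empty string and that a call is only resolved when every level from the main lineage down to the deepest sublineage is present, with no calls outside that chain.

```python
def condense_lineages(calls):
    """Return (main_lin, sublin) for a list of lineage entries.

    Illustrative sketch only, not tb-profiler's implementation.
    Each entry is a dict with a "lin" key, as in the "lineage" list above.
    Returns ("", "") when the calls do not form one complete chain.
    """
    names = {call["lin"] for call in calls}
    if not names:
        return "", ""

    # The deepest call is the one with the most dot-separated levels.
    deepest = max(names, key=lambda name: name.count("."))
    parts = deepest.split(".")

    # Every ancestor of the deepest call, e.g. lineage4, lineage4.3, ...
    chain = {".".join(parts[:i]) for i in range(1, len(parts) + 1)}

    # Assumption: a missing level or a call outside the chain is unresolved.
    if names != chain:
        return "", ""
    return parts[0], deepest


complete = [
    {"lin": "lineage4"},
    {"lin": "lineage4.3"},
    {"lin": "lineage4.3.4"},
    {"lin": "lineage4.3.4.2"},
]
print(condense_lineages(complete))                     # ('lineage4', 'lineage4.3.4.2')
print(condense_lineages([{"lin": "lineage4.3.4.2"}]))  # ('', '')
```

The first call mirrors the JSON example above, where all four levels are found. The second mirrors the unresolved case, where only the L4.3.4.2 SNPs are found and both fields come back empty.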