mbhall88 / tbpore

Mycobacterium tuberculosis genomic analysis from Nanopore sequencing data
MIT License
11 stars 2 forks source link

Create subcommand `tbpore cluster` #25

Closed leoisl closed 2 years ago

leoisl commented 2 years ago

Command: tbpore cluster <list_of_consensus> -o <outdir>

Will run psdm and generate the clusters as in https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/notebooks/clustering.py.ipynb . No need for cluster visualisation, just the identifiers of the samples that are clustered, like comma-sep lists.

Another option is based on this excerpt from slack:

I was basically thinking that in the docs we explain that after a tbpore run you do the clustering by running psdm between the previous consensus sequences (all in one file) and the new consensus. You would need to use psdm’s long-form output (-l) and you can append this long-form distances to an existing file of all previous distances. Then after this you append the new consensus sequence to the previous ones.

Then the user I think would need to run a couple of cat commands to update some files, psdm to update the matrix and a script we provide to build the clusters from the psdm matrix. As this is some commands, it seems it would be easier for them to have this tbpore cluster command.

The current tbpore execution becomes command tbpore process

mbhall88 commented 2 years ago

As mentioned in Slack, sorry but https://github.com/mbhall88/head_to_head_pipeline/blob/bcbc84971342a26cd0a9f0ad8df4f01dcf35c01c/analysis/transmission_clustering/eda/clustering.ipynb is probably the most up-to-date notebook for the clustering viz. I think the core logic is the same, I just changed some aesthetics maybe, but better to be safe.

In terms of converting the graph to a text format, I guess one of the formats in https://networkx.org/documentation/stable/reference/convert.html would be fine.

leoisl commented 2 years ago

Clustering is implemented in branch cluster in my fork , and the current clustering output is a text file with the following format:

Cluster #1: mada_1-8, mada_115, mada_142, mada_1-11, mada_125, mada_1-28
Cluster #2: mada_1-25, mada_2-25
Cluster #3: mada_1-3, mada_1-7
Cluster #4: mada_1-41, mada_2-31

I am supposing this is just for human consumption, so this looks ok? Should we have a machine consumption format too?

iqbal-lab commented 2 years ago

this is perfect! You could make it tsv, and then is machine consumable?

leoisl commented 2 years ago

Ok!

Clusters are now also being created in the pipeline. I changed the pipeline to get samples from all the 3 sites, instead of just Madagascar. I ended up with 151 samples, which are the same used in the H2H pipeline.

The issue is that we got the exactly same psdm matrix:

$ diff -rqs /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv
Files /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv and /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv are identical

But not the same clusters...

These are the clusters you got in the paper:

image

But I am confused as I don't see 151 datapoints in each the subpanels. Anyway I think I am doing sth wrong. Especially because for clustering threshold=6, in tbpore cluster we have 124 clusters, with 111 being singleton clusters and 13 being the following non-singletons:

Cluster #1:     17_616156       17_616026
Cluster #2:     R28182  R20260  R18040  R27937  R21408  R20574  R18043  R27725  R22601  R26791
Cluster #3:     R20983  R21770
Cluster #4:     R29816  R36431  R24120
Cluster #5:     R30078  R27657  R28012
Cluster #6:     R31095  R30234
Cluster #7:     mada_1-28       mada_142        mada_125        mada_115        mada_1-8        mada_1-11
Cluster #8:     mada_2-25       mada_1-25
Cluster #9:     mada_1-7        mada_1-3
Cluster #10:    mada_1-41       mada_2-31
Cluster #11:    mada_2-46       mada_1-46
Cluster #12:    mada_2-53       mada_1-53
Cluster #13:    mada_152        mada_132

Just looking at the number of non-singletons clusters and their size, it does not match the clustering at the top right panel of the figure... e.g. the largest tbpore cluster is cluster#2, which has size 10, but there is no cluster with size 10 on the panel. I am wondering what did I do wrong, could you please help me @mbhall88 ? Clustering script is here: https://github.com/leoisl/tbpore/blob/cluster/tbpore/clustering.py

FYI, these are the clusters with threshold=12. The 3 first largest clusters have size 10, 8 and 7, and we find such clusters in the bottom-right panel of the figure, but not the 4th largest cluster, cluster#5, with size 5...:

Cluster #1:     17_616026       17_616156
Cluster #2:     R22601  R21408  R18040  R26791  R20574  R18043  R27725  R27937  R20260  R28182
Cluster #3:     R30396  R20896
Cluster #4:     R21770  R20983
Cluster #5:     R26778  R23146  R30420  R32929  R28980
Cluster #6:     R29598  R24100
Cluster #7:     R27657  R24120  R29816  R30078  R31095  R30234  R28012  R36431
Cluster #8:     mada_125        mada_115        mada_142        mada_1-8        mada_1-11       mada_1-28       mada_2-42
Cluster #9:     mada_1-25       mada_2-25
Cluster #10:    mada_1-7        mada_1-3
Cluster #11:    mada_1-40       mada_1-44
Cluster #12:    mada_1-41       mada_2-31
Cluster #13:    mada_2-46       mada_1-46
Cluster #14:    mada_2-53       mada_1-53
Cluster #15:    mada_103        mada_124        mada_110        mada_112
Cluster #16:    mada_148        mada_109        mada_120        mada_126
Cluster #17:    mada_152        mada_132
leoisl commented 2 years ago

I think I will also try to generate the figures themselves

iqbal-lab commented 2 years ago

Michael excludes singletons from the figures - if the singletons are never clustered by anyone, he leaves tghem out @leoisl

iqbal-lab commented 2 years ago

your clusters look ok to me - 8 clusters which are pairs, just like michael has at threshold 6

iqbal-lab commented 2 years ago

i dont understand how you can have the same psdm and different clusters

leoisl commented 2 years ago

Generating the figures is not that useful if we already know the clusters are different...

Regarding the largest cluster that tbpore cluster generated with threshold=6:

Cluster #2:     R28182  R20260  R18040  R27937  R21408  R20574  R18043  R27725  R22601  R26791

I drew it and checked all the edges, and everything matches: image

So I don't understand why we don't see this cluster in the paper fig...

I am reaching the conclusion that the figure from the paper was generated from the outdated pipeline output...

It is worth to remember we got exactly the same psdm matrix:

$ diff -rqs /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv
Files /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv and /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv are identical

But I can see that H2H psdm matrix was updated on the last rerun @mbhall88 did:

$ ls -lhrt /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv
-rw-rw-r-- 1 mbhall88 iqbal 100K May  6 02:26 /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv

But I think the clusters were not updated, as they were generated with the outdated psdm matrix

iqbal-lab commented 2 years ago

so, we just need to generate the h2h clusters and figure?

leoisl commented 2 years ago

I think so, I can try to do it

leoisl commented 2 years ago

Just realised that @mbhall88 already reproduced the new clustering figures using updated data in a commit on last friday: https://github.com/mbhall88/head_to_head_pipeline/commit/add0bd965c0afe9ab7cd3675f5364ef05a114fb8 . This is the new clustering figure, that for some reason uses nanopore thresholding of 5 and 10, instead of 6 as 12:

image

The number and size of non-singleton clusters match exactly what we are getting with tbpore cluster, so I guess H2H and tbpore are generating the same clusters. I could go further and also generate the paper figure if you want to be even more sure

mbhall88 commented 2 years ago

Yeah, I updated the matrix and figures last week after I realised the VCFs hadn't been updated. So the figure in the preprint is not "correct" anyway. The one in the comment you linked is the correct one.

We changed to threshold 5 and 10 - although 10 was arbitrary and could easily have been 11 (@iqbal-lab and I discussed this already).

I could go further and also generate the paper figure if you want to be even more sure

As mentioned above the paper figure is out of date now so don't bother.

leoisl commented 2 years ago

Reopening, Zam wanted to see the clusters themselves. They are identical to H2H (only change is aesthetic: the S nodes not always appear at the bottom of the panels):

image

iqbal-lab commented 2 years ago

looks good!