Closed leoisl closed 2 years ago
As mentioned in Slack, sorry but https://github.com/mbhall88/head_to_head_pipeline/blob/bcbc84971342a26cd0a9f0ad8df4f01dcf35c01c/analysis/transmission_clustering/eda/clustering.ipynb is probably the most up-to-date notebook for the clustering viz. I think the core logic is the same, I just changed some aesthetics maybe, but better to be safe.
In terms of converting the graph to a text format, I guess one of the formats in https://networkx.org/documentation/stable/reference/convert.html would be fine.
Clustering is implemented in branch cluster in my fork , and the current clustering output is a text file with the following format:
Cluster #1: mada_1-8, mada_115, mada_142, mada_1-11, mada_125, mada_1-28
Cluster #2: mada_1-25, mada_2-25
Cluster #3: mada_1-3, mada_1-7
Cluster #4: mada_1-41, mada_2-31
I am supposing this is just for human consumption, so this looks ok? Should we have a machine consumption format too?
this is perfect! You could make it tsv, and then is machine consumable?
Ok!
Clusters are now also being created in the pipeline. I changed the pipeline to get samples from all the 3 sites, instead of just Madagascar. I ended up with 151 samples, which are the same used in the H2H
pipeline.
The issue is that we got the exactly same psdm matrix:
$ diff -rqs /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv
Files /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv and /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv are identical
But not the same clusters...
These are the clusters you got in the paper:
But I am confused as I don't see 151 datapoints in each the subpanels. Anyway I think I am doing sth wrong. Especially because for clustering threshold=6, in tbpore cluster
we have 124 clusters, with 111 being singleton clusters and 13 being the following non-singletons:
Cluster #1: 17_616156 17_616026
Cluster #2: R28182 R20260 R18040 R27937 R21408 R20574 R18043 R27725 R22601 R26791
Cluster #3: R20983 R21770
Cluster #4: R29816 R36431 R24120
Cluster #5: R30078 R27657 R28012
Cluster #6: R31095 R30234
Cluster #7: mada_1-28 mada_142 mada_125 mada_115 mada_1-8 mada_1-11
Cluster #8: mada_2-25 mada_1-25
Cluster #9: mada_1-7 mada_1-3
Cluster #10: mada_1-41 mada_2-31
Cluster #11: mada_2-46 mada_1-46
Cluster #12: mada_2-53 mada_1-53
Cluster #13: mada_152 mada_132
Just looking at the number of non-singletons clusters and their size, it does not match the clustering at the top right panel of the figure... e.g. the largest tbpore
cluster is cluster#2, which has size 10, but there is no cluster with size 10 on the panel. I am wondering what did I do wrong, could you please help me @mbhall88 ? Clustering script is here: https://github.com/leoisl/tbpore/blob/cluster/tbpore/clustering.py
FYI, these are the clusters with threshold=12. The 3 first largest clusters have size 10, 8 and 7, and we find such clusters in the bottom-right panel of the figure, but not the 4th largest cluster, cluster#5, with size 5...:
Cluster #1: 17_616026 17_616156
Cluster #2: R22601 R21408 R18040 R26791 R20574 R18043 R27725 R27937 R20260 R28182
Cluster #3: R30396 R20896
Cluster #4: R21770 R20983
Cluster #5: R26778 R23146 R30420 R32929 R28980
Cluster #6: R29598 R24100
Cluster #7: R27657 R24120 R29816 R30078 R31095 R30234 R28012 R36431
Cluster #8: mada_125 mada_115 mada_142 mada_1-8 mada_1-11 mada_1-28 mada_2-42
Cluster #9: mada_1-25 mada_2-25
Cluster #10: mada_1-7 mada_1-3
Cluster #11: mada_1-40 mada_1-44
Cluster #12: mada_1-41 mada_2-31
Cluster #13: mada_2-46 mada_1-46
Cluster #14: mada_2-53 mada_1-53
Cluster #15: mada_103 mada_124 mada_110 mada_112
Cluster #16: mada_148 mada_109 mada_120 mada_126
Cluster #17: mada_152 mada_132
I think I will also try to generate the figures themselves
Michael excludes singletons from the figures - if the singletons are never clustered by anyone, he leaves tghem out @leoisl
your clusters look ok to me - 8 clusters which are pairs, just like michael has at threshold 6
i dont understand how you can have the same psdm and different clusters
Generating the figures is not that useful if we already know the clusters are different...
Regarding the largest cluster that tbpore cluster
generated with threshold=6:
Cluster #2: R28182 R20260 R18040 R27937 R21408 R20574 R18043 R27725 R22601 R26791
I drew it and checked all the edges, and everything matches:
So I don't understand why we don't see this cluster in the paper fig...
I am reaching the conclusion that the figure from the paper was generated from the outdated pipeline output...
It is worth to remember we got exactly the same psdm
matrix:
$ diff -rqs /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv
Files /hps/nobackup/iqbal/leandro/tbpore3/tbpore/pipelines/snakemake/output/clusters_threshold_6/.tbpore/psdm.matrix.csv and /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv are identical
But I can see that H2H
psdm
matrix was updated on the last rerun @mbhall88 did:
$ ls -lhrt /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv
-rw-rw-r-- 1 mbhall88 iqbal 100K May 6 02:26 /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/distance/bcftools.matrix.csv
But I think the clusters were not updated, as they were generated with the outdated psdm
matrix
so, we just need to generate the h2h clusters and figure?
I think so, I can try to do it
Just realised that @mbhall88 already reproduced the new clustering figures using updated data in a commit on last friday: https://github.com/mbhall88/head_to_head_pipeline/commit/add0bd965c0afe9ab7cd3675f5364ef05a114fb8 . This is the new clustering figure, that for some reason uses nanopore thresholding of 5 and 10, instead of 6 as 12:
The number and size of non-singleton clusters match exactly what we are getting with tbpore cluster
, so I guess H2H
and tbpore
are generating the same clusters. I could go further and also generate the paper figure if you want to be even more sure
Yeah, I updated the matrix and figures last week after I realised the VCFs hadn't been updated. So the figure in the preprint is not "correct" anyway. The one in the comment you linked is the correct one.
We changed to threshold 5 and 10 - although 10 was arbitrary and could easily have been 11 (@iqbal-lab and I discussed this already).
I could go further and also generate the paper figure if you want to be even more sure
As mentioned above the paper figure is out of date now so don't bother.
Reopening, Zam wanted to see the clusters themselves. They are identical to H2H (only change is aesthetic: the S
nodes not always appear at the bottom of the panels):
looks good!
Command:
tbpore cluster <list_of_consensus> -o <outdir>
Will run
psdm
and generate the clusters as in https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/notebooks/clustering.py.ipynb . No need for cluster visualisation, just the identifiers of the samples that are clustered, like comma-sep lists.Another option is based on this excerpt from slack:
Then the user I think would need to run a couple of
cat
commands to update some files,psdm
to update the matrix and a script we provide to build the clusters from thepsdm
matrix. As this is some commands, it seems it would be easier for them to have thistbpore cluster
command.The current
tbpore
execution becomes commandtbpore process