Closed brucemoran closed 3 years ago
Hi Bruce,
Thanks for reporting this! The problem occurred because you have only one cluster in the input, and there's only one possible tree structure in this configuration. My code didn't like this. I fixed it, so once you pull the latest revision, Pairtree will run successfully on your data.
Unfortunately, the output won't be terribly interesting with only one tree. All trees will look like this:
Try running bin/clustervars
on your data with the --model pairwise
option. This should produce 18 clusters on your data, which will give more interesting and informative trees.
Please let me know if you run into any other issues!
OK Jeff, thanks for getting back so quickly.
I had run pyclone-vi
and gotten the one cluster.
Running clustervars --model pairwise OVBM_GRCh38.pairtree.ssm OVBM_GRCh38.pairtree.json out_params_fn.json
gives error:
Traceback (most recent call last):
File "/opt/miniconda/envs/pairtree/share/pairtree/bin/clustervars", line 267, in <module>
main()
File "/opt/miniconda/envs/pairtree/share/pairtree/bin/clustervars", line 238, in main
vids, assigns, llhs = _cluster(
File "/opt/miniconda/envs/pairtree/share/pairtree/bin/clustervars", line 140, in _cluster
supervars, clustrel_posterior, _, _, _ = clustermaker.use_pre_existing(
File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/clustermaker.py", line 24, in use_pre_existing
clust_posterior, clust_evidence = pairwise.calc_posterior(supervars, logprior, rel_type='supervariant', parallel=parallel)
File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/pairwise.py", line 151, in calc_posterior
if parallel > 0:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
That uses the previous json, tried with json with blanks for clusters
and garbage
which gives same.
Bruce
Hi Bruce,
I ran into the same error you just listed (because I did something dumb in the code). I fixed it in 4749ac2d08eef10867247770ad479fa07b732820. If you pull the latest revision, you should get that fix.
Please let me know if you have any other problems!
Hi Jeff,
great that's working for me now, many thanks.
I presume that the clusters from clustervars
are relative to the populations
as in the image below? I.e. that the mutations in the 18th cluster are those that are in the 18th population? Do populations and clusters line up in this way?
So for example I can say that pop. 18 mutations (there are only 3) are possibly indicative of the clone driving the 3 metastases (cols 1,2,3 vs. primary in col 4)?
Thanks,
Bruce
Hi Bruce,
I presume that the clusters from clustervars are relative to the populations as in the image below?
Yes, that's correct, provided you don't pass --reorder-subclones
to bin/plottree
. (This option will relabel the tree nodes based on the tree structure, so that ancestor nodes will always have lower indices than their descendants.)
After playing with your data a bit more, I was able to get a better clustering with these parameters:
python3 ~/work/pairtree/bin/clustervars --concentration -0.5 --seed 1337 OVBM_GRCh38.ssm OVBM_GRCh38.params.json OVBM_GRCh38.linfreq.params.json
The goal with clustering is to make sure the VAFs of the cluster members are such that variants within a cluster have about the same VAFs, but variants in two different clusters have different VAFs. You can check this with bin/plotvars
before running the full bin/pairtree
. Depending on the dataset, you might get better results with either cluster model (--model pairwise
or --model linfreq
) and with varying clustering parameters (--prior
and --concentration
for pairwise
, or just --concentration
for linfreq
). In this case, the parameters above gave a good result with five clusters. You can get the corresponding params.json
file here: https://www.cs.toronto.edu/~jawinter/plots/2021.03.18.1/OVBM_GRCh38.linfreq.params.json
I then ran Pairtree on those clusters. I got the following result: https://www.cs.toronto.edu/~jawinter/plots/2021.03.18.1/OVBM_GRCh38.plottree.html. There's not a lot of difference between the met and primary samples.
The patterns are similar to what you saw in the 18-cluster case. Here, it's somewhat interesting that pop. 1 is present in two of the mets but not primary or 17N. I more or less believe this result based on the underlying VAFs.
But there's definitely some copy-number weirdness happening in some of the clusters, which becomes clear if you look at the supervariants.
If you view the web page yourself, you can mouse over the entries in this table to get the maximum likelihood estimate (MLE) subclonal frequencies for each supervariant in each sample. 17N in particular has MLE subclonal frequencies that are well over 1 for several clusters, which is indicative of uncorrected copy number events. If you removed 17N, you might end up with a different clustering and/or tree structure, but I don't think it would change massively. You might end up with a similar linear tree, but with a different order of clusters (1 -> 2 -> 4 -> 3 -> 5).
Hi Jeff,
this is really good of you to run my data, really appreciated. Think it will be useful for other users too.
A couple of questions:
--concentration
how did you arrive at 0.5
? Do you run at default -2
then try -1, 0, 0.5
etc?supervariant
concept is very useful, but I don't see how MLE indicates uncorrected CNA.What is interesting is that 14N and 17N were at the same site (or very close), and we think 17N is probably a recurrence of 14N therefore. So seeing Pops 4 and 5, but also divergence at Pops 1 and 3 is interesting. Also that Pop 1 is seen in 16N, but it also has exclusive Pop 0, and shares Pop 2 with Primary.
In short I think this adds quite a lot to the analysis and interpretation of the case.
Thanks again,
Bruce
Hi Bruce,
I'm glad to help!
--concentration
, I did exactly what you described. I tried different values and looked at the results using bin/plotvars
for each. (I actually used the same series of values you listed -- I started at the default of -2
, which had too few clusters for my liking. I then went to -1
(too few clusters), 0
(too many clusters), and then settled for -0.5
.) Unfortunately, I wasn't able to settle on a clustering method that works well in all settings. People's data differ a lot with respect to number of variants, number of cancer samples, and read depth, and a certain choice of clustering model (pairwise
or linfreq
) and the associated hyperparameters may work well in one setting, but perform poorly in another. So, in practice, I try a number of combinations and then go with the clustering result that looks best.V=6
variant reads and T=10
total reads (implying T-V=4
reference reads), with omega=0.5
. The omega
value indicates the probability of observing the variant allele in a cell carrying the mutation, so 0.5
corresponds to a diploid cell unaffected by copy number.phi_mle
. We always have phi_mle = V / (omega*T)
. Here we get phi_mle = 6/5 = 1.2
. Having phi_mle > 1
in this case isn't really problematic, since the low read depth makes our estimates imprecise.V=1000
and T=1000
. Now imagine we missed calling the copy number, so even though we should set omega
to 1, we've left it at omega=0.5
. This gives phi_mle=2.0
, which makes no sense.V
for the supervariant by summing all the V
s of the individual variants; we get T
by summing the omega*T
values for the individual variants; and we fix omega = 1
. If the resulting supervariant has a phi_mle >> 1
(where the extent to which phi_mle
must exceed 1 to be deemed anomalous of course depends on the magnitude of its T
), it suggests many of the constituent variants were affected by copy-number events (whether clonal or subclonal) that aren't reflected in their omega
values.omega
value for an individual SNV at a locus affected by a CNA, you unfortunately need to fulfill a rather difficult set of conditions. You must know that the CNA occurred prior to or concurrently with the SNV; you must know that no subsequent (subclonal) CNAs occurred after the SNV at that locus; and you must be able to phase the SNV. All of these conditions are in service of the requirement that, for any cell carrying the SNV, you can say "there will be A
reference copies of that locus and B
variant copies, such that the probability of observing the variant in that cell is omega = B/(A+B)
."phi_mle >> 1
, you know that you probably have missed copy-number events. These skew the relationship between allele frequency (i.e., the VAFs of the input data) and the subclonal frequencies those imply. This in turn can lead to inaccurate results, both with respect to tree structure, and to the subclonal and population frequencies we fit to the tree. When you see phi_mle >> 1
, whether for individual variants or the supervariant, this is a clear sign that you have the wrong omega
values for some variants in some samples, presumably because of missed (or incorrect) copy number calls. However, having phi_mle >> 1
is a sufficient but not necessary condition for the omega
values being wrong -- i.e., it's a clear signal that something is messed up, but you can also have incorrect omega
values even when phi_mle < 1
. It's just that having phi_mle >> 1
makes this apparent.You might find this script (https://github.com/morrislab/pairtree/blob/master/util/remove_high_vaf.py) helpful. It applies a statistical test to find variants whose phi_mle
s are implausibly high (>> 1), then outputs a modified params.json
file that has added those variants to garbage.
With respect to the conclusion that you have pop. 3 in 17N
, but absent from 14N
: ultimately, this is a question of whether the subclonal frequencies are different for subclone 3 and its child subclone 4 in those two samples. Let's call these values {phi_3_14N, phi_3_17N, phi_4_14N, phi_4_17N}
.
phi_2_17N_mle = 1.15 >> 1
and phi_3_17N_mle = 1.21 >> 1
. This is part of what concerns me about 17N
. In fact, if we took this sample in isolation, it would imply that subclone 3 could not be the child of subclone 2, since phi_3_17N_mle > phi_2_17N_mle
. (This constraint is effectively overridden by subclone 2 having a higher MLE phi
than subclone 3 in all three other samples -- nether solutions fits the data cleanly, but this is the less-bad choice.)phi_3
being higher than phi_4
in 16N
and 17N
(implying that 3 should be the ancestor of 4), but lower in 14N
and primary (implying 4 should be the ancestor of 3).I hope this makes clear some of the concerns that arise with missed copy-number events, how this makes difficult the conversion between allele frequency and subclonal frequency, and how this in turn can result in getting the wrong tree structure and population frequencies.
Please let me know if any of the above is unclear, or if you have other questions!
Hi,
getting TypeError below with data attached running in the Singularity I posted previously, what is the cause of this can you see?
Thanks,
Bruce Archive 2.zip