morrislab / pairtree

Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
MIT License
37 stars 11 forks source link

TypeError #4

Closed brucemoran closed 3 years ago

brucemoran commented 3 years ago

Hi,

getting TypeError below with data attached running in the Singularity I posted previously, what is the cause of this can you see?

pairtree --params OVBM_GRCh38.pairtree.json OVBM_GRCh38.pairtree.ssm OVBM_GRCh38.pairtree.results.npz
Sampling trees:   0%|  | 77/120000 [00:13<3:53:21,  8.56tree/s]Exception occurred in child process: invalid value encountered in true_divide
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/miniconda/envs/pairtree/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
[Archive 2.zip](https://github.com/morrislab/pairtree/files/6150517/Archive.2.zip)

    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/tree_sampler.py", line 478, in _run_chain
    new_samp, log_p_new_given_old, log_p_old_given_new = _generate_new_sample(
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/tree_sampler.py", line 372, in _generate_new_sample
    W_dests_old = _make_W_dests_combined(
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/tree_sampler.py", line 358, in _make_W_dests_combined
    W_dests_uniform = _make_W_dests_uniform(subtree_head, curr_parent, adj, anc)
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/tree_sampler.py", line 304, in _make_W_dests_uniform
    weights /= np.sum(weights)
FloatingPointError: invalid value encountered in true_divide
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/pairtree", line 169, in <module>
    main()
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/pairtree", line 130, in main
    adjm, phi, llh, accept_rate = tree_sampler.sample_trees(
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/tree_sampler.py", line 635, in sample_trees
    raise exception
FloatingPointError: invalid value encountered in true_divide
Exception ignored in: <function tqdm.__del__ at 0x7f224d689940>
Traceback (most recent call last):
  File "/opt/miniconda/envs/pairtree/lib/python3.9/site-packages/tqdm/std.py", line 1145, in __del__
  File "/opt/miniconda/envs/pairtree/lib/python3.9/site-packages/tqdm/std.py", line 1299, in close
  File "/opt/miniconda/envs/pairtree/lib/python3.9/site-packages/tqdm/std.py", line 1492, in display
  File "/opt/miniconda/envs/pairtree/lib/python3.9/site-packages/tqdm/std.py", line 1148, in __str__
  File "/opt/miniconda/envs/pairtree/lib/python3.9/site-packages/tqdm/std.py", line 1450, in format_dict
TypeError: cannot unpack non-iterable NoneType object

Thanks,

Bruce Archive 2.zip

jwintersinger commented 3 years ago

Hi Bruce,

Thanks for reporting this! The problem occurred because you have only one cluster in the input, and there's only one possible tree structure in this configuration. My code didn't like this. I fixed it, so once you pull the latest revision, Pairtree will run successfully on your data.

Unfortunately, the output won't be terribly interesting with only one tree. All trees will look like this:

image

Try running bin/clustervars on your data with the --model pairwise option. This should produce 18 clusters on your data, which will give more interesting and informative trees.

Please let me know if you run into any other issues!

brucemoran commented 3 years ago

OK Jeff, thanks for getting back so quickly.

I had run pyclone-vi and gotten the one cluster.

Running clustervars --model pairwise OVBM_GRCh38.pairtree.ssm OVBM_GRCh38.pairtree.json out_params_fn.json gives error:

Traceback (most recent call last):
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/clustervars", line 267, in <module>
    main()
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/clustervars", line 238, in main
    vids, assigns, llhs = _cluster(
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/clustervars", line 140, in _cluster
    supervars, clustrel_posterior, _, _, _ = clustermaker.use_pre_existing(
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/clustermaker.py", line 24, in use_pre_existing
    clust_posterior, clust_evidence = pairwise.calc_posterior(supervars, logprior, rel_type='supervariant', parallel=parallel)
  File "/opt/miniconda/envs/pairtree/share/pairtree/bin/../lib/pairwise.py", line 151, in calc_posterior
    if parallel > 0:
TypeError: '>' not supported between instances of 'NoneType' and 'int'

That uses the previous json, tried with json with blanks for clusters and garbage which gives same.

Bruce

jwintersinger commented 3 years ago

Hi Bruce,

I ran into the same error you just listed (because I did something dumb in the code). I fixed it in 4749ac2d08eef10867247770ad479fa07b732820. If you pull the latest revision, you should get that fix.

Please let me know if you have any other problems!

brucemoran commented 3 years ago

Hi Jeff,

great that's working for me now, many thanks.

I presume that the clusters from clustervars are relative to the populations as in the image below? I.e. that the mutations in the 18th cluster are those that are in the 18th population? Do populations and clusters line up in this way?

So for example I can say that pop. 18 mutations (there are only 3) are possibly indicative of the clone driving the 3 metastases (cols 1,2,3 vs. primary in col 4)?

Thanks,

Bruce

image
jwintersinger commented 3 years ago

Hi Bruce,

I presume that the clusters from clustervars are relative to the populations as in the image below?

Yes, that's correct, provided you don't pass --reorder-subclones to bin/plottree. (This option will relabel the tree nodes based on the tree structure, so that ancestor nodes will always have lower indices than their descendants.)

After playing with your data a bit more, I was able to get a better clustering with these parameters:

python3 ~/work/pairtree/bin/clustervars --concentration -0.5 --seed 1337 OVBM_GRCh38.ssm OVBM_GRCh38.params.json OVBM_GRCh38.linfreq.params.json

The goal with clustering is to make sure the VAFs of the cluster members are such that variants within a cluster have about the same VAFs, but variants in two different clusters have different VAFs. You can check this with bin/plotvars before running the full bin/pairtree. Depending on the dataset, you might get better results with either cluster model (--model pairwise or --model linfreq) and with varying clustering parameters (--prior and --concentration for pairwise, or just --concentration for linfreq). In this case, the parameters above gave a good result with five clusters. You can get the corresponding params.json file here: https://www.cs.toronto.edu/~jawinter/plots/2021.03.18.1/OVBM_GRCh38.linfreq.params.json

I then ran Pairtree on those clusters. I got the following result: https://www.cs.toronto.edu/~jawinter/plots/2021.03.18.1/OVBM_GRCh38.plottree.html. There's not a lot of difference between the met and primary samples.

image

The patterns are similar to what you saw in the 18-cluster case. Here, it's somewhat interesting that pop. 1 is present in two of the mets but not primary or 17N. I more or less believe this result based on the underlying VAFs.

But there's definitely some copy-number weirdness happening in some of the clusters, which becomes clear if you look at the supervariants.

image

If you view the web page yourself, you can mouse over the entries in this table to get the maximum likelihood estimate (MLE) subclonal frequencies for each supervariant in each sample. 17N in particular has MLE subclonal frequencies that are well over 1 for several clusters, which is indicative of uncorrected copy number events. If you removed 17N, you might end up with a different clustering and/or tree structure, but I don't think it would change massively. You might end up with a similar linear tree, but with a different order of clusters (1 -> 2 -> 4 -> 3 -> 5).

brucemoran commented 3 years ago

Hi Jeff,

this is really good of you to run my data, really appreciated. Think it will be useful for other users too.

A couple of questions:

  1. for --concentration how did you arrive at 0.5? Do you run at default -2 then try -1, 0, 0.5 etc?
  2. The supervariant concept is very useful, but I don't see how MLE indicates uncorrected CNA.
  3. Is (2) above the 'copy number weirdness' you are referring to? Can you explain further? I know you only use diploid loci (read in Issue #3, and seen in your analysis here). NB this is high-grade ovarian primary and CNA are highly abundant, covering over 75% of Primary genome.

What is interesting is that 14N and 17N were at the same site (or very close), and we think 17N is probably a recurrence of 14N therefore. So seeing Pops 4 and 5, but also divergence at Pops 1 and 3 is interesting. Also that Pop 1 is seen in 16N, but it also has exclusive Pop 0, and shares Pop 2 with Primary.

In short I think this adds quite a lot to the analysis and interpretation of the case.

Thanks again,

Bruce

jwintersinger commented 3 years ago

Hi Bruce,

I'm glad to help!

  1. Yeah, to decide on --concentration, I did exactly what you described. I tried different values and looked at the results using bin/plotvars for each. (I actually used the same series of values you listed -- I started at the default of -2, which had too few clusters for my liking. I then went to -1 (too few clusters), 0 (too many clusters), and then settled for -0.5.) Unfortunately, I wasn't able to settle on a clustering method that works well in all settings. People's data differ a lot with respect to number of variants, number of cancer samples, and read depth, and a certain choice of clustering model (pairwise or linfreq) and the associated hyperparameters may work well in one setting, but perform poorly in another. So, in practice, I try a number of combinations and then go with the clustering result that looks best.
  2. When the MLE subclonal frequency for a supervariant is well over 1 for a cancer sample, it suggests there was a missed copy number event. Since the subclonal frequencies indicate what proportion of cells carry a given mutation, these values should range between 0 and 1. When the MLEs are substantially in excess of 1, it suggests a problem in the data.
    • Suppose you have V=6 variant reads and T=10 total reads (implying T-V=4 reference reads), with omega=0.5. The omega value indicates the probability of observing the variant allele in a cell carrying the mutation, so 0.5 corresponds to a diploid cell unaffected by copy number.
    • I'll refer to the MLE subclonal frequency as phi_mle. We always have phi_mle = V / (omega*T). Here we get phi_mle = 6/5 = 1.2. Having phi_mle > 1 in this case isn't really problematic, since the low read depth makes our estimates imprecise.
    • But now suppose we have another variant that occurred at a locus with a copy-number loss of the reference allele. Suppose we see V=1000 and T=1000. Now imagine we missed calling the copy number, so even though we should set omega to 1, we've left it at omega=0.5. This gives phi_mle=2.0, which makes no sense.
    • We obtain the supevariant read count for a cluster by summing the counts of the constituent variants. Effectively, we get V for the supervariant by summing all the Vs of the individual variants; we get T by summing the omega*T values for the individual variants; and we fix omega = 1. If the resulting supervariant has a phi_mle >> 1 (where the extent to which phi_mle must exceed 1 to be deemed anomalous of course depends on the magnitude of its T), it suggests many of the constituent variants were affected by copy-number events (whether clonal or subclonal) that aren't reflected in their omega values.
    • To figure out the correct omega value for an individual SNV at a locus affected by a CNA, you unfortunately need to fulfill a rather difficult set of conditions. You must know that the CNA occurred prior to or concurrently with the SNV; you must know that no subsequent (subclonal) CNAs occurred after the SNV at that locus; and you must be able to phase the SNV. All of these conditions are in service of the requirement that, for any cell carrying the SNV, you can say "there will be A reference copies of that locus and B variant copies, such that the probability of observing the variant in that cell is omega = B/(A+B)."
    • (As an aside: it would have been possible to design the tree search to put CNA events on the tree, including subclonal events, and account for how they changed the probability of observing the variant. PhyloWGS, which our lab published back in 2015, did exactly this. But our experience in the Pan-Cancer Analysis of Whole Genomes project convinced me that calling subclonal CNAs from bulk DNA-seq data is way too messy and difficult, and that the results from methods that do this can't be relied upon. Because of this, I think the most reliable solution is generally just discarding SNVs from genomic regions affected by copy number, unless it's a super-simple CNA like an early WGD or LOH, and trying to build a clone tree from the variants that remain.)
  3. I hope the above addresses your question about copy number weirdness. The key message from what preceded is this: if you see phi_mle >> 1, you know that you probably have missed copy-number events. These skew the relationship between allele frequency (i.e., the VAFs of the input data) and the subclonal frequencies those imply. This in turn can lead to inaccurate results, both with respect to tree structure, and to the subclonal and population frequencies we fit to the tree. When you see phi_mle >> 1, whether for individual variants or the supervariant, this is a clear sign that you have the wrong omega values for some variants in some samples, presumably because of missed (or incorrect) copy number calls. However, having phi_mle >> 1 is a sufficient but not necessary condition for the omega values being wrong -- i.e., it's a clear signal that something is messed up, but you can also have incorrect omega values even when phi_mle < 1. It's just that having phi_mle >> 1 makes this apparent.

You might find this script (https://github.com/morrislab/pairtree/blob/master/util/remove_high_vaf.py) helpful. It applies a statistical test to find variants whose phi_mles are implausibly high (>> 1), then outputs a modified params.json file that has added those variants to garbage.

With respect to the conclusion that you have pop. 3 in 17N, but absent from 14N: ultimately, this is a question of whether the subclonal frequencies are different for subclone 3 and its child subclone 4 in those two samples. Let's call these values {phi_3_14N, phi_3_17N, phi_4_14N, phi_4_17N}.

I hope this makes clear some of the concerns that arise with missed copy-number events, how this makes difficult the conversion between allele frequency and subclonal frequency, and how this in turn can result in getting the wrong tree structure and population frequencies.

Please let me know if any of the above is unclear, or if you have other questions!