morrislab / pairtree

Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
MIT License
37 stars 11 forks source link

Compatibility with R/bioconductor fishplot package #33

Closed jsha129 closed 1 year ago

jsha129 commented 1 year ago

Hi there, thank you for developing this wonderful tool. It's FAST! We want to use pairtree output for fishplot package ('https://github.com/chrisamiller/fishplot'). I ran the example data using seed 2022 and extracted data for the tree 0 to get 'eta' and 'parents'. The output seems to be incompatible with fishplot. Could you please provide suggestions to overcome this? Many thanks

> pop_freq
  Sample 1 Sample 2 Sample 3
0     0.53     0.36     0.21
1     0.32     0.22     0.25
2     0.10     0.39     0.36
3     0.04     0.03     0.19
> parents
[1] 0 1 1
timepoints=c(0,30,60)    
fish = createFishObject(frac.table,parents,timepoints=timepoints)
Error in validateInputs(frac.table, parents, nest.level, clone.labels,  : 
  clones with same parent cannot have values that sum to more than the percentage of the parent: Problem is in clusters  2,3 at timepoint 2
ethanumn commented 1 year ago

Hi there --

So I'm assuming you're using the example data provided in the pairtree/example folder. If this is the case and you are setting the parameter --seed=2022, then I am not getting quite the same output you are for the eta values. Either way, I think it's a pretty simple fix to prepare the pairtree outputs to be inputted into fishplot.

The first thing to note is your pop_freq matrix has 4 populations, while the parents variable only has 3 populations. Pairtree outputs this by design, as the "Node 0" in all of the trees outputted represents the population of cells from which all cancerous cell populations are descended from. A more detailed explanation of this can be found in the Pairtree STAR protocol paper, particularly the following sections: 13. Interpreting the results from summposterior, and 16. Interpreting the results from plottree.

However, the main issue is the pop_freq matrix. I believe what fishplot is actually expecting are the phi values and the confusion here is from the different use of terminology. If I'm wrong please correct me.

I believe the following python code should fix your problem:

  1. Extract the subclonal fractions (phi) and parents (struct) from the pairtree output file (e.g., example.results.npz)
import os
import numpy as np
results_npz = np.load(os.path.join("path/to/pairtree/example/results, "example.results.npz"))
parents = results_npz["struct"][0] # assume you want the parents vector associated with the tree at index 0
phi = results_npz["phi"][0] # extract the tree constrained subclonal frequencies associated with the tree at index 0
  1. Reformat the parents vector (assumes you want to keep Node 0)
parents = np.insert(parents + 1, 0, 0) # format for fishplot

This data can then be exported to R using a variety of different packages. From reading the fishplot documentation and looking at their paper, I believe these are the inputs the program is looking for. I was able to successfully generate plots using this approach.

Please let me know if this is helpful.

Ethan

jsha129 commented 1 year ago

Hi Ethan, Thank you for answering the question. Your solution works! I wasn't sure how to read npz file in R, so I ended up using data for the best tree (tree0) exported as json. To do this from bash/commandline:

$PTDIR/bin/plottree \
    --runid ${patient} \
    --tree-index 0\
    --tree-json tree0.json \
    --phi-orientation populations_as_rows \
    $input.ssm  \
    $input_params.json \
    pairtree.results.npz \
    pairtree.results_tree0.html

Here's its implementation in R for the fishplot using the json data:

library(fishplot)

## get pop freq and parents data from pairtree export
## It was easier to read a json than npz file in R. Hence, data for tree0 (the best tree)  was exported as a json.
library("rjson")
json_file <- "tree0.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

##! get population frequencies 
frac.table <- Reduce(rbind, json_data$phi[1:length(json_data$phi)]) # not eta
frac.table <- 100 * round(frac.table, 2)
colnames(frac.table ) <- json_data$samples
rownames(frac.table ) <- seq(0, nrow(frac.table )-1)
print(frac.table)

## get the structure of the clones, aka parents
## need to insert 0 in the beginning since 0 is the founder clone for everything. 'parents + 1' because python is 0-index and R/fishplot is 1. 
parents <- c(0, json_data$parents + 1) 
print(parents)

#visualization. distribute samples at regular interval from 0 to 100.
timepoints <- seq(0, 100, 100/(ncol(frac.table) - 1) )     

#create a fish object
fish = createFishObject(frac.table,parents,timepoints=timepoints)

#calculate the layout of the drawing
fish = layoutClones(fish)

#draw the plot, using the splining method (recommended)
#and providing both timepoints to label and a plot title
fishPlot(fish,shape="spline",title.btm="Patient X",
         cex.title=0.5, vlines=timepoints, 
         vlab=colnames(frac.table))
ethanumn commented 1 year ago

Awesome, glad it worked out! Thanks for providing the code you used.