morrislab / phylowgs

Application for inferring subclonal composition and evolution from whole-genome sequencing data.
GNU General Public License v3.0

Post hoc mutation assignment #77

Open mjz1 opened 6 years ago

mjz1 commented 6 years ago

Hey,

Was hoping for some guidance on using the post_assign_ssm.py script.

The idea is that we get the phylogenetic trees/population structure by subsampling ~5000 mutations, but afterwards we are interested in knowing which population each non-subsampled mutation belongs to (for example, there may be a subclonal driver variant important for the tumour's progression that wasn't included in the subsample). Ideally we would end up with the results output by the write_results.py script for all mutations in a given sample.

This is what I understand the approach to be:

  1. Run create_phylowgs_inputs.py with the -s 5000 option, also writing out all non-subsampled variants using --nonsubsampled-variants nonsubsampled_ssm_data.txt.
  2. Run phylowgs using evolve.py on the subsampled variants.
  3. Run write_results.py.
  4. (this is where I am unsure) Run post_assign_ssm.py, passing nonsubsampled_ssm_data.txt for the ssm_file option, the trees.zip file for the trees_file option, and then all of the non-subsampled ssm_ids separated by spaces?
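For step 4, the list of non-subsampled SSM ids does not need to be typed by hand; it can be pulled from the non-subsampled file itself. A minimal Python sketch, assuming the usual tab-separated SSM file layout with an `id` header column (`read_ssm_ids` is a hypothetical helper, not part of PhyloWGS):

```python
import csv

def read_ssm_ids(ssm_path):
    """Collect the SSM ids (first column, e.g. s0, s1, ...) from a
    PhyloWGS-style tab-separated SSM file with a header row."""
    with open(ssm_path) as f:
        return [row['id'] for row in csv.DictReader(f, delimiter='\t')]

# Example usage (file name as in step 1 above):
#   ids = read_ssm_ids('nonsubsampled_ssm_data.txt')
#   print(' '.join(ids))  # space-separated, ready for the command line
```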

How close is this to what is supposed to be done? I feel like this may not be quite correct. Does cnv_data.txt need to be passed to the post_assign_ssm.py script as well? What kind of output does this produce? And can it be fed back into the write_results.py script to get the type of output I want?

Any help would be greatly appreciated.

Thanks!

jwintersinger commented 6 years ago

Hi! We didn't document post_assign_ssm.py, as we never got it working exactly as we wanted. I'm surprised you managed to figure out everything with no documentation :) -- everything you said is basically accurate.

The only suggestion I can give now is to look at this shell script: https://gist.github.com/jwintersinger/d8dd1a221061aec446fae11b66b5b577. The basic idea is that you could run parallel instances of the posthoc script on batches of 150 SSMs, then merge the results into the JSON results generated for the original 5000 mutations after the fact. This approach ended up being really IO intensive, and as we were working on an extremely slow network filesystem, we abandoned it. You may have better luck, however, if you have fast IO. If you hack on the code a little, you may be able to get it working.
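The batch-and-merge idea in that script can be sketched in Python. The chunk size of 150 comes from the comment above; the shape of each batch's result (a `{ssm_id: node_id}` dict) is an assumption about post_assign_ssm.py's JSON output, and both helpers here are hypothetical:

```python
def chunks(items, size=150):
    """Split the SSM ids into batches for parallel posthoc runs."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def merge_assignments(batch_results):
    """Union per-batch {ssm_id: node_id} dicts into one mapping.

    Each batch should cover a disjoint set of SSMs (as produced by
    chunks() above); a duplicate id means two batches overlapped."""
    merged = {}
    for result in batch_results:
        for ssm_id, node in result.items():
            assert ssm_id not in merged, 'SSM assigned in two batches'
            merged[ssm_id] = node
    return merged
```

The merged mapping could then be folded back into the JSON results generated for the original 5000 mutations.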

Good luck! Sorry I can't offer more direct advice.

mjz1 commented 6 years ago

I see. I'll give it a shot. I'm guessing this is why the priority SSMs file option exists: for cases where we want to make sure known cancer-gene mutations are included in the phylogenetic reconstruction?

As for merging the results back into the JSON, it looks like what you have linked simply iterates through your samples, post hoc assigning 150 SSMs at a time (i.e. no re-merging with the JSON). Since my post I was able to test and get post_assign_ssm.py working, so I'm okay as far as that goes. I see the JSON file it produces, with node (cluster?) assignments for each mutation in each of the trees (n=2500). I'm not sure exactly how write_results.py works, but is it possible to use a similar approach to write the reports for these post-hoc mutations? Pick the best tree's assignments and go with that?
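One way to collapse the per-tree assignments into a single cluster call per mutation is a majority vote across trees; taking the single best (highest-likelihood) tree's assignments is the other obvious choice. A sketch, assuming the posthoc JSON can be loaded as `{tree_idx: {ssm_id: node_id}}` (that layout, and the helper name, are assumptions, not the documented format):

```python
from collections import Counter

def modal_assignments(tree_assignments):
    """Return each SSM's most frequent node across trees, together with
    the fraction of trees that agree with that call."""
    votes = {}
    for assignments in tree_assignments.values():
        for ssm_id, node in assignments.items():
            votes.setdefault(ssm_id, Counter())[node] += 1
    n_trees = len(tree_assignments)
    result = {}
    for ssm_id, counts in votes.items():
        node, n = counts.most_common(1)[0]
        result[ssm_id] = (node, n / n_trees)
    return result
```

Low agreement fractions would flag mutations whose cluster assignment is unstable across the sampled trees.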

I think for now our main goal is the assignment of all SSMs to clusters.