Open mjz1 opened 6 years ago
Hi! We didn't document `post_assign_ssm.py`, as we never got it working exactly as we wanted. I'm surprised you managed to figure out everything with no documentation :) -- everything you said is basically accurate.
The only suggestion I can give now is to look at this shell script: https://gist.github.com/jwintersinger/d8dd1a221061aec446fae11b66b5b577. The basic idea is that you could run parallel instances of the posthoc script on batches of 150 SSMs, then merge the results into the JSON results generated for the original 5000 mutations after the fact. This approach ended up being really IO intensive, and as we were working on an extremely slow network filesystem, we abandoned it. You may have better luck, however, if you have fast IO. If you hack on the code a little, you may be able to get it working.
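The batching idea can be sketched roughly as below. This is only an illustration, not code from the linked gist: the function name `batch_ssm_ids` and the example IDs are made up, and the sketch deliberately stops short of invoking the posthoc script, since its command-line interface is undocumented.

```python
from typing import Iterator, List


def batch_ssm_ids(ssm_ids: List[str], batch_size: int = 150) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size SSM IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ssm_ids), batch_size):
        yield ssm_ids[start:start + batch_size]


# Hypothetical IDs standing in for the non-subsampled SSMs.
ssm_ids = [f"s{i}" for i in range(5000, 5400)]

batches = list(batch_ssm_ids(ssm_ids))
for index, batch in enumerate(batches):
    # Each batch could be handed to its own instance of the posthoc script.
    print(f"batch {index}: {len(batch)} SSMs, first={batch[0]}, last={batch[-1]}")
```

Each batch is independent of the others, which is what makes running the instances in parallel possible; the per-batch outputs would still need to be merged afterwards.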
Good luck! Sorry I can't offer more direct advice.
I see. I'll give it a shot. I'm guessing this is why the priority ssms file option exists? In cases where we want to make sure known cancer gene mutations are involved in the phylogenetic reconstruction?
As for merging the results back into the JSON, it looks like what you have linked simply iterates through your samples, post hoc assigning 150 SSMs at a time (i.e. no remerging with the JSON). Since my post I was able to test `post_assign_ssm.py` and get it working, so I'm okay as far as that goes. I see the JSON file it produces with node (cluster?) assignments for each mutation for each of the trees (n=2500). I'm not sure exactly how `write_reports.py` works, but is it possible to use a similar approach to write the reports for these post-hoc mutations? Pick the best tree assignments and go with that?
I think for now our main goal is assignment of all SSMs to clusters.
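One way the "pick the best tree" idea might look is sketched below. The data layout here is an assumption made purely for illustration: `tree_llh` (tree index to log-likelihood) and `assignments` (tree index to a map of SSM ID to node) are hypothetical names, and the real JSON written by the posthoc script may be organised differently.

```python
from typing import Dict, Tuple


def assignments_from_best_tree(
    tree_llh: Dict[str, float],
    assignments: Dict[str, Dict[str, int]],
) -> Tuple[str, Dict[str, int]]:
    """Return the index of the highest-likelihood tree and its SSM-to-node map."""
    if not tree_llh:
        raise ValueError("no trees supplied")
    best_tree = max(tree_llh, key=tree_llh.get)
    return best_tree, assignments[best_tree]


# Made-up values for illustration only.
tree_llh = {"0": -1520.4, "1": -1498.7, "2": -1503.2}
assignments = {
    "0": {"s5000": 1, "s5001": 2},
    "1": {"s5000": 2, "s5001": 2},
    "2": {"s5000": 1, "s5001": 3},
}

best_tree, best_assignments = assignments_from_best_tree(tree_llh, assignments)
print(best_tree, best_assignments)
```

Taking the assignments from a single tree, rather than voting across trees, sidesteps the question of whether node numbers in different sampled trees refer to the same population, which they may well not.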
Hey,
Was hoping for some guidance on using the `post_assign_ssm.py` script.
The idea is that we get the phylogenetic trees/population structure by subsampling ~5000 mutations, but after doing this we are interested in knowing to which population each non-subsampled mutation belongs (for example, there may be a subclonal driver variant important for the tumour's progression that wasn't included in the subsample). The ideal is to end up with the results output by the `write_report.py` script for all mutations in a given sample.
This is what I understand the approach to be:
1. Run `create_phylowgs_inputs.py` with the `-s 5000` option on, as well as outputting all non-subsampled variants using `--nonsubsampled-variants nonsubsampled_ssm_data.txt`.
2. Run `evolve.py` on the subsampled variants.
3. Run `write_results.py`.
4. Run `post_assign_ssm.py`, passing `nonsubsampled_ssm_data.txt` for the `ssm_file` option, the `trees.zip` file for the `trees_file` option, and then all of the non-subsampled ssm_ids separated by spaces?
How close is this to what is supposed to be done? I feel like this may not be quite correct. Does `cnv_data.txt` need to be passed to the `post_assign_ssms.py` script as well? What kind of output does this produce? And can it be re-fed into the `write_reports.py` script to get the type of output I want?
Any help would be greatly appreciated.
Thanks!
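For the last step, collecting the non-subsampled ssm_ids into a space-separated string could be done along these lines. This is a sketch that assumes the SSM file is tab-separated with a header row containing an `id` column; the helper name `read_ssm_ids` and the example rows are invented for illustration.

```python
import csv
from typing import Iterable, List


def read_ssm_ids(lines: Iterable[str], id_column: str = "id") -> List[str]:
    """Collect the ID column from tab-separated SSM data that has a header row."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [row[id_column] for row in reader]


# Invented rows; the column layout is an assumption about the file format.
example_lines = [
    "id\tgene\ta\td\tmu_r\tmu_v\n",
    "s5000\tchr1_12345\t40\t100\t0.999\t0.5\n",
    "s5001\tchr2_67890\t75\t90\t0.999\t0.5\n",
]

ids = read_ssm_ids(example_lines)
print(" ".join(ids))
```

With a real file, an open handle can be passed instead of the list, for example `with open("nonsubsampled_ssm_data.txt", newline="") as handle: ids = read_ssm_ids(handle)`.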