uclahs-cds / project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed

GNU General Public License v2.0

Phils add parse phylowgs subclones #103

Closed philsteinberg closed 1 year ago

philsteinberg commented 1 year ago

Description

Modified the R script for parsing PhyloWGS output (https://github.com/uclahs-cds/project-CPCGENE-subclonality/blob/master/200PT/ssm_cnv_from_json.R) to extract the number of subclones from the best tree for each sample x seed x pipeline x mode combination.

Closes #93

Analysis Results

parse_consensus_tree_PhyloWGS.R generates the consensus tree for a single pipeline output. Example command:

Rscript ./parse_consensus_tree_PhyloWGS.R \
-s /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/call-SRC-1.0.0-rc.1/ILHNLNEV000007-T001-P01-F/PhyloWGS-2205be1/output/PhyloWGS-2205be1_366306_ILHNLNEV000007-T001-P01-F_Mutect2-Battenberg-summ.json.gz \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/consensus_tree

To run parse_consensus_tree_PhyloWGS.R on all sample x seed combinations of a pipeline, set the appropriate pipeline in the submission script run_parse_consensus_tree_PhyloWGS.sh and run it. parse_num_subclones_PhyloWGS.R then reads all consensus trees in a directory and writes a summary output file.

This file includes every sample x seed combination that has pipeline output. PhyloWGS has 10 or fewer subclone entries per sample because some sample x seed combinations fail with errors (https://github.com/uclahs-cds/project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed/issues/95, https://github.com/uclahs-cds/project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed/issues/56, https://github.com/uclahs-cds/project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed/issues/63).
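One way to see which samples have fewer than 10 entries is to count the consensus tree files per sample. The following is a minimal sketch, not part of the repository: it assumes the tree files follow the `<sample>_<seed>_consensus_tree.txt` naming that appears in this PR (for example `ILHNLNEV000001-T001-P01-F_628019_consensus_tree.txt`), and the function name `count_trees_per_sample` is hypothetical.

```shell
# Print "<sample><TAB><number of consensus tree files>" for one directory.
# Assumption: files are named <sample>_<seed>_consensus_tree.txt and the
# sample ID itself contains no underscore followed by the seed pattern.
count_trees_per_sample() {
    tree_dir="$1"
    for f in "$tree_dir"/*_consensus_tree.txt; do
        # Skip the unexpanded pattern when the directory has no matches.
        [ -e "$f" ] || continue
        name="$(basename "$f")"
        # Strip the trailing "_<seed>_consensus_tree.txt" to get the sample ID.
        sample="${name%_*_consensus_tree.txt}"
        printf '%s\n' "$sample"
    done | sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Example call (path taken from the commands in this PR):
# count_trees_per_sample /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/consensus_tree
```

Samples listed with a count below 10 would be the ones where at least one seed did not produce a tree.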

Mutect2-Battenberg-PhyloWGS-sr

module load R
Rscript ./parse_num_subclones_PhyloWGS.R \
-i /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/consensus_tree \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output \
-p mutect2_battenberg_phylowgs_sr

output: /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/2023-04-27_num_subclones_mutect2-battenberg-phylowgs_sr.tsv

SomaticSniper-Battenberg-PhyloWGS-sr

module load R
Rscript ./parse_num_subclones_PhyloWGS.R \
-i /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-somaticsniper-battenberg-phylowgs/output/consensus_tree \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-somaticsniper-battenberg-phylowgs/output \
-p somaticsniper_battenberg_phylowgs_sr

output: /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-somaticsniper-battenberg-phylowgs/output/2023-05-09_num_subclones_somaticsniper_battenberg_phylowgs_sr.tsv


philsteinberg commented 1 year ago

Since you've already cleaned up the code to go through all the JSON files to arrive at my.df, why don't we make that the end output of this parsing script so one never has to look at the JSONs again? (I recognize that the mutations are a different story, but we won't worry about that here.)

I would suggest writing my.df as the output of this script (call it tree or something), and then having a new script that takes all the tree files in a directory (for all sample x seed combinations of a pipeline) and outputs the number of subclones per pipeline, like the file used in the plotting script. I think that would be a more elegant solution than the concatenating bash bit!

Should I add this part of the original script before writing my.df to a file? https://github.com/uclahs-cds/project-CPCGENE-subclonality/blob/ccc7377a82bba591e0868499c83c3cb138e432af/200PT/ssm_cnv_from_json.R#L212-L298

philsteinberg commented 1 year ago

Split the files how you suggested. Running the script on an individual example file works. Get tree df:

Rscript ./parse_consensus_tree_PhyloWGS.R \
-s /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/call-SRC-1.0.0-rc.1/ILHNLNEV000001-T001-P01-F/PhyloWGS-2205be1/output/PhyloWGS-2205be1_366306_ILHNLNEV000001-T001-P01-F_Mutect2-Battenberg-summ.json.gz \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/consensus_tree

Get number of subclones from tree df:

Rscript ./parse_num_subclones_PhyloWGS.R \
-t /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/consensus_tree/ILHNLNEV000001-T001-P01-F_628019_consensus_tree.txt \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/

Not sure why this does not run the scripts on all files in the respective directory?

Rscript /hot/users/psteinberg/git/project-SRC-RandomSeed/src/output-analysis/parsing/parse_consensus_tree_PhyloWGS.R \
-s *summ.json.gz \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/consensus_tree/
Rscript /hot/users/psteinberg/git/project-SRC-RandomSeed/src/output-analysis/parsing/parse_num_subclones_PhyloWGS.R \
-t *consensus_tree.txt \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/

lydiayliu commented 1 year ago

Should I add this part of the original script before writing my.df to a file?

Yes please do, if it's not too much work. I believe that part handles the CCF calculations properly for multi-region samples

Oh sorry I wasn't being clear. I don't think -s *summ.json.gz works because of the way ArgumentParser is set up, but at the same time it's not necessary.

As an example, I was thinking that ./parse_num_subclones_PhyloWGS.R can take as input the directory path of /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-mutect2-battenberg-phylowgs/output/consensus_tree/. Then in the script you can do a list.files() and then loop through each of the file paths, read them in, get number of subclones and save everything in a data frame. Then you can save 1 data frame per pipeline like the one used by the plotting script
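For context on the glob problem: the shell expands `*summ.json.gz` relative to the current directory before Rscript starts, so the script receives several values for a flag that expects one path (or the literal pattern if nothing matches). If per-file calls are still wanted from the command line, a loop is one option. This is only a sketch: the function name `parse_all_trees` and the `RSCRIPT` override are hypothetical, and only the `-s` and `-o` flags come from the commands shown in this PR.

```shell
# Run the single-file parser once per *summ.json.gz found under a directory.
# Arguments: <input root> <output dir> [parser script path]
# RSCRIPT can be overridden (e.g. RSCRIPT=echo) for a dry run.
parse_all_trees() {
    input_root="$1"
    output_dir="$2"
    parser="${3:-./parse_consensus_tree_PhyloWGS.R}"
    runner="${RSCRIPT:-Rscript}"
    # find searches nested sample directories; sort keeps the order stable.
    find "$input_root" -name '*summ.json.gz' | sort | while IFS= read -r json; do
        # Redirect stdin so the parser cannot consume the file list.
        "$runner" "$parser" -s "$json" -o "$output_dir" < /dev/null
    done
}

# Dry run that only prints the commands that would be executed:
# RSCRIPT=echo parse_all_trees /path/to/call-SRC-output /path/to/consensus_tree
```

The directory-based approach described above avoids this loop entirely for parse_num_subclones_PhyloWGS.R, since the script can enumerate the files itself.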

lydiayliu commented 1 year ago

I got another review request but don't see any changes. Did you push your commits?

philsteinberg commented 1 year ago

Ah sorry I just saw the above comment. I have not addressed it yet, will do it after class/lab meeting.

philsteinberg commented 1 year ago

@lydiayliu changes are implemented now!

lydiayliu commented 1 year ago

uh so I tried to commit one minor change (editing out the comment on parse_consensus_tree_phyloWGS.R) but I get the error

Interesting, I have not seen this before. Did you try to push using a different computer, or locally vs on the cluster with ssh or something? It seems like a github credential error.

fatal: Authentication failed for 'https://github.com/uclahs-cds/project-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed/'

The https:// here is likely the issue. The remote should use ssh, like git@github.com:uclahs-cds/project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed

I don't mind if you just edit the script on the web and commit the change to this branch / PR! I sometimes take advantage of the interactive online github as well when I'm fixing small things
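If the remote does need to be switched from HTTPS to SSH, something like the following could help. It is a sketch under assumptions: the remote is named `origin`, an SSH key is already registered with GitHub, and the helper name `https_to_ssh_url` is hypothetical. `git remote get-url` and `git remote set-url` are standard git subcommands.

```shell
# Convert a GitHub HTTPS remote URL to its SSH form; other URLs pass through.
https_to_ssh_url() {
    url="$1"
    case "$url" in
        https://github.com/*)
            path="${url#https://github.com/}"   # keep "<org>/<repo>"
            path="${path%/}"                    # drop a trailing slash, if any
            printf 'git@github.com:%s\n' "$path"
            ;;
        *)
            printf '%s\n' "$url"                # leave non-HTTPS URLs untouched
            ;;
    esac
}

# Inside the clone (uncomment to apply; assumes the remote is "origin"):
# git remote -v
# git remote set-url origin "$(https_to_ssh_url "$(git remote get-url origin)")"
# git remote -v
```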

philsteinberg commented 1 year ago

Works again, thanks!