uclahs-cds / project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed

GNU General Public License v2.0
1 stars 0 forks source link

Convert FACETS output #60

Closed philsteinberg closed 1 year ago

philsteinberg commented 1 year ago

FACETS needs additional parameters. It is easier to just convert the output files into the right format, by making a VCF out of the current output (requires a src_util parser).

test data correct input format: /hot/software/pipeline/pipeline-call-SRC/Nextflow/development/test_inputs/data/SARC0121_DMSO/facets/

data to be formatted: /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F

lydiayliu commented 1 year ago

You can reverse engineer the VCF from my FACETS tsv output using the script in the cnv_facets package that converts the TSV output by FACETS into a VCF. Towards the end of https://github.com/dariober/cnv_facets/blob/master/bin/cnv_facets.R

philsteinberg commented 1 year ago

Example path: /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F/

Example files:

philsteinberg commented 1 year ago

Lydia comments:

cf.em (cellular prevalence) tcn.em (total copy number) lcn.em (minor)

philsteinberg commented 1 year ago

Hi @lydiayliu , I had a couple questions for getting started with the FACETS.py parser.

The information I need to extract: https://github.com/uclahs-cds/tool-SRC-util/blob/4ca54f4bdf13d277284131ebdc6e787430a43c26/src_util/hatchet.py#L85-L93.

The input I should use: pipeline-call-sCNA output found in /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F *.vcf.gz (Allele-specific sCNAs detected by FACETS in VCF format).

From the header in the VCF I would assume the following correspond:

Several questions:

(Sorry for the overload in questions)

lydiayliu commented 1 year ago

Great questions!

Btw start: POS

For the first two, I have forwarded you an email thread where I asked the author of FACETS the exact questions!

I also need to classify whether they are clonal (CF_EM=1 or CF_EM=purity) or subclonal (CF_EM<1)

Do you really need this for the parser? I think it is good to have a parameter cut off (based on CCF, not CF though) so in the cases were a CNA is considered clonal we adjust the CCF to 1 and CF to purity. Again, see email thread and this should be more clear.

Do I need to consider all the cases in the classify_cnv

I don't think you need the classification of the CNV for the parser anywere?

Since I am working with the VCF and not the cncf.txt and ploidy.txt files, should I use the vcf_parser.py or the battenberg.py parser as reference?

I think the vcf parser would be more informative, but idk how well this output actually follows the VCF format. Do consult everything for inspirations! Here's the script for making a facets txt file into a Battenberg-style file: https://github.com/uclahs-cds/project-disease-HeadNeckTumor-HNSC-000084-LNMEvolution/blob/master/launch_cyclocloud/create.battenberg.style.facets.R

philsteinberg commented 1 year ago

Notes based on the email:

'prevalence': 'CF_EM' clonal: CCF = 1, CF = purity

philsteinberg commented 1 year ago

Notes from conversion file: Calculate ccf (clonal vs if not clonal)

sample_cna$ccf.em <- ifelse(sample_cna$cf.em == 1, 1,
    sample_cna$cf.em / sample_ploidy$purity)

Get data as needed above ^^

subclones$nMaj1_A <- sample_cna$tcn.em - sample_cna$lcn.em
subclones$nMin1_A <- sample_cna$lcn.em
subclones$frac1_A <- sample_cna$ccf
philsteinberg commented 1 year ago

Thoughts on code for now: adapt function https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/vcf_parser.py#L8-L46 from vcf_parser.py

LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0]) 
TCN_EM = int(re.findall(r'TCN_EM=(\d+)', variant.INFO)[0]) 
CF_EM = int(re.findall(r'CF_EM=(\d+)', variant.INFO)[0]) 

variants.append({
    'sample': sample.sample,
    'chr': variant.CHROM,
    'pos': variant.POS,
    'end': variant.END,
    'major': TCN_EM - LCN_EM,
    'minor': LCN_EM,
    'prevalence': CF_EM
})
lydiayliu commented 1 year ago

Notes looking good!

Note that we probably want a parameter that determines which CNAs are "clonal". I'm thinking maybe everything CCF > 0.95 or something. Then all CNAs with CCF > 0.95 would have CCF adjusted to 1.

LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0]) is regex the best way to get the desired field from the INFO column? It's surprising that's not something parsed by vcf.Reader. But it is what it is!

LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0]) 
TCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0]) 

both say re.findall(r'LCN_EM=(\d+)?

lydiayliu commented 1 year ago

Another thought is that we probably still need to add the other copy number state (1 - ccf portion that is assumed to be diploid balanced) so that for tools like PyClone, the subclonal copy number CNAs are automatically excluded

https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/src_format.py#L111-L122

philsteinberg commented 1 year ago

Thanks for the notes. I can double check the output by cvf.Reader.

I see, so PyClone only clonal (CCF > 0.95), DPClust clonal and subclonal.

'FACETS-PyClone-VI': (parse_battenberg, {'allowed_num_states': [1]}),
'FACETS-PhyloWGS': (parse_battenberg, {}),
'FACETS-DPClust': (parse_battenberg, {'allowed_num_states': [1,2]}),

Will look at Battenberg parser for more details on that

lydiayliu commented 1 year ago

Btw we don't have the allowed_num_states for PhyloWGS because it's already in the parser that we ripped from PhyloWGS itself.

https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/phylowgs_util.py#L468-L526

philsteinberg commented 1 year ago

HI @yashpatel6 I think I need some help with the FACETS parser. I have created a branch and PR just so that you could see the code I have so far. I have tested individual functions for whether they work but do not have a fully functioning parser yet. I was looking at the HATCHET and FACETS input examples from #21 and the FACETS output /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/test-parsers/battenberg/pyclone_vi_input.tsv. Just to be clear the output format should look like this:

mutation_id     sample_id       ref_counts      alt_counts      normal_cn       major_cn        minor_cn        tumour_content
2_1480776_G_T   ILHNLNEV000001-N002-A01-F       15      22      2       1       1       0.905
2_40720100_C_A  ILHNLNEV000001-N002-A01-F       28      31      2       1       1       0.905
2_48564587_C_T  ILHNLNEV000001-N002-A01-F       8       10      2       1       1       0.905

If not, is there perhaps a HATCHETs output example file that I can look at to make sure I understand what the parser does?

Other questions/comments I have:

I would appreciate some comments/help on my code, thank you!

yashpatel6 commented 1 year ago

HI @yashpatel6 I think I need some help with the FACETS parser. I have created a branch and PR just so that you could see the code I have so far. I have tested individual functions for whether they work but do not have a fully functioning parser yet. I was looking at the HATCHET and FACETS input examples from #21 and the FACETS output /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/test-parsers/battenberg/pyclone_vi_input.tsv. Just to be clear the output format should look like this:

mutation_id     sample_id       ref_counts      alt_counts      normal_cn       major_cn        minor_cn        tumour_content
2_1480776_G_T   ILHNLNEV000001-N002-A01-F       15      22      2       1       1       0.905
2_40720100_C_A  ILHNLNEV000001-N002-A01-F       28      31      2       1       1       0.905
2_48564587_C_T  ILHNLNEV000001-N002-A01-F       8       10      2       1       1       0.905

If not, is there perhaps a HATCHETs output example file that I can look at to make sure I understand what the parser does?

So what you're trying to add is a parser to extract CNA information from FACETS output. The information we want to get out is similar to what the other CNA parsers produce: https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/battenberg.py#L112-L122

The file you have linked above is the input for one of the SRC tools, which is generated with the combination of CNA and SNV information. For FACETS, you only have to handle taking in the input and converting it to include the information in the lines I linked above. Basically, you'll want the output of the parser to be a list of CNAs and each CNA will have the information in those lines (extracted from the FACETS input). Along with that, you'll want to return a purity estimate per sample.

Other questions/comments I have:

* [ ]  At first I was going to work off of the vcf_parser.py because the input is a VCF. However, I realized I would have to change core functions such as VariantParser since I want different information than what is specified in that function. Therefore, I modeled the FACETS parser most closely to Battenberg.

* [ ]  As a result, I had extracted `LCN_EM`, `TCN_EN`, `CF_EM` out of the `INFO` column string, and purity from the VCF header using regrex. Is there a better, less hard coded way to do this?

Hm if the output format is a proper VCF, I would recommend using the vcf library (like in the vcf_parser) to parse and extract the fields. That way there won't be as much hard-coded regex and the library can handle parsing.

* [ ]  I am not sure my logic for registering the first and second clone are correct. Also, should I be storing the `CCF` somewhere?

CCF isn't directly required, you'll want to make sure to output the sample purity and the prev/frac for each CNA

* [ ]  Not sure why the find_indices function is not working. Do I need to modify this more for the VCF?

I think the issue is that the find_indices function would need to be updated to parse the column names that match the file you're using. Though if it's a VCF file, I would go with the suggestion of using the library to parse it for you.

philsteinberg commented 1 year ago

@yashpatel6 Thanks! I will try to modify my code in the PR to use the vcf_parser once I get back from my grad trip.

And thanks, yes, I am outputting the purity and CP at the moment. Right, I am only using the CCF values to classify whether a cnv is clonal or subclonal.