Convert FACETS output - Githubissues

philsteinberg commented 1 year ago

FACETS needs additional parameters. It is easier to just convert the output files into the right format, by making a VCF out of the current output (requires a src_util parser).

test data correct input format: /hot/software/pipeline/pipeline-call-SRC/Nextflow/development/test_inputs/data/SARC0121_DMSO/facets/

data to be formatted: /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F

lydiayliu commented 1 year ago

You can reverse engineer the VCF from my FACETS tsv output using the script in the cnv_facets package that converts the TSV output by FACETS into a VCF. Towards the end of https://github.com/dariober/cnv_facets/blob/master/bin/cnv_facets.R

philsteinberg commented 1 year ago

Example path: /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F/

Example files:

ILHNLNEV000001-T001-P01-F.cncf.txt
ILHNLNEV000001-T001-P01-F.ploidy.txt (get "purity")
ILHNLNEV000001-T001-P01-F.vcf.gz(desired format)

philsteinberg commented 1 year ago

Lydia comments:

CCF = cellular prevelence / cellularity (purity)
purity (% of sample sequence that is the tumor)

cf.em (cellular prevalence) tcn.em (total copy number) lcn.em (minor)

philsteinberg commented 1 year ago

Hi @lydiayliu , I had a couple questions for getting started with the FACETS.py parser.

The information I need to extract: https://github.com/uclahs-cds/tool-SRC-util/blob/4ca54f4bdf13d277284131ebdc6e787430a43c26/src_util/hatchet.py#L85-L93.

The input I should use: pipeline-call-sCNA output found in /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F *.vcf.gz (Allele-specific sCNAs detected by FACETS in VCF format).

From the header in the VCF I would assume the following correspond:

'sample': (get from file name)
'chr': 'CHROM'
'start': 'POS'
'end': 'END'
'major': 'TCN_EM' - 'LCN_EM'
'minor': 'LCN_EM'
'prevalence': 'CF_EM' * purity

Several questions:

each row is a cnv, and I also need to classify whether they are clonal (CF_EM=1 or CF_EM=purity) or subclonal (CF_EM<1)
since CF_EM is described as Cell Fraction, can I assume this is CCF? And if so, can I use the purity from the header to calculate prevalence?
Do I need to consider all the cases in the classify_cnv in https://github.com/dariober/cnv_facets/blob/44fdd18a4beb4f57c6e6e746d31a6399efdbc85b/bin/cnv_facets.R#L563-L574 in my parser?
Since I am working with the VCF and not the *cncf.txt and *ploidy.txt files, should I use the vcf_parser.py or the battenberg.py parser as reference? (I am also not sure where to find the FACETS to Battenberg conversion file that you mentioned in our meeting)

(Sorry for the overload in questions)

lydiayliu commented 1 year ago

Great questions!

Btw start: POS

For the first two, I have forwarded you an email thread where I asked the author of FACETS the exact questions!

I also need to classify whether they are clonal (CF_EM=1 or CF_EM=purity) or subclonal (CF_EM<1)

Do you really need this for the parser? I think it is good to have a parameter cut off (based on CCF, not CF though) so in the cases were a CNA is considered clonal we adjust the CCF to 1 and CF to purity. Again, see email thread and this should be more clear.

Do I need to consider all the cases in the classify_cnv

I don't think you need the classification of the CNV for the parser anywere?

Since I am working with the VCF and not the cncf.txt and ploidy.txt files, should I use the vcf_parser.py or the battenberg.py parser as reference?

I think the vcf parser would be more informative, but idk how well this output actually follows the VCF format. Do consult everything for inspirations! Here's the script for making a facets txt file into a Battenberg-style file: https://github.com/uclahs-cds/project-disease-HeadNeckTumor-HNSC-000084-LNMEvolution/blob/master/launch_cyclocloud/create.battenberg.style.facets.R

philsteinberg commented 1 year ago

Notes based on the email:

The quantity cf we report is the cellular fraction in the whole sample (including normal). Normal diploid is presentin both tumor and normal and so will have cf of 1 always. Cancer cell fraction (ccf) is different from cf in FACETS output. It is cf/purity.

'prevalence': 'CF_EM' clonal: CCF = 1, CF = purity

philsteinberg commented 1 year ago

Notes from conversion file: Calculate ccf (clonal vs if not clonal)

sample_cna$ccf.em <- ifelse(sample_cna$cf.em == 1, 1,
    sample_cna$cf.em / sample_ploidy$purity)

Get data as needed above ^^

subclones$nMaj1_A <- sample_cna$tcn.em - sample_cna$lcn.em
subclones$nMin1_A <- sample_cna$lcn.em
subclones$frac1_A <- sample_cna$ccf

philsteinberg commented 1 year ago

Thoughts on code for now: adapt function https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/vcf_parser.py#L8-L46 from vcf_parser.py

LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0]) 
TCN_EM = int(re.findall(r'TCN_EM=(\d+)', variant.INFO)[0]) 
CF_EM = int(re.findall(r'CF_EM=(\d+)', variant.INFO)[0]) 

variants.append({
    'sample': sample.sample,
    'chr': variant.CHROM,
    'pos': variant.POS,
    'end': variant.END,
    'major': TCN_EM - LCN_EM,
    'minor': LCN_EM,
    'prevalence': CF_EM
})

lydiayliu commented 1 year ago

Notes looking good!

Note that we probably want a parameter that determines which CNAs are "clonal". I'm thinking maybe everything CCF > 0.95 or something. Then all CNAs with CCF > 0.95 would have CCF adjusted to 1.

LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0]) is regex the best way to get the desired field from the INFO column? It's surprising that's not something parsed by vcf.Reader. But it is what it is!

LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0]) 
TCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0])

both say re.findall(r'LCN_EM=(\d+)?

lydiayliu commented 1 year ago

Another thought is that we probably still need to add the other copy number state (1 - ccf portion that is assumed to be diploid balanced) so that for tools like PyClone, the subclonal copy number CNAs are automatically excluded

https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/src_format.py#L111-L122

philsteinberg commented 1 year ago

Thanks for the notes. I can double check the output by cvf.Reader.

I see, so PyClone only clonal (CCF > 0.95), DPClust clonal and subclonal.

'FACETS-PyClone-VI': (parse_battenberg, {'allowed_num_states': [1]}),
'FACETS-PhyloWGS': (parse_battenberg, {}),
'FACETS-DPClust': (parse_battenberg, {'allowed_num_states': [1,2]}),

Will look at Battenberg parser for more details on that

lydiayliu commented 1 year ago

Btw we don't have the allowed_num_states for PhyloWGS because it's already in the parser that we ripped from PhyloWGS itself.

https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/phylowgs_util.py#L468-L526

philsteinberg commented 1 year ago

HI @yashpatel6 I think I need some help with the FACETS parser. I have created a branch and PR just so that you could see the code I have so far. I have tested individual functions for whether they work but do not have a fully functioning parser yet. I was looking at the HATCHET and FACETS input examples from #21 and the FACETS output /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/test-parsers/battenberg/pyclone_vi_input.tsv. Just to be clear the output format should look like this:

mutation_id     sample_id       ref_counts      alt_counts      normal_cn       major_cn        minor_cn        tumour_content
2_1480776_G_T   ILHNLNEV000001-N002-A01-F       15      22      2       1       1       0.905
2_40720100_C_A  ILHNLNEV000001-N002-A01-F       28      31      2       1       1       0.905
2_48564587_C_T  ILHNLNEV000001-N002-A01-F       8       10      2       1       1       0.905

If not, is there perhaps a HATCHETs output example file that I can look at to make sure I understand what the parser does?

Other questions/comments I have:

[ ] At first I was going to work off of the vcf_parser.py because the input is a VCF. However, I realized I would have to change core functions such as VariantParser since I want different information than what is specified in that function. Therefore, I modeled the FACETS parser most closely to Battenberg.
[ ] As a result, I had extracted LCN_EM, TCN_EN, CF_EM out of the INFO column string, and purity from the VCF header using regrex. Is there a better, less hard coded way to do this?
[ ] I am not sure my logic for registering the first and second clone are correct. Also, should I be storing the CCF somewhere?
[ ] Not sure why the find_indices function is not working. Do I need to modify this more for the VCF?

I would appreciate some comments/help on my code, thank you!

yashpatel6 commented 1 year ago

HI @yashpatel6 I think I need some help with the FACETS parser. I have created a branch and PR just so that you could see the code I have so far. I have tested individual functions for whether they work but do not have a fully functioning parser yet. I was looking at the HATCHET and FACETS input examples from #21 and the FACETS output /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/test-parsers/battenberg/pyclone_vi_input.tsv. Just to be clear the output format should look like this:
mutation_id     sample_id       ref_counts      alt_counts      normal_cn       major_cn        minor_cn        tumour_content
2_1480776_G_T   ILHNLNEV000001-N002-A01-F       15      22      2       1       1       0.905
2_40720100_C_A  ILHNLNEV000001-N002-A01-F       28      31      2       1       1       0.905
2_48564587_C_T  ILHNLNEV000001-N002-A01-F       8       10      2       1       1       0.905
If not, is there perhaps a HATCHETs output example file that I can look at to make sure I understand what the parser does?

So what you're trying to add is a parser to extract CNA information from FACETS output. The information we want to get out is similar to what the other CNA parsers produce: https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/battenberg.py#L112-L122

The file you have linked above is the input for one of the SRC tools, which is generated with the combination of CNA and SNV information. For FACETS, you only have to handle taking in the input and converting it to include the information in the lines I linked above. Basically, you'll want the output of the parser to be a list of CNAs and each CNA will have the information in those lines (extracted from the FACETS input). Along with that, you'll want to return a purity estimate per sample.

uclahs-cds / project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed

Convert FACETS output #60