Closed philsteinberg closed 1 year ago
You can reverse engineer the VCF from my FACETS tsv output using the script in the cnv_facets package that converts the TSV output by FACETS into a VCF. Towards the end of https://github.com/dariober/cnv_facets/blob/master/bin/cnv_facets.R
Example path: /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F/
Example files:
ILHNLNEV000001-T001-P01-F.cncf.txt
ILHNLNEV000001-T001-P01-F.ploidy.txt
(get "purity")ILHNLNEV000001-T001-P01-F.vcf.gz
(desired format)Lydia comments:
cf.em
(cellular prevalence)
tcn.em
(total copy number)
lcn.em
(minor)
Hi @lydiayliu , I had a couple questions for getting started with the FACETS.py
parser.
The information I need to extract: https://github.com/uclahs-cds/tool-SRC-util/blob/4ca54f4bdf13d277284131ebdc6e787430a43c26/src_util/hatchet.py#L85-L93.
The input I should use: pipeline-call-sCNA
output found in /hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F
*.vcf.gz
(Allele-specific sCNAs detected by FACETS in VCF format).
From the header in the VCF I would assume the following correspond:
Several questions:
cnv
, and I also need to classify whether they are clonal (CF_EM=1
or CF_EM=purity
) or subclonal (CF_EM<1
)CF_EM
is described as Cell Fraction, can I assume this is CCF? And if so, can I use the purity from the header to calculate prevalence?classify_cnv
in https://github.com/dariober/cnv_facets/blob/44fdd18a4beb4f57c6e6e746d31a6399efdbc85b/bin/cnv_facets.R#L563-L574 in my parser?*cncf.txt
and *ploidy.txt
files, should I use the vcf_parser.py
or the battenberg.py
parser as reference? (I am also not sure where to find the FACETS to Battenberg conversion file that you mentioned in our meeting)(Sorry for the overload in questions)
Great questions!
Btw start
: POS
For the first two, I have forwarded you an email thread where I asked the author of FACETS the exact questions!
I also need to classify whether they are clonal (CF_EM=1 or CF_EM=purity) or subclonal (CF_EM<1)
Do you really need this for the parser? I think it is good to have a parameter cut off (based on CCF, not CF though) so in the cases were a CNA is considered clonal we adjust the CCF to 1 and CF to purity. Again, see email thread and this should be more clear.
Do I need to consider all the cases in the classify_cnv
I don't think you need the classification of the CNV for the parser anywere?
Since I am working with the VCF and not the cncf.txt and ploidy.txt files, should I use the vcf_parser.py or the battenberg.py parser as reference?
I think the vcf parser would be more informative, but idk how well this output actually follows the VCF format. Do consult everything for inspirations! Here's the script for making a facets txt file into a Battenberg-style file: https://github.com/uclahs-cds/project-disease-HeadNeckTumor-HNSC-000084-LNMEvolution/blob/master/launch_cyclocloud/create.battenberg.style.facets.R
Notes based on the email:
'prevalence': 'CF_EM'
clonal: CCF = 1
, CF = purity
Notes from conversion file: Calculate ccf (clonal vs if not clonal)
sample_cna$ccf.em <- ifelse(sample_cna$cf.em == 1, 1,
sample_cna$cf.em / sample_ploidy$purity)
Get data as needed above ^^
subclones$nMaj1_A <- sample_cna$tcn.em - sample_cna$lcn.em
subclones$nMin1_A <- sample_cna$lcn.em
subclones$frac1_A <- sample_cna$ccf
Thoughts on code for now:
adapt function https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/vcf_parser.py#L8-L46 from vcf_parser.py
LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0])
TCN_EM = int(re.findall(r'TCN_EM=(\d+)', variant.INFO)[0])
CF_EM = int(re.findall(r'CF_EM=(\d+)', variant.INFO)[0])
variants.append({
'sample': sample.sample,
'chr': variant.CHROM,
'pos': variant.POS,
'end': variant.END,
'major': TCN_EM - LCN_EM,
'minor': LCN_EM,
'prevalence': CF_EM
})
Notes looking good!
Note that we probably want a parameter that determines which CNAs are "clonal". I'm thinking maybe everything CCF > 0.95
or something. Then all CNAs with CCF > 0.95 would have CCF adjusted to 1.
LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0])
is regex the best way to get the desired field from the INFO column? It's surprising that's not something parsed by vcf.Reader
. But it is what it is!
LCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0])
TCN_EM = int(re.findall(r'LCN_EM=(\d+)', variant.INFO)[0])
both say re.findall(r'LCN_EM=(\d+)
?
Another thought is that we probably still need to add the other copy number state (1 - ccf portion that is assumed to be diploid balanced) so that for tools like PyClone, the subclonal copy number CNAs are automatically excluded
Thanks for the notes. I can double check the output by cvf.Reader
.
I see, so PyClone only clonal (CCF > 0.95
), DPClust clonal and subclonal.
'FACETS-PyClone-VI': (parse_battenberg, {'allowed_num_states': [1]}),
'FACETS-PhyloWGS': (parse_battenberg, {}),
'FACETS-DPClust': (parse_battenberg, {'allowed_num_states': [1,2]}),
Will look at Battenberg parser for more details on that
Btw we don't have the allowed_num_states
for PhyloWGS because it's already in the parser that we ripped from PhyloWGS itself.
HI @yashpatel6 I think I need some help with the FACETS parser. I have created a branch and PR just so that you could see the code I have so far. I have tested individual functions for whether they work but do not have a fully functioning parser yet.
I was looking at the HATCHET and FACETS input examples from #21 and the FACETS output /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/test-parsers/battenberg/pyclone_vi_input.tsv
. Just to be clear the output format should look like this:
mutation_id sample_id ref_counts alt_counts normal_cn major_cn minor_cn tumour_content
2_1480776_G_T ILHNLNEV000001-N002-A01-F 15 22 2 1 1 0.905
2_40720100_C_A ILHNLNEV000001-N002-A01-F 28 31 2 1 1 0.905
2_48564587_C_T ILHNLNEV000001-N002-A01-F 8 10 2 1 1 0.905
If not, is there perhaps a HATCHETs output example file that I can look at to make sure I understand what the parser does?
Other questions/comments I have:
LCN_EM
, TCN_EN
, CF_EM
out of the INFO
column string, and purity from the VCF header using regrex. Is there a better, less hard coded way to do this?CCF
somewhere?I would appreciate some comments/help on my code, thank you!
HI @yashpatel6 I think I need some help with the FACETS parser. I have created a branch and PR just so that you could see the code I have so far. I have tested individual functions for whether they work but do not have a fully functioning parser yet. I was looking at the HATCHET and FACETS input examples from #21 and the FACETS output
/hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/test-parsers/battenberg/pyclone_vi_input.tsv
. Just to be clear the output format should look like this:mutation_id sample_id ref_counts alt_counts normal_cn major_cn minor_cn tumour_content 2_1480776_G_T ILHNLNEV000001-N002-A01-F 15 22 2 1 1 0.905 2_40720100_C_A ILHNLNEV000001-N002-A01-F 28 31 2 1 1 0.905 2_48564587_C_T ILHNLNEV000001-N002-A01-F 8 10 2 1 1 0.905
If not, is there perhaps a HATCHETs output example file that I can look at to make sure I understand what the parser does?
So what you're trying to add is a parser to extract CNA information from FACETS output. The information we want to get out is similar to what the other CNA parsers produce: https://github.com/uclahs-cds/tool-SRC-util/blob/cc59e28191c91f35731524b911f04c24f8b4e8b9/src_util/battenberg.py#L112-L122
The file you have linked above is the input for one of the SRC tools, which is generated with the combination of CNA and SNV information. For FACETS, you only have to handle taking in the input and converting it to include the information in the lines I linked above. Basically, you'll want the output of the parser to be a list of CNAs and each CNA will have the information in those lines (extracted from the FACETS input). Along with that, you'll want to return a purity estimate per sample.
Other questions/comments I have:
* [ ] At first I was going to work off of the vcf_parser.py because the input is a VCF. However, I realized I would have to change core functions such as VariantParser since I want different information than what is specified in that function. Therefore, I modeled the FACETS parser most closely to Battenberg. * [ ] As a result, I had extracted `LCN_EM`, `TCN_EN`, `CF_EM` out of the `INFO` column string, and purity from the VCF header using regrex. Is there a better, less hard coded way to do this?
Hm if the output format is a proper VCF, I would recommend using the vcf
library (like in the vcf_parser
) to parse and extract the fields. That way there won't be as much hard-coded regex and the library can handle parsing.
* [ ] I am not sure my logic for registering the first and second clone are correct. Also, should I be storing the `CCF` somewhere?
CCF
isn't directly required, you'll want to make sure to output the sample purity and the prev/frac for each CNA
* [ ] Not sure why the find_indices function is not working. Do I need to modify this more for the VCF?
I think the issue is that the find_indices
function would need to be updated to parse the column names that match the file you're using. Though if it's a VCF file, I would go with the suggestion of using the library to parse it for you.
@yashpatel6 Thanks! I will try to modify my code in the PR to use the vcf_parser
once I get back from my grad trip.
And thanks, yes, I am outputting the purity
and CP
at the moment. Right, I am only using the CCF
values to classify whether a cnv
is clonal
or subclonal
.
FACETS needs additional parameters. It is easier to just convert the output files into the right format, by making a VCF out of the current output (requires a src_util parser).
test data correct input format:
/hot/software/pipeline/pipeline-call-SRC/Nextflow/development/test_inputs/data/SARC0121_DMSO/facets/
data to be formatted:
/hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/data/FACETS/cnv_facets/selected_cval/ILHNLNEV000001-T001-P01-F