Open trptyrphe11 opened 7 years ago
Not having used either sequenza or PyClone, I am not sure what the format is. It would be helpful if you could provide a sample file so I can see whether it can be generated easily.
Sorry for not being clear. In general, PyClone takes a tab-delimited file with a header as input.
The required fields are:
mutation_id - A unique ID to identify the mutation. Good IDs are things such as the genomic coordinates of the mutation, e.g. chr22:12345. Gene names are not good IDs because one gene may have multiple mutations, in which case the ID is not unique and PyClone will fail to run or, worse, give unexpected results. If you want to include the gene name, I suggest adding the genomic coordinates, e.g. TP53_chr17:753342.
ref_counts - The number of reads covering the mutation which contain the reference (genome) allele.
var_counts - The number of reads covering the mutation which contain the variant allele.
normal_cn - The copy number of the cells in the normal population. For autosomal chromosomes this will be 2 and for sex chromosomes it could be either 1 or 2. For species besides human other values are possible.
minor_cn - The minor copy number of the cancer cells. Usually this value will be predicted from WGSS or array data.
major_cn - The major copy number of the cancer cells. Usually this value will be predicted from WGSS or array data.
An example TSV looks like:
mutation_id	ref_counts	var_counts	normal_cn	minor_cn	major_cn	variant_case	variant_freq	genotype
NA12156:BB:chr2:175263063	3812	14	2	0	2	NA12156	0.0036591740721380033	BB
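For anyone generating this file by script, here is a minimal Python sketch of writing the six required columns as a tab-delimited file with a header. The function name write_pyclone_tsv and the duplicate-ID check are illustrative assumptions, not part of PyClone or FACETS; the example row reuses the values shown above.

```python
import csv
import io

# The six required columns, in the order listed in this thread.
REQUIRED_COLUMNS = [
    "mutation_id", "ref_counts", "var_counts",
    "normal_cn", "minor_cn", "major_cn",
]


def write_pyclone_tsv(rows, handle):
    """Write dict rows as a tab-delimited table with a header (hypothetical helper)."""
    writer = csv.DictWriter(
        handle,
        fieldnames=REQUIRED_COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop any extra keys in a row
    )
    writer.writeheader()
    seen_ids = set()
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"row is missing required columns: {missing}")
        # mutation_id must be unique, as explained above.
        if row["mutation_id"] in seen_ids:
            raise ValueError(f"duplicate mutation_id: {row['mutation_id']}")
        seen_ids.add(row["mutation_id"])
        writer.writerow(row)


example_row = {
    "mutation_id": "NA12156:BB:chr2:175263063",
    "ref_counts": 3812,
    "var_counts": 14,
    "normal_cn": 2,
    "minor_cn": 0,
    "major_cn": 2,
}

buffer = io.StringIO()
write_pyclone_tsv([example_row], buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="") as the handle.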
FACETS doesn't call mutations. So you need to merge a file with mutations (called using your favorite mutation caller) with the copy number calls from FACETS to generate this file. Extracting the copy number information for a given position should be easy from the segmentation table of the output.
To make sure I extract the right information: do you mean merging the jointseg data frame from procSample's output with the fit$cncf data frame, where lcn.em represents the minor copy number and (tcn.em - lcn.em) represents the major copy number? Thanks.
If you are using the current version of FACETS, you would only need fit$cncf. In that dataframe the columns "start" and "end" give the genomic position where the segment starts and ends.
I see. Is the start 0-based and the end 1-based, as in BED format, or are both 1-based, as in VCF format? Thanks.
1-based
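Since both ends are 1-based, a mutation position taken directly from a VCF can be compared against a segment with an inclusive check on both sides. Below is a minimal Python sketch of that lookup, assuming the fit$cncf rows have been exported to a list of dicts. The column name "chrom" and the toy segment values are assumptions for illustration; "start", "end", "tcn.em" and "lcn.em" are the columns mentioned in this thread, and the major copy number is taken as tcn.em - lcn.em.

```python
def find_segment(segments, chrom, pos):
    """Return the first segment containing a 1-based position, or None."""
    for seg in segments:
        # Both start and end are 1-based and inclusive.
        if seg["chrom"] == chrom and seg["start"] <= pos <= seg["end"]:
            return seg
    return None


def copy_numbers(seg):
    """Derive minor and major copy number from one segment row."""
    tcn, lcn = seg["tcn.em"], seg["lcn.em"]
    if lcn is None:  # lcn.em was NA in the FACETS output
        return None
    return {"minor_cn": lcn, "major_cn": tcn - lcn}


# Toy segments, not real FACETS output.
segments = [
    {"chrom": "2", "start": 1, "end": 1000, "tcn.em": 2, "lcn.em": 1},
    {"chrom": "2", "start": 1001, "end": 5000, "tcn.em": 3, "lcn.em": 1},
]

hit = find_segment(segments, "2", 1001)
cn = copy_numbers(hit)
```

A linear scan is fine for a few hundred segments; for many mutations, sorting the segments per chromosome and using bisect would be faster.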
One more question after examining the output more closely: I saw that several segments (~15%) have an lcn.em of NA. When I integrate the result with my variant file to prepare the PyClone input for estimating mutation clonality, should I replace those minor copy numbers with 0, or should I filter out those mutations? Thanks.
Filtering may be a better idea. These are typically focal changes. So if tcn is large, you can check whether including them with lcn=1 gives you sensible answers.
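The two options can be sketched in Python as follows. This is only an illustration: the function name and the large_tcn_threshold parameter are hypothetical, and no cutoff for "large" is given in this thread, so the sketch filters by default and only substitutes lcn=1 when the caller supplies a threshold explicitly.

```python
def resolve_copy_number(tcn, lcn, large_tcn_threshold=None):
    """Return (minor_cn, major_cn), or None if the mutation should be filtered.

    large_tcn_threshold is a hypothetical, user-chosen cutoff; leaving it
    as None means every segment with a missing lcn is filtered out.
    """
    if lcn is not None:
        return lcn, tcn - lcn
    if large_tcn_threshold is not None and tcn >= large_tcn_threshold:
        # Experimental option: assume lcn=1 and check that results look sensible.
        return 1, tcn - 1
    return None  # default: filter the mutation


kept = resolve_copy_number(4, 1)
filtered = resolve_copy_number(3, None)
assumed = resolve_copy_number(10, None, large_tcn_threshold=8)
```

Running PyClone once with the filtered set and once with the lcn=1 substitutions makes it easier to see whether the substituted mutations change the clustering.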
So how large is large? Thanks a lot.
I used FACETS and loved it for its speed and easy implementation. Is there a function to generate input files for PyClone (like the sequenza function sequenza2pyclone)? Thanks.