raphael-group / decifer

DeCiFer is an algorithm that simultaneously selects mutation multiplicities and clusters SNVs by their corresponding descendant cell fractions (DCF).
BSD 3-Clause "New" or "Revised" License
create_input.ipynb fails to generate input file #26

Closed: gbnci closed this issue 1 year ago

gbnci commented 1 year ago

Hello,

I am having trouble using create_input.ipynb to generate the input file for a DeCiFer analysis. I believe I followed the instructions exactly (the four steps on GitHub: https://github.com/raphael-group/decifer/tree/main/scripts/input_from_varscan).

Step 1: I generated mpileup files for all 3 samples (one normal and two tumors), then ran VarScan on the two tumor samples and generated two snp files. The snp file format looks like this:

chrom position ref var normal_reads1 normal_reads2 normal_var_freq normal_gt tumor_reads1 tumor_reads2 tumor_var_freq tumor_gt somatic_status variant_p_value somatic_p_value tumor_reads1_plus tumor_reads1_minus tumor_reads2_plus tumor_reads2_minus normal_reads1_plus normal_reads1_minus normal_reads2_plus normal_reads2_minus
chr22 10514994 G A 4 10 71.43% R 10 10 50% R Germline 1.6951461011986938E-8 0.9470358906568122 9 1 10 0 3 1 8 2
chr22 10521826 G A 3 15 83.33% A 1 24 96% A Germline 2.687210146800099E-20 0.19009804715987358 1 0 14 10 0 3 11 4
chr22 10522591 C A 8 4 33.33% M 7 5 41.67% M Germline 7.796188798107704E-4 0.500000000000003 2 5 1 4 5 3 3 1

Step 2: I generated an mpileup.tsv file for each tumor sample, with a format like this:

chr22 10514994 G,A,<*> 16,14,0
chr22 10521826 G,A,<*> 21,31,0
chr22 10522591 C,A,<*> 7,5,0
chr22 10522819 C,T,<*> 16,26,0
chr22 10557811 T,C,<*> 49,38,0
chr22 10558043 C,T,<*> 82,12,0
chr22 10559720 T,C,<*> 26,50,0

Step 3: I used the best.seg.ucn file generated by HATCHet as the CNA file, with a format like this:

CHR START END SAMPLE cn_normal u_normal cn_clone1 u_clone1 cn_clone2 u_clone2

chr1 1 1215461 Case10_Metastasis 1|1 0.364014 5|1 0.605986 4|2 0.03
chr1 1 1215461 Case10_Tumor 1|1 0.37192 5|1 0.0581929 4|2 0.569887
chr1 1215461 15786182 Case10_Metastasis 1|1 0.364014 6|2 0.605986 4|2 0.03

Step 4: For create_input.ipynb, I first changed the file paths and names in the notebook accordingly, then ran these commands on our server:

module load jupyter
module load python
jupyter execute create_input.ipynb

After anywhere from minutes to hours, the job finished and produced two files. The purity file seems right, but "decifer.input.tsv" contains only the header:

############################################
$ more decifer.input.tsv
245440 #characters
2 #samples

sample_index sample_label character_index character_label ref var

###############################################

There is no other data in the file. I also tried running the whole genome instead of only chr22 as shown above, with the same result. I reset the p value=1 and also reduced the Minreads in the create_input.ipynb file, but nothing changed. Could you give me some suggestions about this issue? I would greatly appreciate your help.

YW
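The symptom described above (a header declaring 245440 characters, followed by no data rows) can be confirmed with a short check. This is a minimal Python sketch, not part of DeCiFer: the function name `summarize_decifer_input` is hypothetical, and it assumes only the layout shown in the post, namely two count lines, one column-header line, and then data rows.

```python
import io

def summarize_decifer_input(handle):
    """Return (declared_characters, declared_samples, data_rows).

    Assumes the layout shown in the post: "<N> #characters", "<M> #samples",
    one column-header line, then zero or more data rows.
    """
    declared = {}
    data_rows = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()
        is_count_line = (
            len(fields) == 2
            and fields[0].isdigit()
            and fields[1] in ("#characters", "#samples")
        )
        if is_count_line:
            declared[fields[1].lstrip("#")] = int(fields[0])
        elif line.startswith("#") or fields[0] == "sample_index":
            continue  # column header, with or without a leading '#'
        else:
            data_rows += 1
    return declared.get("characters"), declared.get("samples"), data_rows

# The header-only file from the post, reproduced in memory.
header_only = io.StringIO(
    "245440 #characters\n"
    "2 #samples\n"
    "#sample_index\tsample_label\tcharacter_index\tcharacter_label\tref\tvar\n"
)
print(summarize_decifer_input(header_only))
```

For a file on disk, pass `open("decifer.input.tsv")` instead of the in-memory example. A nonzero declared character count together with zero data rows suggests that variants were counted but never written out, which narrows where to look in the notebook.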

brian-arnold commented 1 year ago

Hi YW,

Apologies for my delayed reply! Could you instead try the script scripts/vcf_2_decifer.py? This is the script we recommend under the section "required input data". The notebook you used for VarScan input is a bit dated and is not supported, as it is tailored specifically to VarScan output files rather than VCF files, which are the standard output format for many, if not all, variant callers. Let me know if you have any further questions.

Brian

gbnci commented 1 year ago

Hi Brian,

Thank you very much for your response. I have indeed been working with both Mutect2 and Strelka over the past week, and it seems to be working well so far. I will certainly ask for help if I have further questions. Please close this issue, and have a nice day.

Thanks,
YW

brian-arnold commented 1 year ago

That sounds good. Both of those variant callers should output VCF files. Just as a heads-up, the vcf_2_decifer.py script works best if you've done multi-sample calling for each patient. That is, if there are multiple tumor samples for a particular patient, do variant calling on all of them simultaneously so that you get a single VCF for the patient with multiple sample columns, one sample column per tumor sample.
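One way to check that a VCF really has one column per tumor sample is to read its `#CHROM` header line. In the VCF format, the first nine columns are fixed (CHROM through FORMAT) and every column after them is a sample. The sketch below is illustrative Python and not part of DeCiFer: the function name `vcf_sample_names` is hypothetical, and the example reuses the sample labels from the CNA file earlier in this thread.

```python
import io

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
FIXED_VCF_COLUMNS = 9

def vcf_sample_names(handle):
    """Return the sample column names from a VCF's #CHROM header line."""
    for line in handle:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            return columns[FIXED_VCF_COLUMNS:]
    raise ValueError("no #CHROM header line found")

# Illustrative header for a patient with two jointly called tumor samples.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tCase10_Tumor\tCase10_Metastasis\n"
)
samples = vcf_sample_names(example_vcf)
print(len(samples), samples)
```

For a compressed file, open it with `gzip.open(path, "rt")` and pass the handle in the same way. Note that tumor-normal callers may also emit a column for the matched normal, so the count can be one higher than the number of tumor samples.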