vahidAK / NanoMethPhase

Methylation Phasing for Nanopore Sequencing
GNU General Public License v3.0

VCF file 'Format' and 'Sample' columns #1

Closed by skambha6 3 years ago

skambha6 commented 3 years ago

Hi,

I have a VCF file that doesn't have the optional 'Format' and 'Sample' columns, so running nanomethphase phase gives me the following error:

NanoMethPhase selected output format(s): bam
Traceback (most recent call last):
  File "/home-4/skambha6@jhu.edu/.local/bin/nanomethphase", line 10, in <module>
    sys.exit(main())
  File "/home-4/skambha6@jhu.edu/.local/lib/python3.8/site-packages/nanomethphase/main.py", line 1996, in main
    args.func(args)
  File "/home-4/skambha6@jhu.edu/.local/lib/python3.8/site-packages/nanomethphase/main.py", line 709, in main_phase
    vcf_dict = vcf2dict_phase(vcf_file,args.window)
  File "/home-4/skambha6@jhu.edu/.local/lib/python3.8/site-packages/nanomethphase/main.py", line 549, in vcf2dict_phase
    if line_list[9].startswith('1|0') or line_list[9].startswith('0|1'):
IndexError: list index out of range

Is it possible to adjust the way vcf2dict_phase reads the VCF file so that it does not rely on the 'Format' or 'Sample' columns, without hindering any of the downstream phasing?

Thank you! Sandeep

vahidAK commented 3 years ago

Hi @skambha6, has your VCF file already been phased? We require the file's columns to look like this:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA19240  
chr1    11002   .       A       C       815.155 .       .       GT:GQ:DP:AF     0/1:1025:31:0.5484  
chr1    11035   .       G       A       725.874 .       .       GT:GQ:DP:AF     0/1:930:31:0.5161  
chr1    11113   .       T       C       806.659 .       .       GT:GQ:DP:AF     0|1:1021:31:0.4194

The file must have 10 columns. The important columns are:

1st: chromosome name
2nd: position of the variant
4th: reference base
5th: alternative (mutated) base
10th: the sample column. It is enough for this column to contain only the zygosity and phase information. Phase must be indicated by "|" (e.g. 0|1).

So, if your VCF is not in this format, you can convert it manually, as long as the required columns are present. A file without a header is also fine. For example, this works too:

chr1    10097   .       T       C       .     .       .       .     0/1
chr1    10197   .       T       C       .     .       .       .     0|1
chr1    10291   .       C       T       .     .       .       .     0/1
chr1    10391   .       G       T       .     .       .       .     0|1
chr1    10591   .       A       T       .     .       .       .     1|0
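To check a file against these requirements before running the phase step, something like the following could help. It is only a sketch, not NanoMethPhase code: the function name classify_vcf_line is made up for illustration, and it assumes tab-delimited lines and the same startswith test that appears in the traceback above.

```python
def classify_vcf_line(line):
    """Classify one VCF line as 'header', 'too_few_columns', 'phased' or 'unphased'.

    Hypothetical helper for illustration; assumes tab-delimited columns.
    """
    if line.startswith("#"):
        return "header"
    fields = line.rstrip("\n").split("\t")
    # The sample column is the 10th column, i.e. index 9.
    if len(fields) < 10:
        return "too_few_columns"
    sample = fields[9]
    # Same test as in the traceback: phased heterozygous calls use "|".
    if sample.startswith("0|1") or sample.startswith("1|0"):
        return "phased"
    return "unphased"
```

A line with only 9 columns is reported as 'too_few_columns', which is the case that would otherwise surface as the IndexError on line_list[9].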

Could you send me some lines from your vcf file?

Thanks, Vahid

skambha6 commented 3 years ago

Hi Vahid,

Here are the first few lines of my VCF file:

chr1    3000444 .       T       A       .       .       .       chr1:3000444
chr1    3000608 .       T       G       .       .       .       chr1:3000608
chr1    3000748 .       T       TT      .       .       INDEL   chr1:3000750
chr1    3000748 .       T       TT      .       .       INDEL   chr1:3000751
chr1    3000748 .       T       TT      .       .       INDEL   chr1:3000752

This VCF file was generated by running MUMMER on two completely homozygous genomes (collaborative cross mice genomes) and converting the resulting .delta file to a VCF file. My sample is a cross between these two homozygous parent mice. Since I know my haplotypes a priori (because the parents are entirely homozygous), would it be sufficient to manually add in a 0|1 for each variant in the sample column?

Thank you for your help!

Best, Sandeep

vahidAK commented 3 years ago

Yes, you can manually add 1|0 or 0|1 as the 10th column of your file. For example, assign paternal SNVs as 0|1 and maternal SNVs as 1|0. Using this file, NanoMethPhase will give you HP1 (maternal) and HP2 (paternal).
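Adding the column by hand is tedious for a large file, so a small script along these lines may be easier. This is a rough sketch under several assumptions: the function name add_genotype_column and the file names in the usage comment are hypothetical, the input is tab-delimited with exactly 9 columns per variant line (as in the lines posted above), and every variant gets the same genotype.

```python
def add_genotype_column(lines, genotype="0|1"):
    """Yield VCF lines with a phased genotype appended as the 10th column.

    Hypothetical helper for illustration. Header lines (starting with "#")
    and blank lines are passed through unchanged.
    """
    if genotype not in ("0|1", "1|0"):
        raise ValueError("genotype must be '0|1' or '1|0'")
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Refuse to guess if the line does not have the expected 9 columns.
        if len(fields) != 9:
            raise ValueError(
                f"line {number}: expected 9 columns, found {len(fields)}"
            )
        yield "\t".join(fields + [genotype]) + "\n"


# Example usage (file names are placeholders):
# with open("input.vcf") as src, open("phased.vcf", "w") as dst:
#     dst.writelines(add_genotype_column(src, genotype="0|1"))
```

If some variants come from the other parent, they would need 1|0 instead, so a per-line choice of genotype would have to replace the single constant used here.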

skambha6 commented 3 years ago

Ok, great. Thank you!