treangenlab / methphaser

MethPhaser: methylation-based haplotype phasing of human genomes
https://www.nature.com/articles/s41467-024-49588-0
MIT License
42 stars 1 forks source link

Empty output vcf and bam #17

Open andosl opened 1 year ago

andosl commented 1 year ago

Hi,

I am using methphaser as a singularity image on an HPC. I have made the singularity image work with the test data, so shouldn't be any technical issues with the image.

I am doing adaptive sequencing of a 1,2 Mbp region on chromosome 14 (the IGH locus). I am assembling the reads to a personal reference of this region, and then I re-map the ROI-reads to this personal reference. This works very well for my purpose, but when I am adding methphaser to the pipeline it yields only empty vcf and bam files (only headers). Shouldn't it at least contain the information in the input bam and input vcf files? I am not getting any errors when running methphaser.

I would be extremely grateful if you could help me troubleshoot.

Below is truncated input bam file with frist two sequences, truncated input vcf file, gtf file and truncated fast-reference:

Truncated BAM file:

750c6e3f-9e8c-44b9-9e31-fc66c1b1fc2c 16 contig_1 1 60 1189S416M1I1717M1I1167M5D438M1D86M1I193M2D234M2D285M1I59M1D36M2D23M1I365M1D1397M1I793M1D8M1I4M1D198M2D12M1D1249M1D5M1I11M102S 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC&')+069:78A?DHBEDFJGD>C>G=:8400.---,'')255676;;:<5>?@BFBCIDCBABCDGEBBKCED@=CBECBDCF@@DCCFCA;@@C=?>CAD>>A@CB?BAEA?ECFB?@A>CA?>B@E<?>B?D-5640...0>?@>D:9)'&&%$$$$$$$$((),/8...-.03656=>>=>??ESGJMFMGIBEEGHEHKPIG<=6/.4..-,+))-2DGIFGD:'&)%$$$%&&'+++'%%&%&(+,2223-,,((&$$$%&''+++) NM:i:42 ms:i:17172 AS:i:17172 nn:i:0 tp:A:P cm:i:1027 s1:i:5636 s2:i:324 de:f:0.0039 SA:Z:contig_1,1578,-,2107M7D7888S,1,83; rl:i:4350 CO:Z:MM:Z:C+h?,4,2,4,1,0,0,1,0,1,0,0,0,1,13,13,59,20,42,26,28,24,11,90,4,28,0,38,11,34,1,32,14,23,48,27,17,68,2,5,0,4,126,0,0,3,1,3,1,4,2,9,0,2,2,4,0,5,2,13,0,1,0,0,5,0,8,2,0,1,0,1,0,8,2,0,1,0,1,0,4,2,0,1,0,8,2,0,1,0,1,0,4,2,0,1,0,8,2,0,1,0,1,0,3,2,0,1,0,9,2,0,1,0,1,0,4,2,0,1,0,9,2,0,0,0,9,2,0,0,0,9,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,3,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,2,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,0,0,4,2,0,0,0,1,2,15,0,0,0,0,0; ML:B:C,36,8,165,2,1,16,3,3,2,2,3,264,92,253,254,253,235,248,253,254,255,253,254,190,240,248,250,206,239,9,182,209,4,221,183,65,185,15,242,0,0,241,254,255,254,244,245,227,250,250,254,254,196,230,252,65,248,254,223,248,129,252,204,250,254,253,208,123,242,253,247,254,250,13,60,196,217,255,255,255,240,116,250,254,255,68,92,0,86,119,249,255,255,254,234,252,251,248,254,253,33,238,224,254,252,253,20,1,225,171,22,182,252,253,254,247,0,242,245,254,149,52,242,239,243,54,50,62,58,123,107,134,177,248,254,156,215,205,224,202,201,241 HP:i:1 PC:i:210 PS:i:2459 40fce1e4-96a0-4a00-8695-fa9590857b1d 16 contig_1 1 60 1363S1821M1D653M24D119M1I10M1D113M6I391M1D171M1I342M1I149M1D168M9M1D41M3D760M1D45M2D395M1D2713M2D1363M2D207M1D18M2D16M2D15M32S 0 0 AACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA&)135@AABBBF?G=A:EBDGCBDBHCA@D@EDA>DEDED>BCDEDDEBDCB>CBFFB>C@C???A@DB@@@>C>>>DADDA?CAEBA:AAD;=BC@CA?;C?CCA9CADB@BC@FB?;?>?>>=@>BBA?B=?=;;4AA46FIBAGMSLSIJKIIIJJSIISSJHSHKGEGJGSJLQKSHBBJ2?FFIIIGIISLQSIKJOIHSPSNHSFSKSSSEGJSBLEFA??=999;>SLKLSIILSJHSS??@BBBBAEDABHISMSSSLOLMLSGNHJQKJISLJSRNSLJHJIMSSOIOLFISMMMJIKILHEHFSPSKKSSSGGLOJJJLHIKSSJMIKMLMKJSJKHHSJPIJSLJND0<;=AB66666LSSIIIJEC6-,,,2015>DIE###$$$%&')&(.,+'%%%+356/..-.56((((()'))*,,,,+'&%&%$$%')))'&&&%$$ NM:i:320 ms:i:112130 AS:i:112134 nn:i:0 tp:A:P cm:i:9642 s1:i:52735 s2:i:2852 de:f:0.0037 rl:i:5199 CO:Z:MM:Z:C+h?,16,0,1,3,2,1,1,5,1,2,2,7,3,0,4,1,37,0,3,1488,2,0,0,0,9,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,2,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,0,3,2,0,0,0,1,0,4,2,0,0,0,1,0,4,2,0,0,0,1,2,15,0,0,0; ML:B:C,29,104,69,23,0,2,62,2,2,2,1,1,1,1,3,0,0,1,0,11,29,1,9,7,3,6,8,22,2,4,1,2,49,1,2,2,68,1,28,1,0,1,92,1,5,2,224,254,8,8,3,5,1,9,2,9,3,2,1,1,1,3,144,2,18,30,9,5,0,5,1,14,1,4,4,1,3,4,0,5,0,1,4,0,1,42,0,53,2,4,16,24,2,0,1,4,1,2,26,41,115,1,2,0,1,3,2,3,9,1,1,19,1,13,7,7,0,1,3,30,7,10,1,149,29,230,252,251,189,252,255,254,247,245,253,202,0,243,254,197,236,240,253,249,252,251,241,133,120,54,240,249,235,130,196,36,249,253,252,216,228,212,233,1 HP:i:2 PC:i:210 PS:i:2459

Truncated VCF file:

fileformat=VCFv4.2

FILTER=

source=Clair3

clair3_version=1.0.0

FILTER=

FILTER=

INFO=

INFO=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

contig=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE

contig_1 2459 . C A 10.87 PASS F GT:GQ:DP:AF:PS 0|1:10:24:0.2917:2459 contig_1 2465 . C A 10.99 PASS F GT:GQ:DP:AF:PS 0|1:10:24:0.2917:2459 contig_1 2475 . TACCCCAACCCCAACCCCAACCCCA T 9.78 PASS P GT:GQ:DP:AF 0/1:9:24:0.2083 contig_1 2588 . C T 9.03 PASS P GT:GQ:DP:AF:PS 0|1:9:24:0.375:2459 contig_1 2594 . C T 14.76 PASS F GT:GQ:DP:AF:PS 0|1:14:24:0.375:2459 contig_1 2600 . C T 16.54 PASS F GT:GQ:DP:AF:PS 0|1:16:24:0.4167:2459 contig_1 2618 . T TA 15.35 PASS F GT:GQ:DP:AF 0/1:15:24:0.375 contig_1 2628 . TA T 14.22 PASS F GT:GQ:DP:AF 0/1:14:24:0.3333 contig_1 2804 . G T 19.19 PASS F GT:GQ:DP:AF:PS 0|1:19:24:0.375:2459 contig_1 2894 . G T 23.33 PASS F GT:GQ:DP:AF:PS 0|1:23:24:0.3333:2459 contig_1 3300 . AACCCT A 17.11 PASS F GT:GQ:DP:AF 0/1:17:24:0.5833 contig_1 3305 . T TA 8.56 PASS F GT:GQ:DP:AF 0/1:8:24:0.375 contig_1 66619 . AT A 0 LowQual F GT:GQ:DP:AF 1/1:0:23:0.1739

GTF file:

contig_1 Phasing exon 2459 400591 . + . gene_id "2459"; transcript_id "2459.1"; contig_1 Phasing exon 641089 1001211 . + . gene_id "641089"; transcript_id "641089.1"; contig_1 Phasing exon 1230613 1244186 . + . gene_id "1230613"; transcript_id "1230613.1"; contig_1 Phasing exon 1582658 1670184 . + . gene_id "1582658"; transcript_id "1582658.1"; contig_1 Phasing exon 1858102 1886017 . + . gene_id "1858102"; transcript_id "1858102.1"; contig_1 Phasing exon 2054129 2503491 . + . gene_id "2054129"; transcript_id "2054129.1";

Truncated FASTA reference:

contig_1 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC

Thanks a lot in advance!!

Best, Andreas

Fu-Yilei commented 1 year ago

Hi Andreas,

I would imagine this is an issue related to the read length or incomplete reference genome. I have only run MethPhaser before with the standard human genome reference like GRCh37 or GRCh38, which is not like the FASTA reference that you provided.

andosl commented 1 year ago

Hi Fu-Yilei,

Thanks a lot for speedy answer!

I reformatted all the input files replacing "contig_1" with "chr13", and I reindexed the bam, fasta, and vcf files. This seemingly did the trick (?!). I have to evaluate the output and will let you know.

Regards, Andreas

Fu-Yilei commented 1 year ago

Thanks, if this is the case there might be some chromosome name reading bug in MethPhaser, I will look into it and make it workable when the chromosome name is something other than "chrxx"

andosl commented 1 year ago

Hi, It seems it is the "contig_1" in these files that makes MethPhaser struggle. This is the only change that I can do to the files to make them work. But it is an easy trick, not a big deal.

Would be great to discuss some of the outputs with you. When I open the haplotagged bam in IGV, the methylation pattern seems quite similar on the two haplotypes and it is hard to eyeball why MethPhaser has chosen one haplotype from the other.

Kind regards, Andreas