lh3 / hickit

TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C

Can hickit deal with nanopore long reads and generate phasing results properly? #42

Open Wong718 opened 3 weeks ago

Wong718 commented 3 weeks ago

Hello Professor Li. hickit is a wonderful tool for 3D genome analysis. I have recently been processing scNanoHiC data with bwa-sw and hickit, and I find the haplotype phasing results confusing. Can hickit handle nanopore long-read data as well as it handles NGS data? The SAM file generated by bwa-sw looks as follows: a single read may map to multiple positions, producing several records with the same read ID.

d3d59d85-f117-406d-93e3-4901250df094    0       chr10   118628182       42      344S305M441S    *       0       0       GTTCAGTTACGTATTGCTAGCTCTTTCCCTACACGACGCTCTTCCGATCTGAGATTAAAAAAAAAAAAAAAAACATTTTAACCTAGGTGGAAGTGGAGGGAGGAGGGGACGAAGGAGAGAATAAGAAATTTCTGGAGCTTTTAACAAGGGGAGTGTGAGGGTAATCCAGCAATTCAGAAGCCGGGCGCGGTGGCTCATGCCTATAATCCCAGCACTTATTGGGAGGCCGAGGCAGGTGGATAGCTTGAGCCCAAGAGTTCGAGACCACCCTGGCCAACATAGTGAGAACCCCCCATCTCTATTTAAACAACAACAAAAAAGAAATTTGAGAACAACTGCCCCCATAGCTGGGCATGGTGGCACACGCCTGTAATCCCAGCTACTCAGAGGGCTGAGGCAGGAGAATCACTTGGACCCAGGAGGCAGAGGTTGTAGCGAGCCAAGATCATACCACTGCATGCCAGCCTGGGAAGGAGAGTGAGATTCCATCTTGGGGAGGGGGGAGGAACCTTGCAAGGTAGATAACAGTAGCCCCTATTTGGAAGGTGGCACAGCTGGGGTCCAGATAGATGAAGTAACTTGCCCAAGGTCACACAGTTAATAAATGGCAAAGCTTGGATTGGAGCCCACATCTTTTGATTATACCACATGAGCATGGCTTTAGACACGCTGGTGCAAGGATCTGTGTGACCTCTAATCTCACAAGAGTCCTTGCTCAGACCCAGAAAGGGCTTCTCTACAGTATAGGAGAGGAATACCTCCAGGTTGCATGTGGGCAGCTGCCAACGTGAATGGCTTGGTCCTCAGCCTATAGAGCTTAAAGGTATTTTGTATAAGCCTAGTTTCCTCCTGTATAAAAAGGGATAAAACATGAACCCTATGTGGTTGTTGCAGGAGGATGTGAAAGTGCTGCCCCAGTACTTGGTATTAAGAATATCAATAAATCATTAGGACTATGATCTATTTTTAAACAATTTTCAAACAAAGTATTACCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACAAAGACACCAACAACTTTCTTATCTCGTATGCCGTCTTCTGCTTGAGCAATACGTGGG      
(+,./.--,-+++,/'(&$%&'&(/215982110+*)())*10,-'&&&1%%%(--126;<<?CBAA>;::;;943348644520-,-00000012255/),/-255552+++-5532.,,)&%%&'')...034543423422100122531100.--.2100/0017754240///.22377787656:<@?>>>6666@6666;>=@@?@9940*((*159;?@64448?==<87556.--.8;<?>;;;544499////79993333>:;;;?>@532144/0..26>=;87789<211//0331201101169:8511442213211.//-///34>>91//268:;A=;;;=?>@::954335A>><<==BBAEA@@?A:84//2/:<>>?@@???@>@=><4210,++,10:<>C?>99::>;;:***'***<@?>=<<<710008<;;;;33348;<===<841////0324449----65656653410148<875502.--.135344100/.,,,22....45873224577645443455655566452244:5210033232100011014575554445445410/0.//043335222255111///012457365556200/20/.---/4542101221225543210221...065433457754546524334354210000220.-.///0-2..433233358764444655566543323/-++,,12122.20//005555510/1100023564444776221/0//00335443322225444320,,++002432232000148620//./11332123498779621.-./543212/-..2/--011233743334411103333323442112221112457666587600//23310/0/4//6322223322222200./0123222111233348975543104598434433430/..//02410/.+++,12123241.-./4534347761//0667655433453((()0.--4343//./1,,,,/00///1000241/--./01000''%%%      AS:i:299        XS:i:58 XF:i:3  XE:i:1  NM:i:1
d3d59d85-f117-406d-93e3-4901250df094    0       chr2    164815549       39      828S167M95S     *       0       0       GTTCAGTTACGTATTGCTAGCTCTTTCCCTACACGACGCTCTTCCGATCTGAGATTAAAAAAAAAAAAAAAAACATTTTAACCTAGGTGGAAGTGGAGGGAGGAGGGGACGAAGGAGAGAATAAGAAATTTCTGGAGCTTTTAACAAGGGGAGTGTGAGGGTAATCCAGCAATTCAGAAGCCGGGCGCGGTGGCTCATGCCTATAATCCCAGCACTTATTGGGAGGCCGAGGCAGGTGGATAGCTTGAGCCCAAGAGTTCGAGACCACCCTGGCCAACATAGTGAGAACCCCCCATCTCTATTTAAACAACAACAAAAAAGAAATTTGAGAACAACTGCCCCCATAGCTGGGCATGGTGGCACACGCCTGTAATCCCAGCTACTCAGAGGGCTGAGGCAGGAGAATCACTTGGACCCAGGAGGCAGAGGTTGTAGCGAGCCAAGATCATACCACTGCATGCCAGCCTGGGAAGGAGAGTGAGATTCCATCTTGGGGAGGGGGGAGGAACCTTGCAAGGTAGATAACAGTAGCCCCTATTTGGAAGGTGGCACAGCTGGGGTCCAGATAGATGAAGTAACTTGCCCAAGGTCACACAGTTAATAAATGGCAAAGCTTGGATTGGAGCCCACATCTTTTGATTATACCACATGAGCATGGCTTTAGACACGCTGGTGCAAGGATCTGTGTGACCTCTAATCTCACAAGAGTCCTTGCTCAGACCCAGAAAGGGCTTCTCTACAGTATAGGAGAGGAATACCTCCAGGTTGCATGTGGGCAGCTGCCAACGTGAATGGCTTGGTCCTCAGCCTATAGAGCTTAAAGGTATTTTGTATAAGCCTAGTTTCCTCCTGTATAAAAAGGGATAAAACATGAACCCTATGTGGTTGTTGCAGGAGGATGTGAAAGTGCTGCCCCAGTACTTGGTATTAAGAATATCAATAAATCATTAGGACTATGATCTATTTTTAAACAATTTTCAAACAAAGTATTACCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACAAAGACACCAACAACTTTCTTATCTCGTATGCCGTCTTCTGCTTGAGCAATACGTGGG      
(+,./.--,-+++,/'(&$%&'&(/215982110+*)())*10,-'&&&1%%%(--126;<<?CBAA>;::;;943348644520-,-00000012255/),/-255552+++-5532.,,)&%%&'')...034543423422100122531100.--.2100/0017754240///.22377787656:<@?>>>6666@6666;>=@@?@9940*((*159;?@64448?==<87556.--.8;<?>;;;544499////79993333>:;;;?>@532144/0..26>=;87789<211//0331201101169:8511442213211.//-///34>>91//268:;A=;;;=?>@::954335A>><<==BBAEA@@?A:84//2/:<>>?@@???@>@=><4210,++,10:<>C?>99::>;;:***'***<@?>=<<<710008<;;;;33348;<===<841////0324449----65656653410148<875502.--.135344100/.,,,22....45873224577645443455655566452244:5210033232100011014575554445445410/0.//043335222255111///012457365556200/20/.---/4542101221225543210221...065433457754546524334354210000220.-.///0-2..433233358764444655566543323/-++,,12122.20//005555510/1100023564444776221/0//00335443322225444320,,++002432232000148620//./11332123498779621.-./543212/-..2/--011233743334411103333323442112221112457666587600//23310/0/4//6322223322222200./0123222111233348975543104598434433430/..//02410/.+++,12123241.-./4534347761//0667655433453((()0.--4343//./1,,,,/00///1000241/--./01000''%%%      AS:i:167        XS:i:0  XF:i:3  XE:i:1  NM:i:0
d3d59d85-f117-406d-93e3-4901250df094    0       chr11   61056073        39      177S167M746S    *       0       0       GTTCAGTTACGTATTGCTAGCTCTTTCCCTACACGACGCTCTTCCGATCTGAGATTAAAAAAAAAAAAAAAAACATTTTAACCTAGGTGGAAGTGGAGGGAGGAGGGGACGAAGGAGAGAATAAGAAATTTCTGGAGCTTTTAACAAGGGGAGTGTGAGGGTAATCCAGCAATTCAGAAGCCGGGCGCGGTGGCTCATGCCTATAATCCCAGCACTTATTGGGAGGCCGAGGCAGGTGGATAGCTTGAGCCCAAGAGTTCGAGACCACCCTGGCCAACATAGTGAGAACCCCCCATCTCTATTTAAACAACAACAAAAAAGAAATTTGAGAACAACTGCCCCCATAGCTGGGCATGGTGGCACACGCCTGTAATCCCAGCTACTCAGAGGGCTGAGGCAGGAGAATCACTTGGACCCAGGAGGCAGAGGTTGTAGCGAGCCAAGATCATACCACTGCATGCCAGCCTGGGAAGGAGAGTGAGATTCCATCTTGGGGAGGGGGGAGGAACCTTGCAAGGTAGATAACAGTAGCCCCTATTTGGAAGGTGGCACAGCTGGGGTCCAGATAGATGAAGTAACTTGCCCAAGGTCACACAGTTAATAAATGGCAAAGCTTGGATTGGAGCCCACATCTTTTGATTATACCACATGAGCATGGCTTTAGACACGCTGGTGCAAGGATCTGTGTGACCTCTAATCTCACAAGAGTCCTTGCTCAGACCCAGAAAGGGCTTCTCTACAGTATAGGAGAGGAATACCTCCAGGTTGCATGTGGGCAGCTGCCAACGTGAATGGCTTGGTCCTCAGCCTATAGAGCTTAAAGGTATTTTGTATAAGCCTAGTTTCCTCCTGTATAAAAAGGGATAAAACATGAACCCTATGTGGTTGTTGCAGGAGGATGTGAAAGTGCTGCCCCAGTACTTGGTATTAAGAATATCAATAAATCATTAGGACTATGATCTATTTTTAAACAATTTTCAAACAAAGTATTACCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACAAAGACACCAACAACTTTCTTATCTCGTATGCCGTCTTCTGCTTGAGCAATACGTGGG      
(+,./.--,-+++,/'(&$%&'&(/215982110+*)())*10,-'&&&1%%%(--126;<<?CBAA>;::;;943348644520-,-00000012255/),/-255552+++-5532.,,)&%%&'')...034543423422100122531100.--.2100/0017754240///.22377787656:<@?>>>6666@6666;>=@@?@9940*((*159;?@64448?==<87556.--.8;<?>;;;544499////79993333>:;;;?>@532144/0..26>=;87789<211//0331201101169:8511442213211.//-///34>>91//268:;A=;;;=?>@::954335A>><<==BBAEA@@?A:84//2/:<>>?@@???@>@=><4210,++,10:<>C?>99::>;;:***'***<@?>=<<<710008<;;;;33348;<===<841////0324449----65656653410148<875502.--.135344100/.,,,22....45873224577645443455655566452244:5210033232100011014575554445445410/0.//043335222255111///012457365556200/20/.---/4542101221225543210221...065433457754546524334354210000220.-.///0-2..433233358764444655566543323/-++,,12122.20//005555510/1100023564444776221/0//00335443322225444320,,++002432232000148620//./11332123498779621.-./543212/-..2/--011233743334411103333323442112221112457666587600//23310/0/4//6322223322222200./0123222111233348975543104598434433430/..//02410/.+++,12123241.-./4534347761//0667655433453((()0.--4343//./1,,,,/00///1000241/--./01000''%%%      AS:i:161        XS:i:0  XF:i:3  XE:i:1  NM:i:1
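
To make the layout of these three records easier to see, here is a minimal sketch (plain Python, not part of hickit) that converts each CIGAR string into the interval it covers on the read. The function name `query_interval` is my own, and the leading-clip logic assumes forward-strand alignments (FLAG 0, as in all three records above); for reverse-strand records the clips would have to be swapped.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def query_interval(cigar):
    """Return (start, end, read_length) of the aligned part, in read coordinates.

    Assumes a forward-strand alignment, so the leading clip is the read offset.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Sum the leading soft/hard clips to get the offset within the read.
    start = 0
    for n, op in ops:
        if op not in "SH":
            break
        start += n
    # M, I, = and X consume read bases inside the alignment.
    aligned = sum(n for n, op in ops if op in "MI=X")
    # The full read length also includes the clipped bases.
    total = sum(n for n, op in ops if op in "MI=XSH")
    return start, start + aligned, total

# (chromosome, position, CIGAR) copied from the three SAM records above.
records = [
    ("chr10", 118628182, "344S305M441S"),
    ("chr2", 164815549, "828S167M95S"),
    ("chr11", 61056073, "177S167M746S"),
]

# Sort the alignments by where they start on the read.
segments = sorted(query_interval(c) + (chrom, pos) for chrom, pos, c in records)
for qs, qe, total, chrom, pos in segments:
    print(f"read[{qs}:{qe}] of {total} -> {chrom}:{pos}")
```

According to the CIGAR strings, the read is 1090 bases long and its segments map, in read order, to chr11 (177-344), chr10 (344-649) and chr2 (828-995).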

Then I applied the sam2seg function to generate the .seg file shown below, providing the corresponding .vcf file with the -v parameter.

0a0a0f39-8470-464b-aea8-ae41b7967128    chrX!57850012!57850118!+!.!32!1 chrX!118125267!118125631!+!.!47!1       chrX!118200629!118201192!-!.!49!1
0a0a6e33-7628-475a-a306-12ba28ca555d    chr15!94570904!94571063!-!.!36!1        chr15!97971692!97972740!-!1!54!1
0a0a7a06-e53d-4c7a-8675-ff8bf1b74ec3    chr7!7146551!7147017!-!.!48!1   chr10!89043022!89043154!-!1!35!1        chr7!7143778!7144116!+!.!42!1
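
For reference, this is roughly how I read these lines back in. It is only a sketch: the field meanings (chromosome, start, end, strand, phase, with "." meaning unphased) are my assumption from looking at the output above, not taken from the hickit documentation, and the trailing fields are kept uninterpreted. The names `Segment` and `parse_seg_line` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    phase: str              # assumed: "." = unphased, otherwise a haplotype label
    extra: Tuple[str, ...]  # remaining "!"-separated fields, left uninterpreted

def parse_seg_line(line: str) -> Tuple[str, List[Segment]]:
    """Split one .seg line into the read name and its segments."""
    name, *fields = line.split()
    segs = []
    for field in fields:
        chrom, start, end, strand, phase, *extra = field.split("!")
        segs.append(Segment(chrom, int(start), int(end), strand, phase, tuple(extra)))
    return name, segs

# Third line of the .seg excerpt above.
line = ("0a0a7a06-e53d-4c7a-8675-ff8bf1b74ec3    chr7!7146551!7147017!-!.!48!1   "
        "chr10!89043022!89043154!-!1!35!1        chr7!7143778!7144116!+!.!42!1")
name, segs = parse_seg_line(line)
phased = [s for s in segs if s.phase != "."]
print(name, len(segs), "segments,", len(phased), "phased")
```

Under that reading, only one of the three segments on this read received a phase.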

However, after generating the .pairs file with a modified seg2pairs function (I changed it to emit multiple contacts per read), I observed more trans-parental contacts within the same chromosome than expected, and I want to figure out why. In general, I would like to know whether the hickit::sam2seg function can handle .sam files generated by bwa-sw and make correct phasing decisions, and what sam2seg does when a long mapped read carries SNPs from opposite phases.
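
To be concrete about what I mean by multi-contact and trans-parental contacts, here is a simplified sketch. It is not hickit's seg2pairs and not my actual modified version; it assumes an all-pairs expansion of the segments on one read, and the function name `classify_pairs`, the category labels, and the toy coordinates and phases are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def classify_pairs(segments):
    """Expand one read into all segment pairs and classify each pair.

    `segments` is a list of (chrom, pos, phase) tuples; phase "." means unphased.
    """
    counts = Counter()
    for (c1, _, p1), (c2, _, p2) in combinations(segments, 2):
        if c1 != c2:
            counts["inter_chrom"] += 1
        elif "." in (p1, p2):
            counts["intra_unphased"] += 1
        elif p1 == p2:
            counts["intra_same_hap"] += 1
        else:
            counts["intra_opposite_hap"] += 1  # trans-parental, same chromosome
    return counts

# Toy read with four segments (illustrative values, not real data).
read = [
    ("chr7", 7146551, "0"),
    ("chr7", 7143778, "1"),
    ("chr7", 7150000, "0"),
    ("chr10", 89043022, "1"),
]
counts = classify_pairs(read)
print(dict(counts))
```

With an expansion like this, one segment carrying the opposite phase is paired with every other segment from the same chromosome on that read, so a single phasing disagreement might be counted several times.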

Wong718 commented 3 weeks ago

To further illustrate the problem, I attach the haplotype contact map of the scNanoHiC data below. There is a clear diagonal between the PAT and MAT copies of the same chromosome, which is not consistent with our knowledge of contacts between alleles. [image]