vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License
208 stars 36 forks source link

options when combining Hi-C and Pacbio data (update recipe) #43

Closed rcosentino closed 6 years ago

rcosentino commented 6 years ago

Hi,

First of all, thank you very much for this very flexible tool to phase variants! I am trying to use HapCUT2 to phase variants in a diploid genome combining Hi-C and Pacbio data. I am not sure about the options that I would need to use in extractHAIRS for the individual datasets, and in HAPCUT2 after combining the fragment files.

in extractHAIRS I am using obviously --hic 1 and --pacbio 1 for the Hi-C and PacBio data, respectively. But, what about --new_format ?? should is set --new_format to one? in both? or only in the PacBio data? Is this "--new_format" to accommodate to the change of format made by "--hic 1"? I think I read this in a previous version of the tool but I cannot find the information now.

after combining the fragments, what options should I use in HAPCUT2 ?

--hic 1 --long_reads 1 --new_format 1 ? if I use "--error_analysis_mode 1" as it is suggested for pacbio data I get a warning saying it Is not suitable for Hi-C data

Could you please update the recipe? Many of the new options were not available at the time you generated the recipes.

Cheers,

pjedge commented 6 years ago

I've updated the HiC+longread recipe. The only thing I did was add the new --pacbio option under the long read fragment extraction.

--new_format should only be used for non-hic reads to make the file format compatible with HiC reads. So it should be used for only pacbio in this case.

After combining the fragments you should use the --hic 1 option in hapcut, because this initiates the more complex Hi-C specific algorithm, but it knows which reads are longreads and which reads are Hi-C based on information written in the "new format". So it will model the HiC reads as HiC and the longreads as "normal" reads.

Don't use "error_analysis_mode" if you're using HiC reads at all. The primary thing it does is calculate switch error scores, which are useful for cases where you expect mostly independent point-switch errors (i.e. haplotype blocks assembled from long reads, having "weak points" of low heterozygousity, low read coverage, or high error) but it will be not useful or maybe very misleading in cases where the switch error profile is complex (HiC, or HiC with longreads)

Sorry for the confusion here.

EDIT: also, I'll note that if you used the old pipepline before I made this tweak (not using --pacbio) I think the resulting haplotypes would be highly complete and highly accurate in general, unless coverage of the reads is very low. Adding this option reduces the number of allelotyping errors in the PacBio reads from indels, and will make the overall results even more accurate.

rcosentino commented 6 years ago

Thank you very much!