twolinin / longphase

GNU General Public License v3.0

Illumina-corrected ONT #10

Closed GuillaumeHolley closed 2 years ago

GuillaumeHolley commented 2 years ago

Hi,

First of all, thank you for this tool and the great work that has been done. I have started using LongPhase recently in a project of mine and I have been happy with the results and the performance. Hopefully, I can provide more accurate results soon.

This project, however, involves ONT reads (N50 of about 20-25 kb) that I have corrected with Illumina reads using Ratatosk. The measured error rate is about 1.4-1.8%, although I expect some phase switches in the reads because of the correction (compared to PacBio HiFi). So far I have used the -pb setting when running LongPhase. Do you think the -ont setting would be more suitable? Or one of these two with some non-default parameters?

My other question is about the haplotag command: why is the phased SV file not used as input?

Thank you, Guillaume

ythuang0522 commented 2 years ago

Hi @GuillaumeHolley, the -ont option applies heuristics for reducing ONT-specific errors (i.e., homopolymer and modification errors). There are a few parameters balancing phasing accuracy and contiguity, but I don't know what would be best for Illumina-corrected long reads. If I were you, I would also go with -pb, as your read quality is close to Q20 and there are fewer ONT-specific errors after Ratatosk polishing.

The haplotag command was implemented in response to a request to embed LongPhase into the SNP-calling pipeline (Clair3 in Epi2me) and therefore didn't take SVs into account. It simply assigns each read to one of the two haplotypes via a majority vote over all SNPs on that read. As the number of SVs is much smaller than that of SNPs, I expect the tagging results won't differ much. Having said that, we indeed didn't consider the tagging scenario when co-phasing SNPs and SVs. We will add it in the next release.
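
To make the majority vote described above concrete, here is a minimal Python sketch. It is not LongPhase's actual code: the function name `assign_haplotype`, the input representation (one haplotype observation, 1 or 2, per phased SNP covered by the read), and the handling of ties (the read stays untagged) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def assign_haplotype(observations: Sequence[int]) -> Optional[int]:
    """Assign a read to haplotype 1 or 2 by majority vote.

    `observations` holds, for each phased heterozygous SNP covered by the
    read, the haplotype (1 or 2) whose allele the read carries.
    Returns None when there is no evidence or the vote is tied
    (assumption: such reads are left untagged).
    """
    votes = Counter(observations)  # missing keys count as 0
    if votes[1] > votes[2]:
        return 1
    if votes[2] > votes[1]:
        return 2
    return None
```

For example, a read supporting haplotype 1 at three SNPs and haplotype 2 at one SNP (`assign_haplotype([1, 1, 2, 1])`) is assigned to haplotype 1, while a read covering no phased SNP (`assign_haplotype([])`) stays untagged.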

GuillaumeHolley commented 2 years ago

Hi @ythuang0522,

Thank you for the feedback. I think the haplotag command makes a lot of sense because many implementations work with haplotagged BAM files (as they must separate reads into haplotypes). I need the haplotagged BAM file myself, so I appreciate that feature very much. You are right that for most regions, haplotagging with phased SVs in addition to phased SNPs is not going to make a significant difference. That being said, I can think of a few difficult regions where many SNPs are homozygous and SVs might actually make a difference, especially for ONT reads, which have a higher error rate and therefore a higher chance of some SNPs being phased incorrectly.

ythuang0522 commented 2 years ago

Hi @GuillaumeHolley

We released a new version (v1.1) which can tag each read according to co-phased SNPs (-s) and SVs (--sv-file):

    longphase haplotag -s phased_snp.vcf -b alignment.bam -t 32 --sv-file phased_sv.vcf

As you expected, we spotted a few regions lacking SNPs but containing SVs; reads there will be tagged solely based on SVs. Thanks for the suggestion.

Yao-Ting

GuillaumeHolley commented 2 years ago

Hi @ythuang0522,

Thanks for the update, will try it asap! Just to make sure I understand everything correctly: with this new update, do you use the SVs for haplotagging only when the region lacks SNPs (otherwise SVs are not considered/used)?

ythuang0522 commented 2 years ago

No. Both are considered during the haplotype assignment of each read (i.e., the majority vote), regardless of the region. We just noticed that a few reads, which were previously untagged because they lacked SNPs, are now tagged thanks to the SVs.

GuillaumeHolley commented 2 years ago

Thanks. One last question out of curiosity: since it is majority voting, does an SV of length, say, 10 kb have the same vote as a SNP (so no weighting)?

ythuang0522 commented 2 years ago

Yes. SNP and SV are equally weighted. We don't know how to weigh SVs (with different types and sizes) properly.
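
As a rough illustration of equal weighting, the sketch below lets every phased SNP and every phased SV on a read cast exactly one vote. This is an assumption-laden toy model, not LongPhase's implementation: `tag_read`, its two arguments, and the tie rule (no tag) are hypothetical names and choices.

```python
from typing import Iterable, Optional


def tag_read(snp_votes: Iterable[int], sv_votes: Iterable[int]) -> Optional[int]:
    """Tag a read with haplotype 1 or 2 from SNP and SV evidence.

    Each element of `snp_votes` / `sv_votes` is the haplotype (1 or 2)
    supported by one phased variant on the read. SNPs and SVs count the
    same: one vote each, whatever the SV type or size.
    Returns None if the vote is tied or there is no evidence (assumption).
    """
    score = 0  # positive favours haplotype 1, negative favours haplotype 2
    for hap in list(snp_votes) + list(sv_votes):
        if hap == 1:
            score += 1
        elif hap == 2:
            score -= 1
    if score > 0:
        return 1
    if score < 0:
        return 2
    return None
```

Under this model, a read with no phased SNP but one SV on haplotype 2 (`tag_read([], [2])`) is tagged as haplotype 2, which matches the previously untagged reads mentioned earlier in the thread, and a read with two SNPs on haplotype 1 and one SV on haplotype 2 (`tag_read([1, 1], [2])`) is still tagged as haplotype 1.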

GuillaumeHolley commented 2 years ago

Hi @ythuang0522,

Sorry for the delay. Equal weighting makes sense as a first approach. Maybe in the future, a naive extension would be to weight each SV by its length, which I think would be fine in most cases. Anyway, I'll close the issue now. Thank you for your work!

Guillaume
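
The SV length weight suggested in the last comment could look like the following sketch. Nothing here is implemented in LongPhase as far as this thread shows: the function `tag_read_weighted`, the `(haplotype, length_bp)` input pairs, and the weight itself (1 for a SNP, the length in bp for an SV) are hypothetical and simply spell out the "naive" choice.

```python
from typing import Iterable, Optional, Tuple


def tag_read_weighted(variants: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Tag a read using length-weighted votes.

    `variants` yields (haplotype, length_bp) pairs, one per phased variant
    on the read, with haplotype in {1, 2}. A SNP is given length 1; an SV
    is given its length in bp (hypothetical weighting scheme).
    Returns None if the weighted vote is tied or there is no evidence.
    """
    totals = {1: 0, 2: 0}
    for hap, length_bp in variants:
        if hap in totals:
            totals[hap] += max(1, length_bp)  # every variant weighs at least 1
    if totals[1] > totals[2]:
        return 1
    if totals[2] > totals[1]:
        return 2
    return None
```

With this weighting, three SNPs on haplotype 1 and one 10 kb SV on haplotype 2 (`tag_read_weighted([(1, 1), (1, 1), (1, 1), (2, 10_000)])`) yield haplotype 2, so a single long SV outvotes the SNPs on the read. A damped weight, such as the logarithm of the length, would be one way to soften that effect.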