lh3 / yak

Yet another k-mer analyzer
MIT License

Updated output documentation for yak triobin #1

Open williamrowell opened 4 years ago

williamrowell commented 4 years ago

https://github.com/lh3/yak/blob/6de3affe265bb0508ed6d78b36f121bdaf796f71/triobin.c#L176

Do you have any updated documentation for the output of yak triobin? I'm looking at the output of version r43, which has 13 columns as opposed to the 10 columns documented in the help text. I'm especially trying to understand column 2, which has values m, p, a, and 0.

lh3 commented 4 years ago
williamrowell commented 4 years ago

Thanks for the quick answer! That's what I guessed, but wanted to make sure before proceeding. Thanks for the tool!

lh3 commented 4 years ago

Forgot to say that you can ignore most of the other columns. Those are mostly for debugging purposes.

zeeev commented 4 years ago

Dear @lh3,

We are testing out trio binning and it looks like our binned assemblies are more fragmented than the non-binned assemblies. Both haplotypes have good coverage. Is there a way to adjust the trio binning step to be more specific, i.e., to require more p/m markers?

What is the meaning of these options?:

  -c INT     min occurrence [2]
  -d INT     mid occurrence [5]

Do you have any suggestions for improving binning at the counting stage?

lh3 commented 4 years ago

By default, if a k-mer occurs 5 times or more in the mother but at most twice in the father, the k-mer is considered to be a mother-specific k-mer. The label in the 2nd column is determined from the remaining columns using complex rules coded in the function tb_classify(). You can't tune these rules on the command line.
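The threshold rule described above can be sketched as a small awk filter. This is only an illustration under stated assumptions: the input format (one k-mer per line with tab-separated mother and father counts) and the function name `classify_kmers` are hypothetical, yak itself works on its own count tables rather than a text table, and its exact boundary handling may differ from this reading of the comment.

```shell
# Hypothetical input: "kmer<TAB>mother_count<TAB>father_count", one k-mer per line.
# This is NOT yak's internal format; it only illustrates the threshold rule.
classify_kmers() {
    # $1 = min occurrence (assumed to map to -c, default 2)
    # $2 = mid occurrence (assumed to map to -d, default 5)
    awk -F'\t' -v min="${1:-2}" -v mid="${2:-5}" '
        {
            label = "-"                                    # not parent-specific
            if ($2 >= mid && $3 <= min) label = "m"        # common in mother, rare in father
            else if ($3 >= mid && $2 <= min) label = "p"   # common in father, rare in mother
            print $1 "\t" label
        }'
}
```

For example, `printf 'AAA\t7\t1\n' | classify_kmers` labels the k-mer as mother-specific, while a k-mer seen 6 times in the mother and 3 times in the father is left unlabeled because it is not rare enough in the father.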

It is hard to get perfect trio binning. Hifiasm effectively uses the HiFi assembly graph to fix binning errors. Without doing that, hifiasm would only get ~10Mb N50, comparable to trio HiCanu.

lh3 commented 4 years ago

For a simple way to increase specificity:

awk '$3>=21&&$4<=2&&$2=="p"' triobin.txt > paternal.txt
awk '$4>=21&&$3<=2&&$2=="m"' triobin.txt > maternal.txt
# the rest are ambiguous
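A minimal sketch of the same filter as a single pass that also writes the ambiguous reads to their own file. The function name `split_triobin`, the output file names, and the tunable thresholds are illustrative assumptions. The reading of column 3 as the paternal marker count and column 4 as the maternal marker count is inferred from the two commands above, not from yak's documentation.

```shell
# split_triobin FILE PREFIX [MIN_SUPPORT] [MAX_OTHER]
# Writes PREFIX.paternal.txt, PREFIX.maternal.txt and PREFIX.ambiguous.txt.
# A file is only created if at least one line falls into that category.
split_triobin() {
    awk -v prefix="$2" -v hi="${3:-21}" -v lo="${4:-2}" '
        # labeled paternal, with strong paternal support and few maternal markers
        $2 == "p" && $3 >= hi && $4 <= lo { print > (prefix ".paternal.txt"); next }
        # labeled maternal, with strong maternal support and few paternal markers
        $2 == "m" && $4 >= hi && $3 <= lo { print > (prefix ".maternal.txt"); next }
        # everything else is treated as ambiguous
        { print > (prefix ".ambiguous.txt") }
    ' "$1"
}
```

Calling `split_triobin triobin.txt binned` uses the same cutoffs as the commands above; passing a larger third argument, such as `split_triobin triobin.txt binned 50`, demands more supporting markers and moves more reads into the ambiguous file.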

zeeev commented 4 years ago

Hi @lh3,

Thank you for sharing these ideas. Just confirming: do you think trio binning isn't as effective as just assembling and phasing a single genome? That has been my experience, at least using yak and hifiasm/IPA.

lh3 commented 4 years ago

Yes, when HiFi phasing and trio phasing are inconsistent, HiFi phasing is often the correct one.

lh3 commented 4 years ago

In the early days, we tried HiCanu trio binning. I manually inspected many differences between HiCanu and yak binning. I think yak is generally more accurate. Nonetheless, the assembly with HiCanu binning is similar to the assembly with yak binning.

lh3 commented 4 years ago

Also, hifiasm applies trio binning to error-corrected reads. This noticeably improves the binning accuracy: there are far fewer inconsistencies between trio phasing and HiFi read phasing.