lh3 / fermikit

De novo assembly based variant calling pipeline for Illumina short reads
Other
107 stars 22 forks source link

using fermikit to generate a reference graph #13

Open ekg opened 8 years ago

ekg commented 8 years ago

We are working on generating sequence graphs for use as the reference system in vg. I'm completing a first pass at bluntifying the overlap graph that can be pulled out of fermikit's intermediate mag files, which is a necessary step for our alignment and indexing algorithms. We have done a lot of testing on bidirectional string graphs, and so the use of these should be straightforward. It is just an issue of cleaning up the bluntification algorithm and understanding the best way to make the assembly graphs.

The ideal pattern would be to generate a pan genomic reference by assembling all the sequence data from a set of samples into a graph. Reading the Fermi paper it seems that some of the steps in that assembly process might not perform well in a multi sample assembly. Specifically the bubble popping algorithm in fermi has a diploid assumption. Has this been resolved in fermi2 and fermikit? Is it configurable, and if not would of be an easy change for me to make?

I am curious if there might be other contradictory aspects to this approach. Any advice would be welcome.

lh3 commented 8 years ago

The only place where fermikit makes a diploid assumption is at fermi2 simplify -r.15. By default, if the read depth on a path in the bubble is below 15% of the total depth in the bubble, the path will be trimmed off. Note that not applying -S also assumes diploid, but fermikit always calls fermi2 simplify -S.

However, to build a population graph, it may be easier to assemble individual samples separately and then assemble individual unitigs again. In addition, I believe discovar is better at assembly-based variant calling. I have not done a direct comparison, though, as older discovar worked best with 250bp paired-end reads which are not often seen nowadays. It would be good to contact David Jaffe about his latest progress.

lh3 commented 8 years ago

To assemble unitigs:

ropebwt2 -bCr sample1.mag.gz > sample1.fmr   # FM-index for sample1
ropebwt2 -bCri sample1.fmr sample2.mag.gz > sample1-2.fmr  # incrementally insert sample2 to the FM-index
ropebwt2 -bCri sample1-2.fmr sample3.mag.gz > sample1-3.fmr
ropebwt2 -di sample1-3.fmr > sample1-3.fmd   # convert incremental FM-index to the fermi2 read-only format
fermi2 assemble -l $MIN_OVLP_LEN sample1-3.fmd > sample1-3.mag   # assemble

I have not tried this approach. Don't really know if it works.

ekg commented 8 years ago

@willgdjones has been trying this, but ran into a bug immediately at one of the steps (in <1s). Maybe he could document that here?

willgdjones commented 8 years ago

Hi Heng - that's right I'm hitting: when executing the final fermi2 assemble command. Any ideas what this could be? I can post up the MAG files I'm using.

fm6_unitig: Assertion 'e->mcnt[1] >= n_threads * 2' failed

The MAG files I used are in this repo: https://github.com/willgdjones/test