lh3 / fermi-lite

Standalone C library for assembling Illumina short reads in small regions
MIT License

No assembly reported for 100 reads with the same sequence #7

Open nh13 opened 6 years ago

nh13 commented 6 years ago

@lh3 I was playing around with this tool but I couldn't get it to work on a "simple" case. I duplicated a read 100 times and would expect it to output the duplicated read. Any thoughts?

``` @M50205:20:000000000-B82KM:1:1108:8421:4217/2 CTAAGGTGGACATGTTGGCTTCTCTCTGTTCTTAACATGTTAAAATTAAAATTAACTTCTCTGGTGTGTGGAGATGTCTTACAATAACAGTTGCTACTATTTCTTTTCTTTTTCTCTTTCTTTCCTCTCTCTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTAGACAAGGTCTCAATTTGTCACTCAGAGTGAAGTGCATTGGCATGAACATTGCTCACTTCATCCTTAACCTTCTTGGCCAAAGAACTCCTCCTGCCTCACCCCC + 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 ```
nh13 commented 6 years ago

I forgot to mention the context. I want to re-assemble a set of reads that I know originate from the same haploid copy of the genome, and the region is a tandem repeat. All the reads should start and end around the same place, so it's a bit easier than general assembly.

lh3 commented 6 years ago

These 100 reads will be collapsed to one read. You will get a singleton contig, which will be ignored unless you tune parameters.
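To make the collapsing step concrete, here is a minimal Python sketch of the idea, not fermi-lite's actual code. It assumes that identical reads (on either strand) are merged into one entry with a count, and that a result backed by too few distinct reads is dropped. The names `collapse_reads`, `keep_supported`, and `min_distinct_reads` are hypothetical and do not correspond to fermi-lite functions or options.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def collapse_reads(reads):
    """Count how many input reads share each distinct sequence (either strand)."""
    counts = Counter()
    for read in reads:
        # Use the lexicographically smaller strand as the canonical key.
        canonical = min(read, revcomp(read))
        counts[canonical] += 1
    return counts


def keep_supported(collapsed, min_distinct_reads=2):
    """Hypothetical filter: drop a result supported by too few distinct reads.

    This only mimics the behaviour described in the comment above; the real
    filtering in fermi-lite may work differently.
    """
    if len(collapsed) < min_distinct_reads:
        return []
    return list(collapsed)


# A shortened stand-in for the duplicated read from the report.
read = "CTAAGGTGGACATGTTGGCTTCTC"
collapsed = collapse_reads([read] * 100)

print(len(collapsed), sum(collapsed.values()))
print(keep_supported(collapsed))
print(keep_supported(collapsed, min_distinct_reads=1))
```

In this toy model, 100 copies carry no more overlap information than one copy, so the singleton survives only when the (hypothetical) threshold is relaxed.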

For cfDNA-like data, assembly may not work well.

nh13 commented 6 years ago

That is actually what I want: a single contig at the end of the day. Think haploid variant calling across repeat regions with indel and mismatch errors. All reads would come from the same DNA molecule.

nh13 commented 6 years ago

I am considering using this instead of consensus calling for duplex sequencing. In this case we have stutter due to PCR slippage across STRs.
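As a rough illustration of why stutter is awkward for column-wise consensus, the sketch below tallies the repeat copy number seen in each read and reports the most common one. It is a toy under stated assumptions, not a duplex consensus caller and not part of fermi-lite: the `CTTT` motif, the flanking sequences, and the read counts are invented for the example, and `longest_repeat_run` and `modal_copy_number` are hypothetical helper names.

```python
import re
from collections import Counter


def longest_repeat_run(read: str, motif: str) -> int:
    """Return the largest number of back-to-back motif copies found in the read."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(read)]
    return max(runs, default=0)


def modal_copy_number(reads, motif):
    """Return the most frequent copy number and the full tally across reads."""
    tally = Counter(longest_repeat_run(r, motif) for r in reads)
    return tally.most_common(1)[0][0], tally


def make_read(copies: int) -> str:
    """Build a synthetic read: fixed flanks around a CTTT repeat (assumed layout)."""
    return "GGTGTGGAGA" + "CTTT" * copies + "AGACAAGG"


# Simulated stutter: most reads keep 12 copies, a few lose or gain one unit.
reads = [make_read(12)] * 7 + [make_read(11)] * 2 + [make_read(13)]
mode, tally = modal_copy_number(reads, "CTTT")
print(mode, dict(tally))
```

Reads that differ by one repeat unit are shifted relative to each other downstream of the repeat, which is why a per-position vote without realignment can mislead, and why an assembly-style approach is attractive here.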

nh13 commented 6 years ago

Also, the introduction in the README implied it would be suitable for re-assembly of short reads, even in runs of LOH. Would you mind sharing how you would tune the parameters to output the single contig?

lh3 commented 6 years ago

Your example is violating the basic assumption of assembly and won't happen in practice. You need to test on real data.

nh13 commented 6 years ago

@lh3 challenge accepted, I'll send you a real-world dataset where this can happen!

nh13 commented 6 years ago

@lh3 I was wondering whether you received the dataset I mentioned. I believe it would be a novel application of fermi-lite, where we aren't assembling a genome but rather reconstructing a source molecule. Applications such as re-assembling reads from the same long molecule (e.g. 10x) or from novel sequencing preparations (e.g. Duplex Sequencing) could benefit from proper assembly of reads from a single molecule.