lh3 / fermikit

De novo assembly based variant calling pipeline for Illumina short reads

(ha)ploidy #3

Closed Perugolate closed 9 years ago

Perugolate commented 9 years ago

Is it possible to change the ploidy for the variant calling steps? I can't see a flag for it.

lh3 commented 9 years ago

The entire de novo assembly process makes no assumptions about ploidy, I believe. Variant calling does assume a diploid sample, but it should not be too hard to drop this assumption (EDIT: hmm, this still needs some code refactoring). I will leave this issue open.

lh3 commented 9 years ago

For haploid genomes, I would recommend calling them as diploid and then filtering out the (spurious) heterozygotes.
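A minimal sketch of such a post-filter, assuming a single-sample VCF with the sample in column 10 and GT as the first FORMAT key. The function name drop_hets is hypothetical and is not part of fermikit; it only illustrates the "call as diploid, then drop heterozygotes" idea.

```shell
# Hypothetical helper, not part of fermikit: drop heterozygous calls from a
# single-sample VCF read on stdin. Assumes the sample is in column 10 and
# that GT is the first colon-separated FORMAT key.
drop_hets() {
  awk -F'\t' '
    /^#/ { print; next }                 # pass header lines through unchanged
    {
      split($10, fmt, ":")               # fmt[1] is the GT string, e.g. 0/1 or 1|1
      n = split(fmt[1], a, "[/|]")       # alleles on either side of / or |
      if (n == 2 && a[1] != a[2]) next   # heterozygous call: skip the record
      print                              # homozygous, haploid or missing: keep
    }'
}
```

It could be used as `drop_hets < calls.vcf > calls.hom.vcf`, where both file names are placeholders. Records with a missing genotype (./.) are kept, since the two alleles compare equal; whether that is desirable depends on the downstream analysis.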

ekg commented 9 years ago

What about pooled samples, perhaps with tens or hundreds of individual genomes represented?

lh3 commented 9 years ago

BFC keeps a base if there are 4 or more bases of the same type. Unitig construction is not directly affected by ploidy or base frequency (though low-frequency haplotypes are harder to assemble). Graph simplification may pop a bubble when the frequency is below 15% (controlled by option fermi2 simplify -r). Variant calling is unaware of ploidy. Filtering sets a threshold on allele balance, so it will filter out low-frequency alleles.

In summary, fermikit is tuned for diploid samples, but with minor modifications, it should work with pooled samples or samples having low-frequency alleles.
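To illustrate why an allele-balance threshold removes low-frequency alleles in a pool: a variant carried by one chromosome in a pool of 10 diploid individuals has an expected allele fraction of 1/20 = 5%. The sketch below is illustrative only and is not the filter fermikit applies; the function name allele_balance_ok and the 0.2 threshold in the usage note are assumptions chosen for the example.

```shell
# Illustrative only: this is NOT the filter fermikit applies.
# Prints PASS if the alt-allele fraction alt/(ref+alt) is at least min_ab,
# otherwise FAIL. The function name and any threshold passed in are hypothetical.
allele_balance_ok() {
  # $1 = ref read count, $2 = alt read count, $3 = minimum allele balance
  awk -v ref="$1" -v alt="$2" -v min_ab="$3" 'BEGIN {
    depth = ref + alt
    if (depth == 0) { print "FAIL"; exit }   # no coverage: cannot pass
    print ((alt / depth >= min_ab) ? "PASS" : "FAIL")
  }'
}
```

With a hypothetical threshold of 0.2, `allele_balance_ok 12 10 0.2` prints PASS (a typical diploid heterozygote), while `allele_balance_ok 95 5 0.2` prints FAIL (a 5% allele in a pool). Retaining such alleles would mean lowering or removing this kind of threshold.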

ekg commented 9 years ago

That's excellent to hear. I have been wondering if we could build graph reference systems using fermikit on top of pooled samples. It would seem that this should work, barring memory issues caused by increasing the input read set.

Is there any way to merge cleaned graphs from different single-sample assemblies?

Perhaps this is a topic to discuss off-thread.

lh3 commented 9 years ago

You can assemble unitigs with:

cat *.mag.gz | ropebwt2 -dr > unitigs.fmd
fermi2 assemble -l51 -t16 unitigs.fmd > assembled.mag

simplify would not work well. You will also lose per-base read depth.

ekg commented 9 years ago

Losing the per-base depth is not a problem, as the idea will be to realign the input sequencing reads to the graph. This will provide full annotation of sequencing quality and depth for anything in the graph. Also, pruning bubbles created by small variants would not be too much of a problem as we can recover these via resequencing against the combined graph.
