marbl / Winnowmap

Long read / genome alignment software

No -I option #40

Open soisa001 opened 11 months ago

soisa001 commented 11 months ago

When running Winnowmap, the -I option is not recognized. For example, after generating repetitive_k15.txt with meryl:

winnowmap -W repetitive_k15.txt -a -x map-pb -Y -L --eqx --cs -I 32G ref.fa.gz reads.fastq.gz | samtools view -hb | samtools sort -@8 > alignment_sorted.bam

Yields the following error:

[ERROR] unknown option in "-I"

The -I option is needed for a multi-part index. Thanks.

cjain7 commented 11 months ago

Sorry, multi-part indexing is not supported yet.

diego-rt commented 8 months ago

Hi @cjain7

Does this mean that it's not possible to map to genomes larger than 4G while getting accurate mapQs? For minimap2 this would be the case in the absence of the -I flag.

Thanks!

skoren commented 2 months ago

I saw that this change was added after the last release (v2.0.3), so the conda-installed versions still allow the -I option. I do see slight differences in alignments when increasing -I on genomes larger than 4 Gb. Is it safe to use this option, assuming no saved index is used, or was it removed because it was not working correctly in v2.0.3 either?

cjain7 commented 1 month ago

Hi Sergey, I looked at this now; sorry for the delay in responding. Your question is best answered at the minimap2 help page https://lh3.github.io/minimap2/minimap2.html

Increasing the -I value will help you get slightly more accurate alignments, because having the entire reference in one index helps both to identify the best alignment for a read and to compute the mapping qualities. In my view, the -I option should not be exposed to the user during read-to-genome mapping. If it is provided, it is best to ensure that the value is larger than the genome size. Most likely, this was the reason why I omitted -I from the development code.

My guess is that minimap2 has the -I parameter because it is also used as a read overlapper and for mapping reads to very large reference databases. Even then, having -I is sub-optimal, but it is necessary to control RAM usage.
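
As a rough illustration of the advice above (keep -I larger than the genome size), here is a minimal shell sketch that counts the bases in a reference FASTA and prints an -I value above that total. The helper names total_bases and suggest_index_size are hypothetical and not part of Winnowmap or minimap2. The sketch assumes that -I accepts a G suffix meaning 10^9 bases, as in the `-I 32G` command from the original report, and that the extra gigabase of headroom is an arbitrary choice.

```shell
# Sketch only: total_bases and suggest_index_size are hypothetical helpers,
# not part of Winnowmap or minimap2.

# Count sequence bases in a FASTA file (plain or gzipped), skipping header lines.
total_bases() {
    gzip -cdf "$1" | awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }'
}

# Print an -I value, in whole gigabases, that exceeds the total reference length.
# Integer division rounds down, so adding 2 leaves at least 1 G of headroom.
suggest_index_size() {
    bases=$(total_bases "$1")
    echo "$(( bases / 1000000000 + 2 ))G"
}

# Example usage (file names taken from the command in the original report):
# winnowmap -W repetitive_k15.txt -a -x map-pb \
#     -I "$(suggest_index_size ref.fa.gz)" ref.fa.gz reads.fastq.gz
```

Memory use grows with the index size, so a value chosen this way trades RAM for having the whole reference in a single index.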

skoren commented 1 month ago

The issue is that the default -I is only 4 Gb, so even a diploid human genome is too big and we'd want to increase -I (in fact, we do so when mapping to both haplotypes for all our T2T analysis: https://github.com/arangrhie/T2T-Polish/blob/master/winnowmap/map.sh). There are also much larger genomes (see https://github.com/marbl/verkko/issues/252), which is what made me start looking into this. For such large references, it sounds like it is important to set -I larger than the genome size, so it'd be nice to keep it available in future releases since it is being used.

diego-rt commented 1 month ago

Sorry to jump in, but as a heavy user of giant genomes (30 Gbp and more), I think it is absolutely indispensable to have the -I option enabled.

cjain7 commented 1 month ago

Understood, thank you! The -I option is now back :)