malonge / RagTag

Tools for fast and flexible genome assembly scaffolding and improvement
MIT License

Scaffold inserting strings of 100 N's? #157

Closed cizydorczyk closed 1 year ago

cizydorczyk commented 1 year ago

I understand that scaffold has the -r, -g, and -m parameters to infer gap sizes and to set the minimum and maximum inferred gap sizes, respectively.

When I run ragtag scaffold with these set to something like -r -g 2 -m 1000000 (to essentially not limit these in my current assembly), I end up with strings of 100 N's inserted in various places, in addition to appropriately sized gaps.

Several instances of exactly 100 N's being inserted is not a coincidence; where are these coming from, if not from these parameters? Does ragtag scaffold default to 100 N's in some other context?
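For anyone wanting to check the same thing in their own output, here is a minimal sketch that tallies N-run lengths in FASTA text. The function name `gap_length_counts` is made up for illustration; it is not part of RagTag.

```python
import re
from collections import Counter


def gap_length_counts(fasta_text):
    """Return a Counter mapping N-run length -> number of runs of that length.

    Hypothetical helper, not a RagTag function. Sequence lines are joined
    per record first, so a gap wrapped across several lines counts once.
    """
    counts = Counter()
    seq_parts = []

    def flush():
        # Tally every maximal run of N/n in the record collected so far.
        seq = "".join(seq_parts)
        for match in re.finditer(r"[Nn]+", seq):
            counts[len(match.group())] += 1
        seq_parts.clear()

    for line in fasta_text.splitlines():
        if line.startswith(">"):
            flush()  # a header line closes the previous record
        else:
            seq_parts.append(line.strip())
    flush()  # close the final record
    return counts
```

Reading the scaffold FASTA into a string and calling `gap_length_counts(text).most_common(5)` should make a spike at length 100 easy to spot next to the variously sized inferred gaps.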

Thanks

malonge commented 1 year ago

Hi there,

Thanks for the question. You can find details in the paper. Indeed, there are certain conditions where RagTag defaults to 100 bp.

cizydorczyk commented 1 year ago

Right -- but which condition might be triggering in this case? I am explicitly trying to avoid any defaulting and am setting options to avoid it, but it still happens.

Am I missing something?

malonge commented 1 year ago

Hi there,

Thanks for your question. At least one condition is when the evidence suggests that sequences overlap, thus suggesting a negative gap length. That is what is meant by "All inferred gap sizes must be at least 1 bp".

There is no way to guarantee that no 100 bp gaps will be used. Sometimes the evidence necessitates a 100 bp gap even when -r, -g, and -m are set.
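As a rough illustration of the behavior described above, the fallback can be sketched as follows. This is not RagTag's actual code: the function name `choose_gap_size` and its signature are hypothetical, and the handling of values outside the -g/-m range is an assumption. Only the overlap case (an inferred length below 1 bp) is taken directly from this thread.

```python
DEFAULT_GAP = 100  # placeholder length used when no valid gap size is available


def choose_gap_size(inferred, min_gap, max_gap, default=DEFAULT_GAP):
    """Pick the gap length to write between two adjacent sequences.

    Hypothetical sketch, not RagTag's implementation.
    inferred: gap length suggested by the alignments, or None if unavailable.
    min_gap, max_gap: stand-ins for the values passed to -g and -m.
    """
    if inferred is None:
        return default
    # Overlapping sequences imply a zero or negative gap; every inferred
    # gap must be at least 1 bp, so fall back to the default.
    if inferred < 1:
        return default
    # Assumption: values outside the user-set range also fall back.
    if inferred < min_gap or inferred > max_gap:
        return default
    return inferred
```

With `min_gap=2` and `max_gap=1000000`, an inferred length of -250 (an apparent overlap) still yields 100 under this sketch, which would match 100 N gaps appearing alongside properly sized ones.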