mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Feature Request: Allow longer minimum overlaps than 10kb. #736

Open JohnUrban opened 3 weeks ago

JohnUrban commented 3 weeks ago

The title says it all.

I have high-accuracy nanopore data with a read N50 of >100 kb (representing >30-50X coverage), and I would like to try minimum overlaps of 25 kb and 50 kb, but I get this error:

flye: error: argument -m/--min-overlap: value should be in the range [1000, 10000]

It looks like this could be as simple as changing the min and max values in the argument parser around line 624 in flye/main.py:

    623     parser.add_argument("-m", "--min-overlap", dest="min_overlap", metavar="int",
    624                         type=lambda v: check_int_range(v, 1000, 10000),
    625                         default=None, help="minimum overlap between reads [auto]")

...but I don't know how that would affect anything downstream that may assume a maximum of 10 kb.
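To make the proposed change concrete, here is a minimal standalone sketch of a range-checked argparse option with a configurable upper bound. It does not reproduce Flye's code: `int_in_range`, `build_parser`, the `overlap-demo` program name, and the 100 kb default cap are hypothetical names and values chosen for illustration, and Flye's own `check_int_range` may behave differently.

```python
import argparse


def int_in_range(low, high):
    """Return an argparse `type` callable accepting integers in [low, high].

    Hypothetical helper for illustration; not Flye's check_int_range.
    """
    def parse(value):
        try:
            number = int(value)
        except ValueError:
            raise argparse.ArgumentTypeError(f"invalid int value: {value!r}")
        if not low <= number <= high:
            raise argparse.ArgumentTypeError(
                f"value should be in the range [{low}, {high}]")
        return number
    return parse


def build_parser(max_overlap=100_000):
    """Build a demo parser whose upper overlap bound is a parameter.

    The 100 kb default is an arbitrary illustrative value.
    """
    parser = argparse.ArgumentParser(prog="overlap-demo")
    parser.add_argument("-m", "--min-overlap", dest="min_overlap", metavar="int",
                        type=int_in_range(1000, max_overlap), default=None,
                        help="minimum overlap between reads [auto]")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-m", "25000"])
    print(args.min_overlap)
```

This only relaxes the command-line check. Whether later stages assume a 10 kb ceiling is a separate question that the sketch does not address.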

If there is a reason longer overlaps are not allowed, please let me know.

Many thanks.

(P.S. I will try modifying the argument parser in the meantime.)

JohnUrban commented 3 weeks ago

Since filing this feature request, I made the adjustment I suggested. In some cases, allowing 25 kb, 50 kb, and/or 75 kb overlaps led to higher contiguity for ONT-UL assemblies, and 15-20 kb overlaps did the same for HiFi. (I cannot say whether the extra contiguity was accurate, though.)

mikolmogorov commented 2 weeks ago

In principle, it should be possible to increase, but this will require extensive testing. Is there evidence that you are getting better assemblies with increased minimum overlap?

JohnUrban commented 2 weeks ago

When the coverage is high enough and the reads are long enough, I did see contiguity increase.

As for other metrics, if you don't mind waiting, I will report back anything I learn about them in the coming month or two.

As you know better than anyone, Flye sets the minimum overlap (capped at 10 kb) based on read N50, seemingly without considering the amount of coverage. So it sets the same overlap for 30X coverage as for 300X coverage.

I have 120X ultra-long nanopore and 600X HiFi, so I wanted to test the longer overlap cutoffs since I technically have far more coverage than needed for a great Flye assembly.
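For anyone who wants to check whether their own data could support a longer cutoff, here is a rough sketch of the quantities discussed above: read N50, total coverage, and the coverage that remains when only reads at least as long as a candidate minimum overlap are counted (shorter reads cannot form an overlap of that length). The function names and the toy numbers are assumptions for illustration; this is not how Flye chooses its overlap.

```python
def read_n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no reads


def coverage(lengths, genome_size):
    """Total sequenced bases divided by the genome size."""
    return sum(lengths) / genome_size


def coverage_in_long_reads(lengths, genome_size, min_length):
    """Coverage contributed only by reads of at least min_length bases."""
    return sum(l for l in lengths if l >= min_length) / genome_size


if __name__ == "__main__":
    # Toy read lengths and genome size, purely illustrative.
    lengths = [100, 80, 60, 40, 20]
    print(read_n50(lengths))
    print(coverage(lengths, 100))
    print(coverage_in_long_reads(lengths, 100, 60))
```

Comparing the last figure at 10 kb, 25 kb, and 50 kb gives a quick sense of how much usable coverage a longer minimum overlap leaves.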