mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

HiFi and high quality ultra-long ONT #733

Open JohnUrban opened 1 month ago

JohnUrban commented 1 month ago

Hello sir,

Long-time fan and user of Flye, ABruijn, etc.

I have three datasets:

I know there are other assemblers that may out-perform Flye with these datasets, but I am having trouble with them:

Thus, I would like to see what Flye can do here. Perhaps we will use the Flye assembly as is, but I am also wondering if it could be used as a data compression step. For example, I could use the Flye assembly in combination with a smaller subset of data with one of the assemblers above. Just riffing here - I know others would poo-poo such an idea.

So, my question is: Do you have a recommended pipeline to make use of all three datasets, or at least the first two?

I know the FAQ answers a related version of this question, but for older data types:

Can I use both PacBio and ONT reads for assembly?
You can do this as follows: first, run the pipeline with all your reads in the --pacbio-raw mode (you can specify multiple files, no need to merge all your reads into one). Also add --iterations 0 to stop the pipeline before polishing.

Once the assembly finishes, run polishing using either PacBio or ONT reads only. Use the same assembly options, but add --resume-from polishing. Here is an example of a script that should do the job (thanks to @jvhaarst):

flye --pacbio-raw $PBREADS $ONTREADS --iterations 0 --out-dir $OUTPUTDIR --genome-size $SIZE --threads $THREADS
flye --pacbio-raw $PBREADS --resume-from polishing --out-dir $OUTPUTDIR  --genome-size $SIZE --threads $THREADS

Would it be recommended to swap out the --pacbio-raw flag for --pacbio-hifi?

flye --pacbio-hifi $PBREADS $ONTREADS --iterations 0 --out-dir $OUTPUTDIR --genome-size $SIZE --threads $THREADS
flye --pacbio-hifi $PBREADS --resume-from polishing --out-dir $OUTPUTDIR  --genome-size $SIZE --threads $THREADS

Or maybe treat both as --nano-hq, followed by --pacbio-hifi polishing?

flye --nano-hq $PBREADS $ONTREADS --iterations 0 --out-dir $OUTPUTDIR --genome-size $SIZE --threads $THREADS
flye --pacbio-hifi $PBREADS --resume-from polishing --out-dir $OUTPUTDIR  --genome-size $SIZE --threads $THREADS

Or perhaps even some type of 3 or 4 step procedure, using intermediate assemblies as part of the input for the final assembly:

# hifi asm
flye --pacbio-hifi $PBREADS  --iterations 0 --out-dir $OUTPUTDIR --genome-size $SIZE --threads $THREADS

# nano hq asm
flye --nano-hq $PBREADS $ONTREADS --iterations 0 --out-dir $OUTPUTDIR --genome-size $SIZE --threads $THREADS

# combined asm (either --pacbio-hifi or --nano-hq flag)
flye --pacbio-hifi $PBREADS $ONTREADS $HIFIASM $NANOASM --resume-from polishing --out-dir $OUTPUTDIR  --genome-size $SIZE --threads $THREADS

# polishing
flye --pacbio-hifi $PBREADS --resume-from polishing --out-dir $OUTPUTDIR  --genome-size $SIZE --threads $THREADS

Any thoughts would be appreciated.

Best,

John

p.s. I suppose the nanopore reads could be corrected with Herro as a possibility too.

p.p.s. As for the Hi-C data, I know Flye doesn't take it directly. Do you recommend a particular Hi-C scaffolder for Flye assemblies, and are there any clean-up steps recommended prior to using it?

JohnUrban commented 1 month ago

I guess a follow-up would be:

JohnUrban commented 1 month ago

Well, I will try each of these strategies and report back to you, but I am happy to hear you weigh in anyway if you get a chance.

aabaricalla commented 1 month ago

Hi @JohnUrban .

A few suggestions:

1- Try usegalaxy.org or usegalaxy.eu, public platforms that offer Verkko, Hifiasm, and Flye. You would not run into the RAM problems you are having right now. The European server is usually the better option.
2- Regarding "Would it be recommended to swap out the --pacbio-raw flag for --pacbio-hifi?": I suggest keeping the "flye --pacbio-raw" option, because you are merging ONT and CCS reads, so the combined read set does not have a uniform error rate below 1%.
3- Use "--nano-hq" if you have the latest Q20 ONT reads.
4- To combine multiple assemblies, use "flye --subassemblies" instead of passing the assemblies as reads.
5- Regarding "p.s. I suppose the nanopore reads could be corrected with Herro as a possibility too.": if most of your reads are longer than 10,000 bases, Herro is recommended. You can then use the corrected reads in Flye or in any other assembler.
6- You can do a Flye assembly followed by Hi-C scaffolding with the YaHS + Arima pipeline.
7- Don't forget to check BUSCO alongside the N50 or auN metrics.

Flye is awesome, one of my personal favorites, but I suggest trying multiple strategies and keeping the best result.
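For the N50 and auN metrics mentioned in point 7, here is a minimal sketch of how both can be computed from a list of contig lengths. The function name contig_stats and the file name assembly.fasta.fai are just placeholders for illustration, not part of Flye or any other tool. N50 is the length of the contig at which the running sum of the length-sorted contigs reaches half of the total, and auN is the sum of squared lengths divided by the total length.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Flye): read contig lengths, one integer
# per line, on stdin and print N50 and auN.
contig_stats() {
  sort -rn | LC_ALL=C awk '
    { len[NR] = $1; total += $1; sumsq += $1 * $1 }
    END {
      if (total == 0) { print "N50=0 auN=0"; exit }
      run = 0
      for (i = 1; i <= NR; i++) {
        run += len[i]
        # N50: first contig (longest first) where the running sum
        # reaches at least half of the total assembly length
        if (run * 2 >= total) { n50 = len[i]; break }
      }
      # auN: length-weighted mean contig length
      printf "N50=%d auN=%.1f\n", n50, sumsq / total
    }'
}

# Toy input: four contigs of 10, 20, 30 and 40 bases
printf '10\n20\n30\n40\n' | contig_stats
```

With a real assembly, the lengths could come from column 2 of a FASTA index, for example: cut -f2 assembly.fasta.fai | contig_stats (assuming the index was made with samtools faidx).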

Isoris commented 1 month ago

Hello, I would like to ask whether Flye could improve a chromosome-level assembly by resolving repetitive regions.

For instance, I have made a QV50, 99.4% k-mer-complete genome assembly of my species, but there are still 600 gaps left after using all of the Hi-C data plus the reference genomes of all related species and running TGS and RagTag. I wonder whether Flye would give better results in these very small gaps, which are probably made of tandem repeats. Or how can I reduce the gaps from 600 to 1? In another species of catfish the authors got only 1 gap in their assembly, but when I align onto it I still get 500 gaps. I have tried wtdbg2, hifiasm in Hi-C/UL mode with all combinations of read types, with and without purging, and GreenHill, a scaffolder and phasing tool. Even when combining all of these solutions, I could only get down to 400 gaps.

I have 13X HIC reads, 30X old nanopore reads (20% error), 40X Hi-C, and 40X PE150. What would you suggest I do in this position?

Thank you for your answer, Quentin.

aabaricalla commented 1 month ago

Hi @Isoris

Certainly you can use Flye as a polisher, but usually this is reserved for HiFi reads. It would be best to try it with the ONT reads or, preferably, error-corrected ONT reads (corrected with Herro or Canu self-correction, or with Ratatosk or MaSuRCA for Illumina-corrected ONT reads).

Having this:

I have 13X HIC reads, 30X old nanopore reads (20% error), 40X Hi-C, and 40X PE150. What would you suggest I do in this position?

You can also try Redundans to improve your gaps.

Last but not least, is it truly necessary to close 400 gaps? Are your N50/N90/auN values (>5 Mb) and BUSCO scores (>80-90%) acceptable? If you already have a chromosome-scale assembly, do you want to invest that much time in closing this small number of gaps?

Isoris commented 1 month ago

OK, so I'll proceed to targeted polishing, run Redundans, and that's it. Thanks.

JohnUrban commented 3 weeks ago

Based on generating a bunch of assemblies, it seems like the best way to go would be to: (1) Generate the HiFi assembly alone. (2) Re-run Flye with the UL-ONT data and something similar to the now-deprecated --subassemblies HiFi-asm.fa option, to get Flye to treat it more like a scaffolding problem (ONT scaffolds the HiFi assembly).

The HiFi data alone produces better assemblies than any of the ways of combining HiFi and ONT that I've tested, but it is clear from the Verkko/Hifiasm assemblies that the ONT data does add scaffolding information.

@mikolmogorov does the --subassemblies option still sort of work? Do you know of a way to map the ONT data to the assembly graph to help with long-range path info and scaffolding? Maybe there is already a set of tools out there that one could tack on to the end of a HiFi Flye run to make better use of the ONT data. I guess I would be hesitant, though, to use older tools like PBJelly...

mikolmogorov commented 1 week ago

@JohnUrban the --subassemblies option is deprecated, but you can manually adjust alignment parameters in the config file using the --extra-params option to emulate it.
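As a rough illustration of the syntax only: --extra-params takes a comma-separated list of key=value overrides for the config file. The sketch below is a dry run that just prints a command line. PARAM_A and PARAM_B are placeholders, not real Flye config keys, and the choice of --nano-hq, the file names, and the helper name build_flye_cmd are all assumptions; the actual alignment parameter names and sensible values would have to be looked up in the config file shipped with your Flye version.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints a Flye command line, does not execute Flye.
# Arguments: reads file, output directory, thread count, override list.
# PARAM_A / PARAM_B below are placeholders, NOT real Flye config keys.
build_flye_cmd() {
  printf 'flye --nano-hq %s --out-dir %s --threads %s --extra-params %s\n' \
    "$1" "$2" "$3" "$4"
}

build_flye_cmd ont.fastq.gz flye_out 16 "PARAM_A=VALUE,PARAM_B=VALUE"
```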