QUESTION: What's the largest genome that end-users have assembled with RAVEN?

lbcb-sci / raven

De novo genome assembler for long uncorrected reads

MIT License

QUESTION: What's the largest genome that end-users have assembled with RAVEN? #36

Open cement-head opened 3 years ago

cement-head commented 3 years ago

What's the largest genome that end-users have assembled with RAVEN?
Did you use the GPU version (built with CUDA support)?
What were your options, if any?
How long did it take?
Approximately how big was your computer?

rvaser commented 3 years ago

Here is the preprint: https://www.biorxiv.org/content/10.1101/2020.08.07.242461v1. Note that the benchmark used version 1.1.10, and versions 1.3.0 and later use far less memory. We should update the preprint soon. Answers:

I think 3 Gbp (haploid), but I am not sure.
We did not benchmark with CUDA enabled.
No additional options, only number of threads.
Depends on coverage, see preprint.
1 TB RAM, 128 cores (run with 64 threads).

cement-head commented 3 years ago

Okay, we just assembled a 6.0 Gbp beastie, but RAVEN gave us an assembly of just over 7.0 Gbp.

It took five days with 2 TB ECC RAM, 124 threads, and two CUDA GPUs (RTX Titans, used for polishing; -c=100).

Given that the assembly is a little large, I'm wondering if I should change any of these three parameters, and whether or not you'd have some recommendations?

    -m, --match <int>
      default: 3
      score for matching bases
    -n, --mismatch <int>
      default: -5
      score for mismatching bases
    -g, --gap <int>
      default: -4
      gap penalty (must be negative)

cement-head commented 3 years ago

Also, would increasing the rounds of polishing (RACON) drastically improve the assembly?

cement-head commented 3 years ago

Okay, a BUSCO analysis reports only 0.1% complete. Something is wrong. Would you suggest increasing the penalty for the mismatch score?

rvaser commented 3 years ago

Can you print the assembly statistics (length/#contigs/NX/NGX)? Which sequencing technology are you using? What is the sequencing depth? The BUSCO score is abysmal, not sure if changing alignment parameters will help. Running more than 2 iterations of Racon will not increase the accuracy by much either.

Sorry for my late reply! Best regards, Robert

P.S. You can also paste here the log Raven created.

cement-head commented 3 years ago

The technology is PacBio SII CLR, and the N50 of the raw reads is >36 kbp.

The coverage is about 70x.
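
For readers who want to check a depth figure like this one, here is a minimal Python sketch of the usual estimate: total sequenced bases divided by the expected genome size. It assumes the 70x refers to the 6.0 Gbp estimate mentioned earlier in the thread. The function names are hypothetical and not part of RAVEN.

```python
def sequencing_depth(total_read_bases, genome_size):
    """Estimated depth (fold coverage) = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def bases_for_depth(depth, genome_size):
    """Total read bases implied by a given depth and genome size."""
    return depth * genome_size


# Assumption: 70x depth measured against a 6.0 Gbp genome estimate.
implied_bases = bases_for_depth(70, 6_000_000_000)
print(implied_bases)  # 420000000000, i.e. about 420 Gbp of raw reads
```

Under that assumption, the run would have started from roughly 420 Gbp of raw reads.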

Q: Would adjusting the -m, -n, -g parameters improve assembly?

What file is the RAVEN logfile?

Here's the QUAST analysis; the # of contigs is good-ish, but the N50 isn't the greatest:

Assembly                    raven_asm 
# contigs (>= 0 bp)         25505     
# contigs (>= 1000 bp)      25505     
# contigs (>= 5000 bp)      25505     
# contigs (>= 10000 bp)     25504     
# contigs (>= 25000 bp)     25504     
# contigs (>= 50000 bp)     25473     
Total length (>= 0 bp)      7048262437
Total length (>= 1000 bp)   7048262437
Total length (>= 5000 bp)   7048262437
Total length (>= 10000 bp)  7048257309
Total length (>= 25000 bp)  7048257309
Total length (>= 50000 bp)  7046876721
# contigs                   25505     
Largest contig              3296975   
Total length                7048262437
GC (%)                      43.05     
N50                         337254    
N75                         208232    
L50                         6350      
L75                         13031     
# N's per 100 kbp           0.00
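
The QUAST report above lists N50 and N75 but not the NG values that were requested. As a rough sketch, NX and NGX can be computed directly from the contig lengths. The function name `nx` and the toy lengths below are illustrative assumptions, not QUAST or RAVEN code.

```python
def nx(lengths, x=50, reference_length=None):
    """Return NX of the contig lengths, or NGX when reference_length is given.

    NX is the length of the contig at which the running total, taken from the
    longest contig downwards, first reaches x% of the assembly length. NGX
    uses x% of the (estimated) genome size instead of the assembly length.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    total = reference_length if reference_length is not None else sum(lengths)
    threshold = total * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # the assembly covers less than x% of the reference length


# Toy contig lengths, not the real assembly.
toy = [80, 70, 50, 40, 30, 20]
print(nx(toy))                        # N50  -> 70
print(nx(toy, reference_length=400))  # NG50 -> 50
```

Because this assembly's total length (about 7.05 Gbp) is larger than the expected 6.0 Gbp, its NG50 computed against 6.0 Gbp would be at least as large as the reported N50.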

rvaser commented 3 years ago

The log is written to stderr. I am not sure that changing alignment parameters will help at all. The assembly is quite fragmented, which might be the reason for the poor BUSCO result.
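
Since the log goes to stderr, one way to keep it is to redirect stderr to a file when launching the assembler. Below is a minimal Python sketch. It assumes Raven writes the assembly to stdout, and the file names and thread count in the commented call are placeholders. A stand-in command is used so the sketch runs without Raven installed.

```python
import os
import subprocess
import sys
import tempfile


def run_and_capture(cmd, stdout_path, stderr_path):
    """Run cmd, writing stdout and stderr to separate files; return the exit code."""
    with open(stdout_path, "wb") as out, open(stderr_path, "wb") as err:
        return subprocess.run(cmd, stdout=out, stderr=err, check=False).returncode


# Hypothetical Raven call (input file, output names, and thread count are placeholders):
#   run_and_capture(["raven", "-t", "64", "reads.fastq.gz"], "assembly.fasta", "raven.log")

# Stand-in command: prints one FASTA-like line to stdout and one log-like
# line to stderr, mimicking the assumed split between assembly and log.
stand_in = [
    sys.executable,
    "-c",
    "import sys; print('>contig_1'); print('log message', file=sys.stderr)",
]

with tempfile.TemporaryDirectory() as workdir:
    fasta_path = os.path.join(workdir, "assembly.fasta")
    log_path = os.path.join(workdir, "raven.log")
    exit_code = run_and_capture(stand_in, fasta_path, log_path)
    with open(fasta_path) as handle:
        fasta_text = handle.read().strip()
    with open(log_path) as handle:
        log_text = handle.read().strip()

print(exit_code, fasta_text, log_text)
```

The resulting log file can then be pasted into the issue as requested.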