lbcb-sci / raven

De novo genome assembler for long uncorrected reads
MIT License

QUESTION: What's the largest genome that end-users have assembled with RAVEN? #36

Open cement-head opened 3 years ago

cement-head commented 3 years ago
  1. What's the largest genome that end-users have assembled with RAVEN?
  2. Did you use the GPU version (built for CUDA/GPU)?
  3. What were your options, if any?
  4. How long did it take?
  5. Approximately how big was your computer?

rvaser commented 3 years ago

Here is the preprint: https://www.biorxiv.org/content/10.1101/2020.08.07.242461v1. Note that the benchmark there used version 1.1.10, while versions 1.3.0 and later use far less memory. We should update the preprint soon. Answers:

  1. I think 3 Gbp (haploid), not sure though.
  2. We did not benchmark with CUDA enabled.
  3. No additional options, only number of threads.
  4. Depends on coverage, see preprint.
  5. 1 TB RAM, 128 cores (run on 64 threads).

cement-head commented 3 years ago

Okay, we just assembled a 6.0 Gbp beastie, but RAVEN gave us just over 7.0 Gbp.

It took five days on a machine with 2 TB ECC RAM, 124 threads, and two CUDA GPUs (RTX Titans, used for polishing; -c=100).


Given that the assembly is a little large, I'm wondering if I should change any of these three parameters, and whether or not you'd have some recommendations?

    -m, --match <int>
      default: 3
      score for matching bases
    -n, --mismatch <int>
      default: -5
      score for mismatching bases
    -g, --gap <int>
      default: -4
      gap penalty (must be negative)
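
To make the three scores concrete, here is a minimal Python sketch of how match, mismatch, and gap scores add up over one pairwise alignment. It is an illustration only, not Raven's or Racon's code: the function name `alignment_score` and the toy sequences are made up, and it assumes a simple linear model in which every gap column costs a flat `gap`.

```python
def alignment_score(a, b, match=3, mismatch=-5, gap=-4):
    """Score two already-aligned, equal-length strings; '-' marks a gap.

    Assumption: linear gap model (each gap column adds `gap`).
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # gap column
        elif x == y:
            score += match      # identical bases
        else:
            score += mismatch   # substitution
    return score


# Toy 8-base alignments scored with the default values quoted above.
same = alignment_score("ACGTACGT", "ACGTACGT")
one_mismatch = alignment_score("ACGTACGT", "ACGAACGT")
one_gap = alignment_score("ACGTACGT", "ACG-ACGT")
print(same, one_mismatch, one_gap)
```

With the defaults, replacing one matching column by a mismatch lowers the score by 8 (3 minus -5), and replacing it by a gap column lowers it by 7.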

cement-head commented 3 years ago

Also, would increasing the rounds of polishing (RACON) drastically improve the assembly?

cement-head commented 3 years ago

Okay, a BUSCO analysis gave 0.1% Complete. Something is wrong; would you suggest increasing the mismatch penalty?

rvaser commented 3 years ago

Can you print the assembly statistics (length/#contigs/NX/NGX)? Which sequencing technology are you using? What is the sequencing depth? The BUSCO score is abysmal, but I am not sure that changing alignment parameters will help. Running more than 2 iterations of Racon will not increase the accuracy by much either.

Sorry for my late reply! Best regards, Robert

P.S. You can also paste here the log Raven created.
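
For readers unfamiliar with the NX/NGX statistics requested above, the following is a minimal Python sketch of how they are usually computed from a list of contig lengths. The function name `nx` and the toy lengths are hypothetical; for a real assembly the lengths would come from the FASTA file, and the `genome_size` for NGX would be an independent estimate (about 6.0 Gbp in this thread).

```python
def nx(lengths, x=50, genome_size=None):
    """Return NX, or NGX when genome_size is given.

    NX: length of the contig at which the running sum of contig lengths,
    sorted longest first, reaches x% of the total assembly length.
    NGX: same, but x% of an estimated genome size instead.
    Returns None if the threshold is never reached.
    """
    base = genome_size if genome_size is not None else sum(lengths)
    target = base * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


contigs = [100, 80, 60, 40, 20]               # toy contig lengths
n50 = nx(contigs, 50)                         # threshold: 50% of 300
ng50 = nx(contigs, 50, genome_size=400)       # threshold: 50% of 400
print(n50, ng50)
```

When the assembly is longer than the estimated genome, as here, the NGX threshold is smaller than the NX threshold, so NGX is at least as large as NX.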

cement-head commented 3 years ago

Technology is PacBio SII CLR, with a raw-read N50 of >36 kbp.

The coverage is about 70x.

Q: Would adjusting the -m, -n, -g parameters improve assembly?

What file is the RAVEN logfile?

Here's the QUAST analysis; the # of contigs is good-ish, but the N50 isn't the greatest:

Assembly                    raven_asm 
# contigs (>= 0 bp)         25505     
# contigs (>= 1000 bp)      25505     
# contigs (>= 5000 bp)      25505     
# contigs (>= 10000 bp)     25504     
# contigs (>= 25000 bp)     25504     
# contigs (>= 50000 bp)     25473     
Total length (>= 0 bp)      7048262437
Total length (>= 1000 bp)   7048262437
Total length (>= 5000 bp)   7048262437
Total length (>= 10000 bp)  7048257309
Total length (>= 25000 bp)  7048257309
Total length (>= 50000 bp)  7046876721
# contigs                   25505     
Largest contig              3296975   
Total length                7048262437
GC (%)                      43.05     
N50                         337254    
N75                         208232    
L50                         6350      
L75                         13031     
# N's per 100 kbp           0.00

rvaser commented 3 years ago

The log is written to stderr. I am not sure that changing alignment parameters will help at all. The assembly is quite fragmented, which might be the reason for the poor BUSCO result.
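
Since the log goes to stderr, it has to be redirected to keep it. Below is a minimal Python sketch that runs a command and saves stdout and stderr to separate files. The helper name `run_and_capture` and the file names are hypothetical, and the demo uses a stand-in command so that it runs without Raven installed.

```python
import subprocess
import sys


def run_and_capture(cmd, stdout_path, stderr_path):
    """Run cmd, writing stdout and stderr to separate files; return the exit code."""
    with open(stdout_path, "wb") as out, open(stderr_path, "wb") as err:
        return subprocess.run(cmd, stdout=out, stderr=err).returncode


# Stand-in command (not Raven): prints one line to stdout and one to stderr.
demo = [
    sys.executable,
    "-c",
    "import sys; print('>ctg1'); print('log line', file=sys.stderr)",
]
code = run_and_capture(demo, "demo.fasta", "demo.log")
print(code)
```

For a real run, assuming `raven` is on the PATH and the reads are in a file called `reads.fastq` (a placeholder name), the command list would be something like `["raven", "-t", "124", "reads.fastq"]`, and the second file would then hold the log that can be pasted into the issue.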