Increase assembly size - Githubissues

CIWa commented 3 years ago

Hi,

I have run flye 2.8-b1674 on a 6 Gb, diploid genome with presumably high amounts of repeats. The coverage of my PacBio data is approximately 50 x. As a result the final assembly is 3.9 Gb in size. I would be interested if anyone has experience with improving the assembly size and which parameters to tweak. My call was: python3 flye --pacbio-raw subreads.fa -o . -t 120

Best wishes, Isabel

mikolmogorov commented 3 years ago

Hi Isabel,

This is not uncommon to see in genomes with high repeat ratio and/or high heterozygousity. You can run BUSCO analysis to check if the coding regions were properly reconstructed. If can submit the flye.log file, I can look through and check if there are any possible optimizations.

Mikhail

CIWa commented 3 years ago

Hi Mikhail,

Thank you for your help. Yes, I was assuming that the coding regions most likely will be relatively well assembled, however, I was hoping to recover as much of the repetitive regions as possible. Therefore thanks a lot for your offer, I have just send you the log-file via mail.

Best wishes, Isabel

mikolmogorov commented 3 years ago

@CIWa thanks for sending the log!

It looks like your genome is diploid and has high heterozygosity rate. Is 6Gb a haploid or diploid genome size estimate? Currently, Flye tends to collapse haplotypes within the regions that only differ by SNPs, which may explain the overall smaller assembly. This also explains relatively low contigs N50 (e.g. for relatively haploid genomes with that coverage you should get N50 well over >10Mb).

I have two suggestions which may or may not help.. First, try run with -m 10000 parameter to increase minimum overlap. Also, try adding -m 10000 --meta, since metagenome mode sometimes helps in diploid assembly.

CIWa commented 3 years ago

Hi Mikhail,

Thanks a lot for your suggestions, much appreciated! I will try both of them in my next run. The genome size estimate is for a haploid genome. But it might contain well over 60 % repeats, which is why I do not expect to recover the whole genome - but the more, the better it of course would be :)

Thanks again, Isabel

jdmontenegro commented 3 years ago

Hi @CIWa I have seen this kind of behavior before in a much smaller (2.1 Gb), but equally repetitive genome (~60%). Interestingly, the repeats were mostly complex, multi-level nested SSR and we were only able to solve these using CCS reads (currently HiFi) with either Hi-Canu or HiFiasm which are able to separate nearly identical mosaik and tandem repeats in the data. Using CLR reads it was not possible to differentiate between true inter-repeat variants and sequencing errors and thus, all assemblers would collapse the repeats into single contigs.

I was actually looking forward to using Mosaik-flye which was announced last year, to tackle at least part of this issue, but I think they are still working on the code, because I have not seen any update in the last couple of months. So, HiFi, might be the way to go here.

Cheers,

Juan D.

mikolmogorov commented 3 years ago

@jdmontenegro - thanks for your input!

@CIWa - I am assuming that your questions are now answered, so I'm closing the issue. Feel free to reopen / post if you have any follow-ups!

CIWa commented 3 years ago

Hi Both,

Thanks for the suggestions. I'm still running the assembly with the suggested added options -m 10000 --meta, however, I am running into an error that does not seem to have been reported yet:

My call: python3 ~/programs/Flye/bin/flye --pacbio-raw ~/pacbio_cellscombined.subreads.fa.gz -o . -t 40 -m 10000 --meta --resume-from consensus

The error:

[2020-12-10 12:22:42] INFO: Running Minimap2
[2020-12-12 13:44:09] ERROR: Error running minimap2, terminating. See the alignment error log  for details: flye10000/10-consensus/minimap.stderr
[2020-12-12 13:44:09] ERROR: Command '['/bin/bash', '-c', 'set -o pipefail; flye-minimap2 flye10000/10-consensus/chunks.fasta pacbio_cellscombined.subreads.fa.gz -x map-pb -t 40 -a -p 0.5 -N 10 --sam-hit-only -L -Q --secondary-seq | flye-samtools view -T flye10000/10-consensus/chunks.fasta -u - | flye-samtools sort -T flye10000/10-consensus/sort_201210_122242 -O bam -@ 4 -l 1 -m 4G']' returned non-zero exit status 139.
[2020-12-12 13:44:09] ERROR: Pipeline aborted

minimap.stderr ended with

[bam_sort_core] merging from 588 files and 4 in-memory blocks...
/bin/bash: line 1: 23833 Segmentation fault      (core dumped) flye-minimap2 flye10000/10-consensus/chunks.fasta pacbio_cellscombined.subreads.fa.gz -x map-pb -t 40 -a -p 0.5 -N 10 --sam-hit-only -L -Q --secondary-seq
     23834 Done                    | flye-samtools view -T flye10000/10-consensus/chunks.fasta -u -
     23835 Done                    | flye-samtools sort -T flye10000/10-consensus/sort_201210_122242 -O bam -@ 4 -l 1 -m 4G

My output at that stage is a 569 G minimap.bam-file in the 10-consensus folder, therefore it seems like minimap2 did indeed finish the job in some way, whether completely successful is of course doubtful though.

Best wishes, Isabel

mikolmogorov commented 3 years ago

@CIWa

Looks like a know problem with minimap2 crash. See #291 and #152 for some info. The best advice I currently have is to make sure you have sufficient disk space, remove any temporary bam files from 10-consensus and try again with smaller number of threads. Some people also reported that the issue might be linked to some HPC cluster environments, e.g. Slurm.

CIWa commented 3 years ago

Hi Mikhail,

many thanks for the help. Unfortunately, thought, the error persisted. I'll continue now with the assembly I already have :)

Best wishes, Isabel

mikolmogorov / Flye

Increase assembly size #308