morispi / CONSENT

Scalable long read self-correction and assembly polishing with multiple sequence alignment
https://doi.org/10.1038/s41598-020-80757-5
GNU Affero General Public License v3.0

Segmentation fault on long-read correction #4

Closed novikk closed 5 years ago

novikk commented 5 years ago

I'm trying to correct a dataset of real ONT reads and I'm getting a segmentation fault error after the mapping with minimap2:

[irubia@kepler consent]$ /genomics/users/irubia/tools/CONSENT/CONSENT-correct --in /genomics/users/irubia/datasets/ERCC_Mix1_SRR6058582.filtered.fa --out corrected.fa --type ONT
[Wed Feb 13 12:24:42 CET 2019] Self-aligning the long reads (minimap2)
[M::mm_idx_gen::5.006*1.15] collected minimizers
[M::mm_idx_gen::5.682*1.64] sorted minimizers
[M::main::5.682*1.64] loaded/built the index for 116426 target sequence(s)
[M::mm_mapopt_update::6.006*1.60] mid_occ = 473
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 116426
[M::mm_idx_stat::6.203*1.58] distinct minimizers: 8869169 (59.58% are singletons); average occurrences: 2.436; average spacing: 3.014
[M::worker_pipeline::30.410*6.06] mapped 116426 sequences
[M::main] Version: 2.14-r894-dirty
[M::main] CMD: /genomics/users/irubia/tools/CONSENT/minimap2/minimap2 -k15 -w5 -m100 -g10000 -r2000 --max-chain-skip 25 --dual=yes -PD --no-long-join -I100G -t8 /genomics/users/irubia/datasets/ERCC_Mix1_SRR6058582.filtered.fa /genomics/users/irubia/datasets/ERCC_Mix1_SRR6058582.filtered.fa
[M::main] Real time: 30.458 sec; CPU: 184.468 sec; Peak RSS: 1.088 GB
[Wed Feb 13 12:25:14 CET 2019] Correcting the long reads
/genomics/users/irubia/tools/CONSENT/CONSENT-correct: line 173:  9156 Segmentation fault      (core dumped) $LRSCf/bin/CONSENT -a $tmpdir/"$alignments" -s "$minSupport" -S "$maxSupport" -l "$windowSize" -k "$merSize" -c "$commonKMers" -A "$minAnchors" -f "$solid" -m "$windowOverlap" -j "$nproc" -r "$reads" -M "$maxMSA" -p "$LRSCf" >> "$out"

I've tried on two different datasets and I'm getting the same error.

OS info:

[irubia@kepler consent]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:    7.4.1708
Codename:   Core

[irubia@kepler consent]$ g++ --version
g++ (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[irubia@kepler consent]$ python --version
Python 3.6.2

morispi commented 5 years ago

Hey,

Segmentation faults seem to be pretty dataset-dependent. I've never encountered one (that I did not fix) in all the experiments I've run (incl. human) so far.

A few questions so I can help you further:

1) Does CONSENT still correct a few long reads and then crash, or does it fail to perform any correction at all?

2) Have you tried CONSENT on any other dataset? Any small dataset (at least 10x coverage) from any bacterial genome would do; that would just tell us whether the error appears regardless of what you attempt to correct.

3) Is your data public? I found this link (https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR6058582) by Googling the accession ID of your LR file, but there is no FASTA file available for download. If it is public, and if you can provide me with a link to download it, I'd be glad to run CONSENT on your dataset and track down the segfault.

Cheers, Pierre

godkin1211 commented 5 years ago

I encountered this problem yesterday, too. My data are full-length 16S nanopore sequencing reads. Because I already have a PAF file, I ran the CONSENT command directly:

$ CONSENT -a Alignments_32234.paf -s 4 -S 1000 -l 500 -k 9 -c 8 -A 2 -f 4 -m 50 -j 20 -r input.fa -M 150 >> output.fa
Segmentation fault (core dumped)

morispi commented 5 years ago

Hi,

I have the same 3 questions as just above, please.

Just knowing that there's a segfault somewhere doesn't help me much if I can't reproduce it to see where it comes from.

Pierre

novikk commented 5 years ago

@morispi

  1. It doesn't correct any read at all.
  2. I've tried on two datasets, but both are RNA (changing U->T) or cDNA, and neither works. Is this tool only meant to work with DNA data?
  3. The data is public, you can download it from https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR6058582 using the SRA toolkit.

Hope this helps!

morispi commented 5 years ago

@novikk

Great! Thanks for the answers and for the link to the dataset.

CONSENT was indeed designed for DNA reads, but I don't see any reason for it to crash on RNA if you switch U to T. Actually, I tried it myself this morning on a tiny dataset containing Us, and all went well.

Downloading the data and investigating the issue later tonight. I'll keep you updated.

Cheers P

morispi commented 5 years ago

@novikk Just checked your data. The problem comes from the fact that your long-read headers contain spaces. Changing the spaces to underscores does the trick for me.
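For readers hitting the same crash, here is a minimal sketch of the header fix described above, replacing whitespace in FASTA headers with underscores. The function names and the example header are hypothetical and not part of CONSENT; the sketch only assumes a plain, uncompressed FASTA file.

```python
import re


def sanitize_header(line):
    """Return a FASTA header line with each run of whitespace replaced by '_'."""
    name = line[1:].strip()  # drop the leading '>' and surrounding whitespace
    return ">" + re.sub(r"\s+", "_", name) + "\n"


def sanitize_fasta(in_path, out_path):
    """Copy in_path to out_path with sanitized headers.

    Returns the number of header lines that were actually modified.
    Both paths are hypothetical; sequence lines are copied unchanged.
    """
    changed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fixed = sanitize_header(line)
                if fixed.rstrip("\n") != line.rstrip("\r\n"):
                    changed += 1
                line = fixed
            dst.write(line)
    return changed
```

A non-zero return value from `sanitize_fasta` indicates that at least one header contained whitespace. Since the read names change, the minimap2 self-alignment would presumably need to be rerun on the renamed file.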

@godkin1211 Maybe it's the same thing for you? If the headers in the original reads file contain spaces, that would explain the problem. Since you already have the PAF file, however, you should simply trim everything that follows the first whitespace (sed 's/ .*//g') so that CONSENT can work.
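When a PAF file already exists, one way to check whether trimming is needed is to compare the names in the PAF with the FASTA headers. The sketch below relies on two facts about the formats: PAF stores the query name in column 1 and the target name in column 6, and minimap2 keeps only the part of a read name before the first whitespace. How CONSENT itself matches names is an assumption here; the thread only establishes that spaces in headers trigger the crash. All function names are hypothetical.

```python
def trim_header(line):
    """Keep only the first whitespace-delimited token of a FASTA header line."""
    fields = line[1:].split()
    return ">" + (fields[0] if fields else "") + "\n"


def paf_names(paf_lines):
    """Collect query (column 1) and target (column 6) names from PAF lines."""
    names = set()
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 6:
            names.add(cols[0])
            names.add(cols[5])
    return names


def missing_from_fasta(paf_lines, fasta_lines):
    """Return PAF names that do not equal any full FASTA header."""
    fasta_ids = {
        line[1:].rstrip("\r\n") for line in fasta_lines if line.startswith(">")
    }
    return paf_names(paf_lines) - fasta_ids
```

If `missing_from_fasta` returns a non-empty set for the original reads and an empty set after applying `trim_header` to every header line, the trimmed file is consistent with the existing PAF and the alignment step does not need to be repeated.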

@novikk However, it seems like your reads have a mean length of 152bp? You should check CONSENT parameters and adapt them, as they are meant to be used for much longer reads, that are divided into 500bp windows. Windows longer than the actual reads might cause further issues :p
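To see whether the window size is a concern for a given dataset, the read lengths can be summarized before running the correction. This is a rough sketch with hypothetical function names; the default of 500 mirrors the 500bp window mentioned above, and what window size to pick instead is left to the user.

```python
def read_lengths(lines):
    """Return the length of each sequence in FASTA lines (multi-line records allowed)."""
    lengths = []
    current = None  # None until the first header is seen
    for line in lines:
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths, window_size=500):
    """Report read count, mean length, and how many reads are shorter than the window."""
    n = len(lengths)
    mean = sum(lengths) / n if n else 0.0
    shorter = sum(1 for x in lengths if x < window_size)
    return {"reads": n, "mean_length": mean, "shorter_than_window": shorter}
```

For example, `summarize(read_lengths(open("reads.fa")))` (with a hypothetical file name) shows at a glance whether most reads are shorter than the window, in which case the window size passed via `-l` would be worth revisiting.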

Cheers P

godkin1211 commented 5 years ago

Thanks, @morispi! It works after replacing the spaces in the headers.

morispi commented 5 years ago

@godkin1211 Great!

morispi commented 5 years ago

@novikk

Did you manage to run CONSENT in the end? Did you also check my comment about the parameters above?

Waiting on your answer to close the issue. :)

Cheers Pierre

novikk commented 5 years ago

Hi @morispi, will check it ASAP, probably tomorrow!

novikk commented 5 years ago

Worked fine after renaming the headers of the FASTA and tuning the "windowSize" parameter.

Thanks!