sortmerna / sortmerna

SortMeRNA: next-generation sequence filtering and alignment tool
https://sortmerna.readthedocs.io
GNU General Public License v3.0

Segmentation fault sortmerna #250

Closed george-weingart closed 3 years ago

george-weingart commented 4 years ago

Hi, I am trying to run sortmerna against a file of 4 million short reads. I want to use all 8 databases you provide, so I concatenated the FASTA files and passed the result with the --ref parameter. But obviously I am doing something wrong. Here is the log. Thanks!

[process:1369] === Options processing starts ... ===

Found value: sortmerna
Found flag: --ref
Found value: /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Databases/DBsortmerna/DBsortmerna.fasta of previous flag: --ref
Found flag: --reads
Found value: /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Hu90Mi10/human_microbial_mixed.fastq of previous flag: --reads
Found flag: -workdir
Found value: /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir of previous flag: -workdir
Found flag: -fastx
[opt_workdir:1066] Using WORKDIR: ["/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir"] as specified
[process:1453] Processing option: fastx with value:
[process:1453] Processing option: reads with value: /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Hu90Mi10/human_microbial_mixed.fastq
[opt_reads:73] Processing reads file [1] out of total [1] files
[process:1453] Processing option: ref with value: /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Databases/DBsortmerna/DBsortmerna.fasta
[opt_ref:166] Processing reference [1] out of total [1] references
[opt_ref:220] File ["/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Databases/DBsortmerna/DBsortmerna.fasta"] exists and is readable

[process:1473] === Options processing done ===

[validate_kvdbdir:1252] Key-value DB location "/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir/kvdb"
[validate_kvdbdir:1288] Creating KVDB directory: "/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir/kvdb"
[validate_aligned_pfx:1307] Checking output directory: "/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir/out"

Program: SortMeRNA version 4.2.0
Copyright: 2016-2020 Clarity Genomics BVBA: Turnhoutseweg 30, 2340 Beerse, Belgium
2014-2016 Knight Lab: Department of Pediatrics, UCSD, La Jolla
2012-2014 Bonsai Bioinformatics Research Group: LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
Contributors: Jenya Kopylova jenya.kopylov@gmail.com
Laurent Noé laurent.noe@lifl.fr
Pierre Pericard pierre.pericard@lifl.fr
Daniel McDonald wasade@gmail.com
Mikaël Salson mikael.salson@lifl.fr
Hélène Touzet helene.touzet@lifl.fr
Rob Knight robknight@ucsd.edu

[main:63] Running command: sortmerna --ref /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Databases/DBsortmerna/DBsortmerna.fasta --reads /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Hu90Mi10/human_microbial_mixed.fastq -workdir /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir -fastx
[calculate:107] Starting statistics calculation on file: '/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/Hu90Mi10/human_microbial_mixed.fastq' ...
[calculate:225] Done statistics on file. Elapsed time: 1.83 sec. all_reads_count= 4000000
[store_to_db:421] Stored Reads statistics to DB: min_read_len= 100 max_read_len= 100 all_reads_count= 4000000 all_reads_len= 400000000 total_reads_mapped= 0 total_reads_mapped_cov= 0 reads_matched_per_db= TODO is_total_reads_mapped_cov= 0 is_stats_calc= 0

[init:101] Testing file: "/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir/out/aligned.fastq"
[init:218] Testing file: "/n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir/out/aligned.log"
Please call the function ssw_init before ssw_align.
Please call the function ssw_init before ssw_align.
./run_sortmerna_2.sh: line 9: 4072 Segmentation fault sortmerna --ref ${DBRoot}/${DB1} --reads ${Reads} -workdir ${Workdir} -fastx

biocodz commented 4 years ago

What are the contents of WORKDIR/idx/? For example, the output of:

ls -lrt /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir/idx

george-weingart commented 4 years ago

Hello, thanks for your response. Before the run, the work directory is empty. After the run, idx looks as follows:

(sortmerna) [george.weingart@hutlab12 WorkdirDBsortmerna]$ ls -lrt idx
total 1024840
-rw-rw-r-- 1 george.weingart huttenhower_lab 1048576 Jun 23 12:30 17960458063885962820.kmer_0.dat
-rw-rw-r-- 1 george.weingart huttenhower_lab 408027348 Jun 23 12:30 17960458063885962820.bursttrie_0.dat
-rw-rw-r-- 1 george.weingart huttenhower_lab 637598112 Jun 23 12:30 17960458063885962820.pos_0.dat
-rw-rw-r-- 1 george.weingart huttenhower_lab 2750216 Jun 23 12:30 17960458063885962820.stats
(sortmerna) [george.weingart@hutlab12 WorkdirDBsortmerna]$

biocodz commented 4 years ago

I cannot see anything wrong in the trace up until the Segfault. What is the output of

which sortmerna | xargs file

george-weingart commented 4 years ago

(sortmerna) [george.weingart@hutlab12 run_sortmerna]$ module load python/3.7.7-fasrc01

The following have been reloaded with a version change: 1) python/2.7.14-fasrc01 => python/3.7.7-fasrc01

(sortmerna) [george.weingart@hutlab12 run_sortmerna]$ conda activate sortmerna
(sortmerna) [george.weingart@hutlab12 run_sortmerna]$ which sortmerna | xargs file
/n/home09/george.weingart/.conda/envs/sortmerna/bin/sortmerna: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, BuildID[sha1]=b735e386d996ca3b486277f157ec7a6f46994008, not stripped
(sortmerna) [george.weingart@hutlab12 run_sortmerna]

biocodz commented 4 years ago

I still see nothing wrong. Could you please run the command that your run_sortmerna_2.sh script generates directly, without the script (to ensure the script doesn't drag anything extraneous into the process space)? You can copy the command from the trace right after [main:63] Running command. Verify that the command is a well-formed single line. You first need to remove the kvdb folder, i.e. rm -rf /n/holystore01/LABS/huttenhower_lab/Users/george.weingart/KneadData_eval/sortmerna/run_sortmerna/Workdir/kvdb , and then run the command without the script.
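
The clean-up step described above can be sketched as a small shell helper. This is only a sketch, not part of SortMeRNA: `reset_kvdb` is a hypothetical function name, and the `kvdb` subdirectory name is taken from the log earlier in this thread. It removes only the key-value store and leaves `idx` and `out` in place.

```shell
# Hypothetical helper (not part of SortMeRNA): remove the key-value DB
# that a previous run left inside the work directory.
reset_kvdb() {
    workdir_to_clean="$1"
    # Refuse an empty argument so the path can never expand to "/kvdb".
    if [ -z "$workdir_to_clean" ]; then
        echo "usage: reset_kvdb WORKDIR" >&2
        return 1
    fi
    rm -rf -- "${workdir_to_clean}/kvdb"
}

# Usage (placeholder path):
#   reset_kvdb /path/to/Workdir
# then paste the single-line command copied from the "[main:63] Running command" trace.
```

Guarding against an empty argument is a deliberate choice here, since `rm -rf` on a badly expanded path is hard to undo.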

george-weingart commented 4 years ago

Hello! First of all, thank you so much for your prompt responses!

I ran it as you requested and still got the segmentation fault.

I don't want to go off on a tangent, but I would like to note our objective:

Take a file of 4 million short reads (SRs) and remove from it all the SRs that align with your 8 databases. Is there a way to do it? It seems that -fastx provides the aligned short reads, but we want the unaligned ones. Perhaps I am doing this exercise wrong? I am concatenating all the FASTA files you provide and passing them as one file in the --ref parameter.

Going back to our problem: I am enclosing 3 files:

  1. sortmerna_DBsortmerna.log: Segmentation log file
  2. run_sortmerna_3.log: This is a successful run: Input=Our 4 million SRs, -ref = bac-16s-id90.fasta (One of the 8 files you provide)
  3. DBSortmerna.txt : Showing the commands I entered

Thanks! George Weingart PhD Huttenhower Lab Biostatistics Department Harvard School of Public Health

Files Enclosed

DBSortmerna.txt run_sortmerna_3.log sortmerna_DBsortmerna.log

biocodz commented 4 years ago

The '-other' option will write non-aligned reads to a separate file or files. In similar circumstances I often recommend first trying the program in the simplest setup: a single reference file (one of the 8) and a single reads file containing a small number of reads, say 5K to 1M, run with the default options, e.g.

sortmerna -ref a_ref.fasta -reads a_reads_file.fq -v -threads 10

This is just to make sure the program works and to gain confidence using it. The setup can then be made more complex.
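
One way to build the small test input suggested above is to copy the first N records of the FASTQ file. The following is a rough sketch, assuming the conventional layout of four lines per record with no wrapped sequence lines. `head_fastq` and the file names are hypothetical, and the first N reads are not a random sample of the file.

```shell
# Hypothetical helper: copy the first N reads of a FASTQ file.
# Assumes exactly 4 lines per record (header, sequence, '+', qualities).
head_fastq() {
    in_fq="$1"
    n_reads="$2"
    out_fq="$3"
    head -n "$((n_reads * 4))" "$in_fq" > "$out_fq"
}

# Usage (placeholder file names), followed by the command from the comment above:
#   head_fastq human_microbial_mixed.fastq 5000 a_reads_file.fq
#   sortmerna -ref a_ref.fasta -reads a_reads_file.fq -v -threads 10
```

If the unaligned reads are what is wanted, the '-other' option mentioned above can be added to the second command once the simple run succeeds.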

george-weingart commented 4 years ago

Thanks for the "other" parameter!

But what about the segmentation fault? To summarize our status: the input is a file of 4 million short reads. The run using one of the 8 FASTA files you provide works (see run_sortmerna_3.log). The run against the concatenation of the 8 FASTA files you provide gets a segmentation fault (see sortmerna_DBsortmerna.log). Where do we go from here? Thanks!

biocodz commented 4 years ago

I would now try 2 refs: -ref ref_1 -ref ref_2 -reads reads_1 -threads 10. The current SortMeRNA release doesn't scale well with too many threads (see issue 231), so you might get a better runtime with 10 threads. I have never tried concatenating the references, although I don't see why it wouldn't work. I'll test this and let you know. In any case, you can always specify the references using multiple -ref options.
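
Passing each database through its own --ref option, as suggested above, can be scripted rather than typed out. This is a sketch under stated assumptions: `run_with_refs` and the `SORTMERNA` override variable are hypothetical names introduced here, and the flags are the ones that appear in the command line logged earlier in this thread.

```shell
# Hypothetical wrapper: pass every *.fasta file in a directory as its own --ref.
# SORTMERNA is an assumed override variable; set SORTMERNA=echo for a dry run.
run_with_refs() {
    db_dir="$1"
    reads="$2"
    workdir="$3"
    set --    # reuse the positional parameters as the argument list
    for f in "$db_dir"/*.fasta; do
        [ -e "$f" ] || continue    # skip the unexpanded pattern if nothing matches
        set -- "$@" --ref "$f"
    done
    if [ "$#" -eq 0 ]; then
        echo "no .fasta files found in $db_dir" >&2
        return 1
    fi
    "${SORTMERNA:-sortmerna}" "$@" --reads "$reads" -workdir "$workdir" -fastx
}

# Dry run (placeholder paths), printing the command instead of executing it:
#   SORTMERNA=echo run_with_refs /path/to/Databases reads.fastq /path/to/Workdir
```

Building the list with `set --` keeps paths containing spaces intact, and starting with two references, as recommended above, only requires pointing `db_dir` at a directory holding two files.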

biocodz commented 3 years ago

Please use the latest release, 4.3.4.