ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller

excess memory usage using merged read files #141

Open rspfau opened 4 years ago

rspfau commented 4 years ago

I have two runs of the same Illumina library and merged them using 'cat' prior to running NOVOPlasty. The first run's reads did not assemble a mitogenome, so we sequenced the same library a second time, and that run also failed to assemble a mitogenome. I was hoping that merging the two runs would give sufficient coverage. However, when I run NOVOPlasty on the merged files, the memory usage steadily increases over 1-2 minutes during the 'Retrieve Seed' step until it reaches 100%, and then the process is killed. The log file is empty. Attached are the first 40 lines of each read file, along with the config file, so you can see what the read files look like.

I have 7.7 GiB of memory and have successfully run NOVOPlasty on read files a thousand times larger. The individual read files are 40-45 MB, and the merged files are ~85 MB.

Here's the command I used to merge the files: cat TK24928-2-first.fastq TK24928-2-second.fastq > mergedTK24928-2.fastq

Thanks! Russell

TK24928-1-first-40.fastq.txt TK24928-1-second-40.fastq.txt TK24928-2-first-40.fastq.txt TK24928-2-second-40.fastq.txt config.txt
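
When two paired-end runs are concatenated with 'cat', the forward and reverse files need to be merged in the same run order so that read pairs stay aligned record by record. The following is a minimal Python sketch for checking that, not part of NOVOPlasty. It assumes standard four-line FASTQ records and Illumina-style headers in which the pair identifier is the text before the first space (optionally ending in /1 or /2). The function names are hypothetical.

```python
from itertools import zip_longest


def read_ids(path):
    """Yield the read identifier from each four-line FASTQ record."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:  # only header lines carry the ID
                continue
            fields = line.strip().split()
            if not fields:  # tolerate a trailing blank line
                continue
            read_id = fields[0].lstrip("@")
            # Drop an old-style /1 or /2 mate suffix if present.
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id


def count_mismatched_pairs(forward_path, reverse_path):
    """Count record positions where forward and reverse IDs differ.

    A missing record in either file also counts as a mismatch.
    """
    mismatches = 0
    for fwd, rev in zip_longest(read_ids(forward_path), read_ids(reverse_path)):
        if fwd != rev:
            mismatches += 1
    return mismatches
```

For example, count_mismatched_pairs("mergedTK24928-1.fastq", "mergedTK24928-2.fastq") returns 0 only when both files list the same reads in the same order. Unpaired reads shift every later record, so a large count means the files are out of register rather than that most reads are bad.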

ndierckx commented 4 years ago

Hi,

That must be a bug; the 'Retrieve Seed' step shouldn't take any significant memory.

Did you have this problem with the latest version?

rspfau commented 4 years ago

NOVOPlasty3.8.3.pl

ndierckx commented 4 years ago

Could you try the latest version just to see if the problem is still there?

rspfau commented 4 years ago

Yes, same result:

pfau@tarleton.edu@TSU98054-LX:~/Desktop/second attempt geomys mitogenomes/TK24928/merged 2nd attempt$ perl ../../NOVOPlasty4.0.pl -c config.txt


NOVOPlasty: The Organelle Assembler
Version 4.0
Author: Nicolas Dierckxsens, (c) 2015-2020

Input parameters from the configuration file: Verify if everything is correct

Project:

Project name = TK24928merged
Type = mito
Genome range = 15000-17000
K-mer = 30
Max memory = 3
Extended log = 1
Save assembled reads = yes
Seed Input = ../../Geomys-pinetis-cytb-seed
Extend seed directly = no
Reference sequence =
Variance detection =
Chloroplast sequence =

Dataset 1:

Read Length = 301
Insert size = 412
Platform = illumina
Single/Paired = PE
Combined reads =
Forward reads = mergedTK24928-1.fastq
Reverse reads = mergedTK24928-2.fastq

Heteroplasmy:

Heteroplasmy =
HP exclude list =
PCR-free =

Optional:

Insert size auto = yes
Use Quality Scores =

Reading Input......OK

Building Hash Table......OK

Subsampled fraction: 99.84 %
Forward reads without pair: 466
Reverse reads without pair: 342

Retrieve Seed...

rspfau commented 4 years ago

Here are the complete merged read files https://drive.google.com/file/d/1awqUKHKj_z24W5To_jUE38MmhhuQFNLs/view?usp=sharing

ndierckx commented 4 years ago

I need access to them; I have sent a request.

ndierckx commented 4 years ago

Hi, could you also send the seed you used?

rspfau commented 4 years ago

Yes, here it is

Geomys-pinetis-cytb-seed.txt

rspfau commented 4 years ago

I tried a different seed, which I obtained by doing a local BLAST of the Illumina reads, and it worked without the memory issue, but there is still not enough depth to assemble the mitogenome :(

New seed attached here: TK24928Pfau_2373363-trimmedends.txt

rspfau commented 4 years ago

And I've merged several other read files and haven't had any problems. It was somehow the combination of that particular merged read file and that particular seed file.

ndierckx commented 4 years ago

Hi,

I still had to track down the bug, but it is solved now; I will upload the new version now.

I tried the assembly, and the coverage is indeed too low. I also saw that the reads are heavily trimmed; I would advise against that, as you lose a lot of data that way.

It is also best to lower the k-mer to 21 or so, which works better when coverage is low, but more importantly, do not trim.

And it is best not to use 300 bp Illumina reads, as they have very low quality; best to stick to 250 bp!
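
One way to see how heavily a read file has been trimmed is to tabulate its read lengths and compare them with the k-mer size. The following is a minimal Python sketch under stated assumptions, not part of NOVOPlasty. It assumes standard four-line FASTQ records, and the function names and the threshold are hypothetical.

```python
from collections import Counter


def read_length_counts(path):
    """Return a Counter mapping read length to number of reads."""
    lengths = Counter()
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # sequence line of each record
                lengths[len(line.strip())] += 1
    return lengths


def fraction_shorter_than(lengths, threshold):
    """Return the fraction of reads whose length is below threshold."""
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    short = sum(count for length, count in lengths.items() if length < threshold)
    return short / total
```

For example, fraction_shorter_than(read_length_counts("mergedTK24928-1.fastq"), 100) gives the share of reads trimmed below 100 bp. The 100 bp cut-off here is an arbitrary illustration, not a NOVOPlasty recommendation.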