max memory parameter ignored

LorenaDerezanin commented 4 years ago

Hi Nicolas, I'm running your latest NOVOplasty4.0 on downsampled interleaved reads from WGS data, trying to assemble the tayra mitogenome (the tayra is a kind of tropical weasel). The reads have only been trimmed for adapters. I'm using the mitogenome sequence of a closely related species, the domestic ferret, as the seed. I specified 150 GB as the memory limit, but usage has already passed 204 GB while still at the 'Building Hash Table' stage. The gzipped fastq read file (tayra_dwnsmp25_intlvd.fq.gz) is 28 GB.

I previously used NOVOplasty3.7 to assemble the meerkat mitogenome and it worked like a charm. In that case, though, I didn't use the max memory option, the k-mer size was 39, and the forward and reverse read files were each ~20 GB (fq.gz).

Here is the console output for the NOVOplasty4.0 tayra mitogenome run:

Input parameters from the configuration file:   *** Verify if everything is correct ***

Project:
-----------------------
Project name          = tayra1
Type                  = mito
Genome range          = 12000-22000
K-mer                 = 33
Max memory            = 150
Extended log          = 
Save assembled reads  = no
Seed Input            = dom_fer_mitogenome.fa
Extend seed directly  = no
Reference sequence    = 
Variance detection    = 
Chloroplast sequence  = 

Dataset 1:
-----------------------
Read Length           = 151
Insert size           = 350
Platform              = illumina
Single/Paired         = PE
Combined reads        = tayra_dwnsmp25_intlvd.fq.gz
Forward reads         = 
Reverse reads         = 

Heteroplasmy:
-----------------------
Heteroplasmy          = 
HP exclude list       = 
PCR-free              = no

Optional:
-----------------------
Insert size auto      = yes
Use Quality Scores    = 

Reading Input......OK

Building Hash Table...

What do you think I might be doing wrong?

Thank you in advance! Lorena

ndierckx commented 4 years ago

Hi,

The max memory option should not be used when you use combined files. I didn't think about that because I thought everybody has interleaved files. Don't you have the forward and reverse files separately?
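For anyone who only has an interleaved file and wants separate forward and reverse files, the split can be sketched in a few lines of Python. This is a minimal sketch, not part of NOVOPlasty: the function name `split_interleaved_fastq` is hypothetical, and it assumes gzipped input, strict 4-line FASTQ records, and mates that strictly alternate (forward, reverse, forward, reverse).

```python
import gzip
from itertools import islice


def split_interleaved_fastq(combined_path, forward_path, reverse_path):
    """Split an interleaved, gzipped FASTQ file into two gzipped files.

    Assumes every record is exactly 4 lines and that mates alternate
    (record 1 = forward, record 2 = reverse, and so on).
    Returns the number of read pairs written.
    """
    pairs = 0
    with gzip.open(combined_path, "rt") as src, \
            gzip.open(forward_path, "wt") as fwd, \
            gzip.open(reverse_path, "wt") as rev:
        while True:
            forward_record = list(islice(src, 4))
            if not forward_record:
                break  # clean end of file
            reverse_record = list(islice(src, 4))
            if len(forward_record) != 4 or len(reverse_record) != 4:
                raise ValueError("truncated record or unpaired read at end of file")
            fwd.writelines(forward_record)
            rev.writelines(reverse_record)
            pairs += 1
    return pairs
```

The two output paths can then be given as 'Forward reads' and 'Reverse reads', with 'Combined reads' left empty. The file is streamed record by record, so memory use stays small regardless of input size.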

LorenaDerezanin commented 4 years ago

Hi, thank you for clearing this up, I wasn't aware that the max memory parameter doesn't work with combined reads. I had passed a file of downsampled interleaved reads under the 'Combined reads' parameter, assuming this would be fine. I was hoping to speed up the process by using a downsampled interleaved read set instead of the whole WGS data set, which is ~120 GB (gzipped).

As I do have the forward and reverse reads in separate files, I ran a new round of NOVOplasty runs, this time with the full (not downsampled) set of forward and reverse reads passed as separate files. I set max memory to 128 GB and ran 2 jobs, each with a different seed input in its config file: a wolverine reference and, once again, the domestic ferret reference, with all other parameters the same. For some unknown reason, the run with the domestic ferret as the seed failed to produce a circular assembly. But here is the output of the successful run with the wolverine reference:

Project:
----------------------
Project name          = tayra2
Type                  = mito
Genome range          = 12000-22000
K-mer                 = 33
Max memory            = 128
Extended log          = 1
Save assembled reads  = no
Seed Input            = wolverine_mitogenome.fa
Extend seed directly  = no
Reference sequence    = 
Variance detection    = 
Chloroplast sequence  = 

Dataset 1:
----------------------
Read Length           = 151
Insert size           = 350
Platform              = illumina
Single/Paired         = PE
Combined reads        = 
Forward reads         = reads/adapt_trimmed_R1_val_1.fq.gz
Reverse reads         = reads/adapt_trimmed_R2_val_2.fq.gz

Heteroplasmy:
-----------------------
Heteroplasmy          = 
HP exclude list       = 
PCR-free              = no

Optional:
----------------------
Insert size auto      = yes
Use Quality Scores    = 

Subsampled fraction: 12.22 %

Retrieve Seed...

Initial read retrieved successfully: AGCTTATTAAATTAAAGCAAGGCACTGAAAATGCCTAGAAGAGCCATCAGGCTCCATAAACACAAAGGTTTGGTCCTGGCCTTCCTATTAATTATTAACAGAATTACACATGCAAGTCTCCGCACCCCGGTGAAAATGCCCTCTAAATCC

Start Assembly...

-----------------Assembly 1 finished successfully: The genome has been circularized-----------------

Contig 1                  : 16559 bp

Total contigs              : 1
Largest contig             : 16559 bp
Smallest contig            : 16559 bp
Average insert size        : 395 bp

-----------------------------------------Input data metrics-----------------------------------------

Total reads                : 210617062
Aligned reads              : 78246
Assembled reads            : 52988
Organelle genome %         : 0.04 %
Average organelle coverage : 713
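
As a back-of-envelope check, the reported metrics above are consistent with simple arithmetic on the aligned read count. This is a sketch under that assumption only; the thread does not say how NOVOPlasty computes these values, and the function names are hypothetical.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Average depth if n_reads reads of read_length bp cover genome_size bp."""
    return n_reads * read_length / genome_size


def organelle_percent(aligned_reads, total_reads):
    """Share of all reads that aligned to the organelle genome, in percent."""
    return 100 * aligned_reads / total_reads


# Values copied from the run output above.
coverage = mean_coverage(n_reads=78246, read_length=151, genome_size=16559)
percent = organelle_percent(aligned_reads=78246, total_reads=210617062)
print(f"{coverage:.1f}x, {percent:.2f} %")  # prints "713.5x, 0.04 %"
```

Both figures agree with the report (713 and 0.04 %), which suggests, but does not confirm, that the coverage is derived from the aligned reads rather than the assembled reads.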

I've called the consensus, checked it, and aligned the sequence to related species, including the ones used as seed sequences in the runs; all looks good.

Thanks again.

ndierckx commented 4 years ago

Ok that's great. Using a complete genome as seed is not the best option; it's better to use a small region like COI, because it will just use the first 100 or 200 bp to seed. The species won't matter, but it does matter where in the genome it starts. For chloroplasts I added a seed that works for most species, but I didn't for mitochondria. You could use a close genome as reference; it can help with merging contigs in some cases. But since it is completely assembled, you don't need to bother in this case.

LorenaDerezanin commented 4 years ago

I gave it another try and repeated the failed runs. As you suggested, this time instead of using the complete mitogenomes of related species as seeds, I extracted the CYTB CDS region from the domestic ferret mitogenome and COX1 (COI) from sable. Both runs finished successfully with circularized genomes whose sequence is identical to the one previously obtained with the complete mitogenome seed.
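
One way to cut such a seed region out of a complete mitogenome FASTA is sketched below. This is a minimal sketch, assuming a single-record FASTA and a gene on the forward strand that does not span the origin; the function names are hypothetical, and the coordinates must come from the reference's own annotation.

```python
def read_first_fasta_record(path):
    """Return (header, sequence) for the first record of a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # stop at the second record
                header = line[1:]
            elif header is not None:
                chunks.append(line)
    if header is None:
        raise ValueError(f"no FASTA header found in {path}")
    return header, "".join(chunks)


def extract_seed(in_path, out_path, start, end, line_width=70):
    """Write positions start..end (1-based, inclusive) to a new FASTA file."""
    header, sequence = read_first_fasta_record(in_path)
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    region = sequence[start - 1:end]
    with open(out_path, "w") as out:
        out.write(f">seed_{start}_{end} {header}\n")
        for i in range(0, len(region), line_width):
            out.write(region[i:i + line_width] + "\n")
    return region
```

The resulting file can then be given as 'Seed Input' in place of the complete mitogenome.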

Thank you once again for helping out!

ndierckx / NOVOPlasty

max memory parameter ignored #143