jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Progress indicator for Merging with minimus2 #91

Closed: olar785 closed this issue 4 years ago

olar785 commented 4 years ago

Hi Javier,

First of all, I just want to say that I love the SqueezeMeta pipeline and the way you can easily access and visualise everything in R afterward. The reason for my message is that I have been running the pipeline with the "merged" co-assembly option on 10 metagenomes, using 16 CPUs and 104 GB of RAM. The "Merging with minimus2" step has been running for 5 days now, and I wish I had an idea of how far along the pipeline is. Is there any way to tell from the profiling.99.delta file how long this step may still need? Any chance of adding a progress bar for this step in the future?

Thanks in advance,
Olivier

p.s. I am running version 1.1.0 of SqueezeMeta, the latest I believe

fpusan commented 4 years ago

Hi,

We are aware of minimus2 being painfully slow. We are searching for faster alternatives, but haven't settled on anything quite yet.

However, you can try the "seqmerge" mode, which is a bit more clever about how it merges the different metagenomes and should take less time.

Finally, with only 10 samples maybe you can try going for a coassembly. It might work with your 104 GB of RAM, and if it fails it will at least fail in the first step, so you don't have to wait long.

Let me know how this works for you,

Fernando

olar785 commented 4 years ago

Awesome, thanks for the quick reply Fernando. From experience, do you have an idea of how much longer the merging may take? A few more days? A week or two? Just wondering if I should stop the process and try the other options you mentioned. Cheers

fpusan commented 4 years ago

It is hard to tell. Minimus2 scales badly with the size of the data. I would honestly go for the other alternatives, starting from the coassembly. @jtamames, what do you think?

jtamames commented 4 years ago

Hello,
With 104 GB of RAM you could probably try coassembly. How big are your metagenomes?

olar785 commented 4 years ago

My fastq files are about 2.5 GB each, so basically 5 GB per sample if considering forward and reverse reads. I'll give coassembly a try 👍🏼

jtamames commented 4 years ago

Ok, let us know the result. Good luck!

olar785 commented 4 years ago

Hi guys, I don't know if I should open another issue for this, but basically, when I replace merged with coassembly I get the following error:

There must be at least one 'pair1' sequence file in your samples file and there is none!

I tried the same command again but with the merged option and I do not have that issue. Any idea what might be going on?

Here is my command: SqueezeMeta.pl -m coassembly -p SM_coassembly_profiling -s samples.txt -f fastq_files/raw_data --cleaning --doublepass
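For reference, the samples file SqueezeMeta expects is plain tab-separated text with one line per fastq file: sample name, file name, and the literal word pair1 or pair2 in the third column. The "pair1" error fires when no line carries that third column exactly, e.g. because of a wrong delimiter or hidden characters. A hypothetical sketch (sample names invented, file names taken from the directory listing later in the thread):

```
Sample1	1-B4-3_S5_L001_R1_001.fastq.gz	pair1
Sample1	1-B4-3_S5_L001_R2_001.fastq.gz	pair2
Sample2	2-B4-4_S7_L001_R1_001.fastq.gz	pair1
Sample2	2-B4-4_S7_L001_R2_001.fastq.gz	pair2
```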

jtamames commented 4 years ago

Could you please paste your samples file?

olar785 commented 4 years ago

Here it is: samples.txt

jtamames commented 4 years ago

Could you please ls -l your <project>/data/raw_fastq directory and paste the result?

olar785 commented 4 years ago

total 51700208
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2647918521 Apr 12 19:04 10-B7-18_S8_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2710123411 Apr 12 19:43 10-B7-18_S8_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 3593647073 Apr 12 20:30 1-B4-3_S5_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 3299216340 Apr 12 18:46 1-B4-3_S5_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2758753577 Apr 12 18:27 2-B4-4_S7_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2781825988 Apr 12 20:31 2-B4-4_S7_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2544251151 Apr 12 19:02 3-B4-5_S9_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2680194062 Apr 12 20:52 3-B4-5_S9_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2245737974 Apr 12 20:04 4-B1-13_S10_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2285033272 Apr 12 19:25 4-B1-13_S10_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2754810860 Apr 12 18:27 5-B1-14_S1_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2820995623 Apr 12 20:51 5-B1-14_S1_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 1974395726 Apr 12 19:29 6-B1-15_S2_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2008823966 Apr 12 19:11 6-B1-15_S2_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2505916283 Apr 12 19:21 7-B8-11_S3_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2253961965 Apr 12 19:56 7-B8-11_S3_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2885015133 Apr 12 19:39 8-B9-11_S4_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2795416300 Apr 12 19:16 8-B9-11_S4_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2674953986 Apr 12 21:03 9-B6-1_S6_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ashitakastudio ashitakastudio 2719901110 Apr 12 18:43 9-B6-1_S6_L001_R2_001.fastq.gz

jtamames commented 4 years ago

Hello,
Could you try running the dos2unix command on your samples file (dos2unix samples.txt) and see if SqueezeMeta works after that?
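If dos2unix is not at hand, the same check and fix can be done with standard shell tools. A generic sketch (the demo file name is invented; point the commands at the real samples.txt instead):

```shell
# Simulate a samples file saved with Windows (CRLF) line endings
printf 'Sample1\tSample1_R1.fastq.gz\tpair1\r\n' > samples_demo.txt

# Make the invisible characters visible: tabs show as ^I, and a
# stray carriage return shows as ^M before the final $
cat -A samples_demo.txt

# Strip the carriage returns in place (same effect as dos2unix)
sed -i 's/\r$//' samples_demo.txt

# No output here means the endings are now plain Unix LF
grep "$(printf '\r')" samples_demo.txt || true
```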

olar785 commented 4 years ago

Success! Thanks for your time guys. I will now test if I have enough RAM for coassembly and let you know how it goes.

olar785 commented 4 years ago

Hi, So unfortunately, 104 GB was not enough for coassembly. I'm working on Google Cloud, so I increased the RAM of my instance to 250 GB, but this time this is the error I got from the megahit log. Any idea what may have caused it?

2020-04-23 01:44:50 - Assemble contigs from SdBG for k = 21
2020-04-23 01:44:50 - command /home/ashitakastudio/miniconda3/envs/SqueezeMeta/SqueezeMeta/bin/megahit/megahit_core assemble -s /home/ashitakastudio/My_project/SM_coassembly_profiling/data/megahit/tmp/k21/21 -o /home/ashitakastudio/My_project/SM_coassembly_profiling/data/megahit/intermediate_contigs/k21 -t 15 --min_standalone 300 --prune_level 2 --merge_len 20 --merge_similar 0.95 --cleaning_rounds 5 --disconnect_ratio 0.1 --low_local_ratio 0.2 --cleaning_rounds 5 --min_depth 2 --bubble_level 2 --max_tip_len -1 --careful_bubble
2020-04-23 01:44:50 - b'INFO main_assemble.cpp : 129 - Loading succinct de Bruijn graph: /home/ashitakastudio/My_project/SM_coassembly_profiling/data/megahit/tmp/k21/21FATAL utils/utils.h : 172 - Invalid format. Expect field k, got'
2020-04-23 01:44:50 - Error occurs, please refer to /home/ashitakastudio/My_project/SM_coassembly_profiling/data/megahit/log for detail
2020-04-23 01:44:50 - Command: /home/ashitakastudio/miniconda3/envs/SqueezeMeta/SqueezeMeta/bin/megahit/megahit_core assemble -s /home/ashitakastudio/My_project/SM_coassembly_profiling/data/megahit/tmp/k21/21 -o /home/ashitakastudio/My_project/SM_coassembly_profiling/data/megahit/intermediate_contigs/k21 -t 15 --min_standalone 300 --prune_level 2 --merge_len 20 --merge_similar 0.95 --cleaning_rounds 5 --disconnect_ratio 0.1 --low_local_ratio 0.2 --cleaning_rounds 5 --min_depth 2 --bubble_level 2 --max_tip_len -1 --careful_bubble; Exit code 1

fpusan commented 4 years ago

Weird, because Megahit was working for you when using the merged mode.

What is the content of the /home/ashitakastudio/My_project/SM_coassembly_profiling/data/megahit/log file?

olar785 commented 4 years ago

log.txt Here it is. It is mostly what I copy/pasted above. I will try to run it again and see if the error comes up again, who knows... However, when I use the restart.pl function, the program basically starts from scratch, performing the cleaning with Trimmomatic again, probably because cleaning and assembling are both part of step 1. Is it possible to skip that, so as to avoid the large files par1.fastq.gz and par2.fastq.gz being re-created?

fpusan commented 4 years ago

For testing purposes you can try to just repeat the megahit step. The command used for launching megahit should be available in the syslog file within the SM_coassembly_profiling project directory.

If you isolate the megahit command that is failing, you can also contact the megahit author at https://github.com/voutcn/megahit.
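Fishing the command out of the syslog can be as simple as a grep. A sketch with an invented one-line stand-in for the real syslog (run the grep against SM_coassembly_profiling/syslog instead):

```shell
# Hypothetical log line standing in for <project>/syslog, only to show the idea
printf 'Now running assembly: megahit -1 reads_1.fq -2 reads_2.fq -o out\n' > syslog_demo

# Recover the exact assembler command line so it can be rerun by hand
grep -o 'megahit.*' syslog_demo
# -> megahit -1 reads_1.fq -2 reads_2.fq -o out
```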

Meanwhile, you can try the seqmerge mode, as megahit was working for you with smaller assemblies.

You can also try using SPAdes instead of megahit (with the -a spades parameter when calling SqueezeMeta). SPAdes is however more memory-hungry than megahit.
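Assuming the original command from earlier in the thread, switching assemblers would look something like this (a sketch; all other parameters left unchanged):

```shell
SqueezeMeta.pl -m coassembly -a spades -p SM_coassembly_profiling -s samples.txt -f fastq_files/raw_data --cleaning --doublepass
```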

olar785 commented 4 years ago

Ok, thanks Fernando. I will just re-test megahit again then. On a final note, do you think 250 GB of RAM is enough for SPAdes if I go that route?

fpusan commented 4 years ago

The problem with assembly is that memory usage depends not only on the number of reads, but also on the complexity of the samples, so it is hard to tell beforehand. No way to know without trying.

olar785 commented 4 years ago

Okay I see. Thank you very much for your help and prompt responses.