jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Input error #743

Closed lam-c closed 6 months ago

lam-c commented 8 months ago

Thank you for bringing us this amazing tool! I tried this pipeline on a multi-domain metagenome dataset (externally assembled), but I got stuck on the first step and couldn't figure out what was going wrong. I look forward to your reply, and I hope the information below is useful.

execution command: $SqueezeMetaPath/scripts/SqueezeMeta.pl -m sequential -s sample_file.txt -f qc_shortreads --nobins --doublepass --euk -t 50

stdout:

***

SqueezeMeta v1.6.2, March 2023 - (c) J. Tamames, F. Puente-Sánchez CNB-CSIC, Madrid, SPAIN

Please cite: Tamames & Puente-Sanchez, Frontiers in Microbiology 9, 3349 (2019). doi: https://doi.org/10.3389/fmicb.2018.03349

Run started Thu Oct 26 14:51:32 2023 in sequential mode
79 metagenomes found: S126 S131 S132 S133 S134 S139 S160 S161 S166 S167 S170 S171 S172 S173 S174 S175 S179 S180 S181 S182 S183 S184 S188 S189 S190 S191 S192 S193 S194 S195 S196 S197 S198 S199 S200 S201 S202 S203 S207 S208 S491 S492 S493 S494 S495 S497 S498 S499 S500 S503 S514 S515 S516 S543 S544 S545 S546 S547 S548 S549 S550 S551 S552 S553 S554 S555 S556 S558 S559 S560 S561 S562 S563 S564 S565 S566 S567 S568 S569

--- SAMPLE S126 ---
Now creating directories
Reading configuration from squeeze_meta/S126/SqueezeMeta_conf.pl
  Running trimmomatic (Bolger et al 2014, Bioinformatics 30(15):2114-20) for quality filtering
  Parameters:
[0 seconds]: STEP1 -> RUNNING ASSEMBLY: 01.run_all_assemblies.pl (megahit)
Assembly not present in squeeze_meta/S126/results/01.S126.fasta, exiting
Stopping in STEP1 -> 01.run_all_assemblies.pl. Program finished abnormally
Died at SqueezeMeta/scripts/SqueezeMeta.pl line 941.

  If you don't know what went wrong or want further advice, please look for similar issues in https://github.com/jtamames/SqueezeMeta/issues
  Feel free to open a new issue if you don't find the answer there. Please add a brief description of the problem and upload the squeeze_meta/S126/syslog file (zip it first)

The external assemblies were produced by megahit and renamed (the description was removed from each header):

>k141_0
TTAGTTAATTACCATTTATTTATTTTTAATTAATTCATGAATTTGTTAATTAATTAGTGA
ATTAATTAACTAATGAACTAATTAAGTAATTAATGCATTAATTCATGAATTAATTTATCA
ATGAACTAATGCACTAACTAACGTTTTATTTCATGATTTAATCAATTAGTTAGTTAAGTA
GTTAATTAATTATTCAATGAAATACTTAAATTAATGCATTTATATATATACATATATATA
TATATTTGTTTATGCATATATATTTTTTTGCATCCAAAAATATATGACTTAATGAATATA
TATATTATATTCGAAGG

sample_file.txt syslog.zip

jtamames commented 8 months ago

Hello! Ok, I found a bug in the script 01.run_all_assemblies.pl causing this error. Please edit the script /media/cy/micromamba/envs/squeeze_meta/SqueezeMeta/scripts/01.run_all_assemblies.pl . In line 70, where it reads:

if($extassemblies{$asamples}) {

change to:

if($extassemblies{$asamples}) { 
    $extassembly=$extassemblies{$asamples};

That should fix the error. Tell me otherwise. Best, J

lam-c commented 8 months ago

Thank you for the quick response. Unfortunately, I got the same error and the same syslog after modifying the Perl script. I wonder whether I should add the 'noassembly' string to the samples file, or whether something goes wrong in the section that reads the samples (lines 40-52). It also looks like the raw reads were not loaded successfully (the raw_fastq folder is empty).

Run started Fri Oct 27 10:20:09 2023 in sequential mode

SqueezeMeta v1.6.2, March 2023 - (c) J. Tamames, F. Puente-Sánchez CNB-CSIC, Madrid, SPAIN

Please cite: Tamames & Puente-Sanchez, Frontiers in Microbiology 10.3389 (2019). doi: https://doi.org/10.3389/fmicb.2018.03349

Run started for squeeze_meta, Fri Oct 27 10:20:09 2023
Project: S126
Map file: sample_file.txt
Fastq directory: squeeze_meta/qc_shortreads
Command: squeeze_meta/SqueezeMeta/scripts/SqueezeMeta.pl -m sequential -s sample_file.txt -f qc_shortreads --nobins --doublepass --euk -t 50
[0 seconds]: STEP0 -> SqueezeMeta.pl
 COGS; KEGG; PFAM; EUKNOFILTER; DOUBLEPASS;

[0 seconds]: STEP1 -> 01.run_all_assemblies.pl (megahit)
Stopping in STEP1 -> 01.run_all_assemblies.pl. Program finished abnormally
_____________

System information:
_____________

Tree for the project:
ic #202212191242 SMP Mon Dec 19 13:25:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
[4.0K Oct 27 10:20]  squeeze_meta/S126
├── [  31 Oct 27 10:20]  creator.txt
├── [4.0K Oct 27 10:20]  data
│   ├── [ 304 Oct 27 10:20]  00.S126.samples
│   └── [4.0K Oct 27 10:20]  raw_fastq
├── [4.0K Oct 27 10:20]  ext_tables
├── [4.0K Oct 27 10:20]  intermediate
│   └── [4.0K Oct 27 10:20]  binners
├── [ 117 Oct 27 10:20]  methods.txt
├── [3.1K Oct 27 10:20]  parameters.pl
├── [  37 Oct 27 10:20]  progress
├── [4.0K Oct 27 10:20]  results
├── [8.3K Oct 27 10:20]  SqueezeMeta_conf.pl
├── [ 998 Oct 27 10:20]  syslog
└── [4.0K Oct 27 10:20]  temp

8 directories, 7 files

jtamames commented 8 months ago

OK, I can reproduce the bug and have found a solution for it. It only happens when you specify extassembly in both the pair1 and pair2 lines of the samples file. When that option is put only in pair1, it works fine. So the first solution would be to change the samples file, removing all "extassembly" entries from the pair2 lines. But of course that is just a patch. To solve the issue, change line 47 in 01.run_all_assemblies.pl, from:

if($mapreq=~/extassembly\=(.*)/) { $extassemblies{$sample}=$1; } #-- Store external assemblies if specified in the samples file

to:

if($mapreq=~/extassembly\=(.*)/) { $extassemblies{$sample}=$1; $datasamples{$sample}{$iden}{$file}=1;} #-- Store external assemblies if specified in the samples file

That should do it. Best, J
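
For the samples-file workaround (dropping "extassembly" from the pair2 lines), a minimal shell sketch follows. It assumes the samples file is tab-separated with the columns sample, file name, pair1/pair2 and an optional fourth field such as extassembly=...; the function name strip_pair2_extassembly is hypothetical and not part of SqueezeMeta.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of SqueezeMeta.
# Assumption: tab-separated samples file with columns
#   sample <TAB> file <TAB> pair1|pair2 <TAB> optional extra field
# Prints the file to stdout with the extassembly field removed from pair2 lines;
# every other line is printed unchanged.
strip_pair2_extassembly() {
    awk 'BEGIN { FS = OFS = "\t" }
         $3 == "pair2" && $4 ~ /^extassembly=/ { print $1, $2, $3; next }
         { print }' "$1"
}
```

It could be used as `strip_pair2_extassembly sample_file.txt > sample_file.fixed.txt`, which leaves the original file untouched so the result can be checked before rerunning the pipeline.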

jtamames commented 8 months ago

Sorry... maybe that fix can create other problems. Do this instead: change line 48, from:

elsif(($mode eq "sequential") && ($sample eq $projectname)) { $datasamples{$sample}{$iden}{$file}=1; }

to:

if(($mode eq "sequential") && ($sample eq $projectname)) { $datasamples{$sample}{$iden}{$file}=1; }
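
One reading of the two quoted lines is that, with elsif, any samples-file line carrying extassembly=... satisfies the first branch and therefore never reaches the branch that registers its read file. The toy model below illustrates only that control flow; it is not the SqueezeMeta code, it ignores the mode and project-name conditions, and the function name register_files and the two-column input format are made up for the illustration.

```shell
#!/usr/bin/env bash
# Toy model of the branch logic, NOT the actual Perl script.
# Input on stdin, one line per read file: "<file> <extra>",
# where <extra> is "extassembly=..." or "-" (format invented for this sketch).
# Prints the files that would be registered as read data.
register_files() {
    variant="$1"   # "elsif" = chained branches (original), "if" = independent branches (patched)
    while read -r file extra; do
        is_ext=0
        case "$extra" in extassembly=*) is_ext=1 ;; esac
        # Chained form: once the extassembly branch matches, registration is skipped.
        if [ "$variant" = "elsif" ] && [ "$is_ext" -eq 1 ]; then
            continue
        fi
        echo "$file"
    done
}
```

Feeding it two lines that both carry extassembly=... prints nothing for the elsif variant and both file names for the if variant, which would be consistent with the empty raw_fastq folder reported earlier.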

lam-c commented 8 months ago

It works! Many thanks for your timely help.

lam-c commented 8 months ago

Hi there! I deployed the newest version of the pipeline on a machine with more computational resources and ran the metagenome data in seqmerge mode (a fresh start from the QC'd reads). However, I got stuck at the assembly-merging step (the syslog and the merged-assemblies log are attached below).

I searched the previous issues and found a similar one (#565), which was posted last year. Is there any solution for it now? I'm not sure whether the machine can handle a coassembly. Do you have any suggestions?

syslog.zip mergedassemblies.SY_BIN.runAmos.log

fpusan commented 8 months ago

It seems like you only have 3 samples, is that right? Seqmerge should be able to deal with that, but if it is still stuck you can try a coassembly. What resources do you have exactly?

lam-c commented 8 months ago

Thank you for your patience. I have 79 samples in total and feed them into SqueezeMeta in 23 groups (each group contains 3-6 samples, so that MAGs can be retrieved from each group), running in parallel.

I switched to coassembly mode after requesting more computational resources (shown below) and limited the run to 15 groups (those with smaller fastq files) at a time. Luckily, the peak RAM usage was less than 2 TB (I'm not sure how to estimate the required RAM).

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              384

$ free -m
              total        used        free      shared  buff/cache   available
Mem:        6190948      273186      208143        4105     5709618     5885262

However, it's odd that 3 samples (~22 GB each for par(1|2).fastq.gz, in the folder output/data/raw_fastq) could not be processed in seqmerge mode (it did work on the test data downloaded along with the databases). Is there anything we can try to fix it, to save time and computational resources in the future?

fpusan commented 8 months ago

To be fair, the individual assemblies are quite big, 5x bigger than those of the test dataset. I strongly suspect that minimus2 scales quadratically, meaning that execution time, and maybe memory usage, would be around 25x higher. You are probably better off trying coassemblies. Otherwise, you can use the sequential mode and combine the results for the different samples later.
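
The 25x figure follows from the suspected quadratic scaling: if cost grows with the square of the input size, a 5x larger input costs 5^2 = 25 times more. A tiny sketch of that arithmetic, where the helper name relative_cost is hypothetical and the exponent 2 is only the suspected behaviour of minimus2, not a measured one:

```shell
#!/usr/bin/env bash
# Hypothetical helper: relative cost when the input grows by `factor`,
# assuming cost is proportional to size^exponent.
# Exponent 2 is the suspected (unconfirmed) case for minimus2.
relative_cost() {
    awk -v f="$1" -v e="$2" 'BEGIN { print f ^ e }'
}

relative_cost 5 2
```

Under a linear assumption (`relative_cost 5 1`) the same input would cost only 5x more, which is why the suspected exponent matters when deciding between seqmerge and coassembly.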

lam-c commented 8 months ago

OK, I will take that advice. Thanks for your time ~

fpusan commented 6 months ago

Closing due to lack of activity, feel free to reopen