Closed seppedm closed 3 years ago
is there one of the assemblers that you would recommend over the other (first of all with regards to quality) and why?
Not really, we routinely use megahit due to its lower memory consumption, but we've gotten good results with SPAdes too. Personally, I don't have a formed opinion on which of the two might be better (it seems that the rnaSPAdes mode can be interesting in your case but I have no experience using it).
Second question: It seems it is not possible to run SPAdes in the rnaSPAdes modus, is that correct?
Maybe you can. The -assembly_options
when calling SqueezeMeta allows you to pass extra arguments to the assembler. So I think that something along the lines of -a spades -assembly_options "--rna"
should make it run RNAspades.
Thank you for your quick response! I've tried the option you proposed, but since in the 01.run_assembly_merged.pl script is specified that the --meta option will be used (as shown in the line below, line 128) at any time, there is a conflict.
if(-e $par2name) { $command="$spades_soft --meta --pe1-1 $par1name --pe1-2 $par2name -m 400 $assembler_options -t $numthreads -o $datapath/spades >> $syslogfile 2>&1"; }
This leads to the following error.
== Error == you cannot simultaneously use more than one mode out of Metagenomic, Large genome, Illumina TruSeq, RNA-Seq, Plasmid, and Single-cell!
I assume this can be solved by manually adjusting the file to use --rna instead of --meta, but I don't have the required admin permissions for now.
I have managed to adjust the SqueezeMeta.pl and 01.run_assembly_merged.pl in a way that I have a functioning pipeline which uses rnaSPAdes. If you would be interested, I asked about the specifics on the different SPAdes modes on their forum.
Thanks for the heads up! Let us know if this works for you
Hi all,
Since it's been a long time and I see multiple topics pointing to this one, I will give an update on the results so far. Manually changing the --meta to --rna works and assuming that there is enough RAM to run the more demanding SPAdes, this works. Take in mind, in the newer versions of SPAdes, the algorithm for rnaSPAdes has been changed in the newer versions, which makes it more memory-efficient. Discussion on the SPAdes forum. Using this newer version was necessary in my case, since for some samples it was even impossible to assemble even one sample because of memory issues.
rnaSPAdes seems to give a coverage percentage quite similar to the normal SPAdes, I am not sure yet what the advantage of using this compared to SPAdes would be, although it is the recommended option by the developers themselves.
That said, it was impossible to merge too many samples at once, even in seqmerged mode my estimates were that only the merging step would take 3 months. I therefore chose to cut my samples in three groups and run three 12 sample runs. One could argue that the (yet unknown) advantage that (rna)SPAdes might give in coverage, does not weigh against the better use of resources and time by MEGAHIT, and more correct statistics one can derive from that.
That's it for now, I will let you know any further updates.
Thank you @seppedm! We will include the new version of SPAdes in SqueezeMeta by default. Regarding the merging, we have a bottleneck in the usage of minimus2, and we haven't found a suitable alternative yet.
Best, Javier
Hi @jtamames do you know when you are likely to release an updated squeezemeta with the new version of Spades?
Hi, it's definitely on the cards. I'd have to find some time to integrate/test, which won't happen for at least the next few weeks.
However i'd say that if you edit the /path/to/SqueezeMeta/SqueezeMeta_conf.pl
file, search for $spades_soft
and change the value so it points to the executable of the newer version, it _should_work.
I wouldn't need to make any alterations to the new spades.py?
I don't think so. Spades has only minor alterations so it can find the library location within the SqueezeMeta directory structure. A different version should find its own libraries just fine.
Hi @fpusan @jtamames. I ran my assembly using spades with the --rna tag. As you suggested I changed the path to the latest version and the assembly seems to have worked, however the run failed transitioning to step 2.
I've attached the syslog and spadeslog file below:
When I try to restart at step 2 I get this error:
restart.pl pluro_all_spades -step 2
[0 seconds]: STEP2 -> RNA PREDICTION: 02.rnas.pl
cp: cannot stat '/scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/01.pluro_all_spades.fasta': No such file or directory
Running barrnap (Seeman 2014, Bioinformatics 30, 2068-9) for predicting RNAs: Bacteria[14:56:48] Usage: barrnap <file.fasta>
Error running command: /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/bin/barrnap --quiet --threads 16 --kingdom bac --reject 0.1 /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/intermediate/02.pluro_all_spades.maskedrna.fasta --dbdir /scratch_mollison/users/30041036/squeezemeta/databases/db > /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/temp/bac.gff at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/02.rnas.pl line 54.
Stopping in STEP2 -> 02.rnas.pl
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/restart.pl line 162.
Hello What is the result when running this?
/mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/bin/prinseq-lite.pl -fasta /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/data/spades/contigs.fasta -min_len 200 -out_good /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/prinseq; mv /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/prinseq.fasta /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/01.pluro_all_spades.fasta
Looks like the contigs file is missing. Another user has the same error, I will check it asap.
Hi, this is the output:
ERROR: could not find input file "/scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/data/spades/contigs.fasta".
Try 'perl prinseq-lite.pl -h' for more information.
Exit program.
Also if it helps, this is the contents of the data/spades directory:
assembly_graph_after_simplification.gfa before_rr.fasta input_dataset.yaml misc run_spades.sh spades.log transcripts.paths
assembly_graph.fastg dataset.info K33 params.txt run_spades.yaml tmp
assembly_graph_with_scaffolds.gfa hard_filtered_transcripts.fasta K49 pipeline_state soft_filtered_transcripts.fasta transcripts.fasta
The results/ directory is empty
Ok, I get it now...
When assemblying rna, spades names the resulting assembly as transcripts.fasta
, instead of contigs.fasta
as it is the case for DNA, and that is the name SqueezeMeta is looking for.
Just copy the transcripts.fasta
file to the results directory, renaming it to 01.<project>.fasta
. That is:
cp /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/data/spades/transcripts.fasta /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/01.pluro_all_spades.fasta
and restart. That should do it.
Best, J
Thanks @jtamames I'll give it a go. Also, when would you use the - - singletons tag when running squeezemeta? I'm not sure I understand correctly, does that essentially include unmapped reads in the annotation steps?
Hello again
I think it is better if, instead of restarting, you create a new project and provide this transcript.fasta assembly as external assembly using the -extassembly option. The reason is that step 01 also generates an auxiliary file 01.project.lon containing the length of the contigs. If you just copy the file and restart, that file will be missing and will crash the run later on.
Best, J
Regarding --singletons, I would use it when: 1) You want to be sure not to lose anything in the assembly, or 2) When the mapping percentage is too low and most reads did not make it into the assembly. But this will come at the cost of having myriads of short contigs that will probably burden your analysis
OK I will try the external assembly option. Can I include the singletons flag when using an external assembly?
And sorry one more question. What is the advantage/disadvantage of running an assembly with --singletons vs running squeezemeta in the reads_only mode?
Singletons flag: yes reads vs singletons: In the coassembly+singletons you will benefit of the virtues of the assemblies: longer contigs and possibility of binning
Thanks for that. I tried re running as you suggested with the external assembly and received this error:
Now merging files
[5 minutes, 13 seconds]: STEP1 -> RUNNING CO-ASSEMBLY: 01.run_assembly.pl (spades)
External assembly provided: pluro_all_spades/data/spades/transcripts.fasta. Overriding assembly
Renaming contigs
Counting length of contigs
Contigs stored in /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/results/01.spades_extass.fasta
Number of contigs: 945152
[5 minutes, 31 seconds]: STEP1 -> ADDING SINGLETONS: 01.remap.pl (spades)
Global symbol "$outsam" requires explicit package name (did you forget to declare "my $outsam"?) at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl line 247.
Execution of /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl aborted due to compilation errors.
Stopping in STEP1 -> 01.remap.pl (megahit)
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/SqueezeMeta.pl line 790.
I tried starting another run using megahit with singletons but I am getting the same error:
SqueezeMeta v1.3.1 - (c) J. Tamames, F. Puente-Sánchez CNB-CSIC, Madrid, SPAIN
Please cite: Tamames & Puente-Sanchez, Frontiers in Microbiology 9, 3349 (2019). doi: https://doi.org/10.3389/fmicb.2018.03349
Run started Tue Feb 16 08:32:44 2021 in coassembly mode
Now creating directories
Reading configuration from /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_singletons/SqueezeMeta_conf.pl
Reading samples from /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_singletons/data/00.pluro_all_singletons.samples
26 samples found: P10_E P6_E P9_M P9_MI P10_B P3_E P1_M P6_B P5_M P8_E P5_B P9_SI P5_E P8_M P1_B C1 P9_B P8_B P4_M P3_B P10_M P4_B C2 P4_E P1_E P9_E
Now merging files
[4 minutes, 18 seconds]: STEP1 -> RUNNING CO-ASSEMBLY: 01.run_assembly.pl (megahit)
Running assembly with megahit
Running prinseq (Schmieder et al 2011, Bioinformatics 27(6):863-4) for selecting contigs longer than 200
Input and filter stats:
Input sequences: 203,018
Input bases: 129,674,408
Input mean length: 638.73
Good sequences: 203,018 (100.00%)
Good bases: 129,674,408
Good mean length: 638.73
Bad sequences: 0 (0.00%)
Sequences filtered by specified parameters:
none
Renaming contigs
Counting length of contigs
Contigs stored in /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_singletons/results/01.pluro_all_singletons.fasta
Number of contigs: 203018
[5 hours, 17 minutes, 26 seconds]: STEP1 -> ADDING SINGLETONS: 01.remap.pl (megahit)
Global symbol "$outsam" requires explicit package name (did you forget to declare "my $outsam"?) at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl line 247.
Execution of /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl aborted due to compilation errors.
Stopping in STEP1 -> 01.remap.pl (megahit)
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/SqueezeMeta.pl line 790.
Hello
Silly mistake, I am sorry. Change the following lines in 01.remap.pl
Line 96:
my($formatseq,$command,$outsam,$formatoption);
change to:
my($formatseq,$command,$formatoption);
And after line 28:
my $samdir="$datapath/sam";
add:
my $outsam;
That should do it. Best,
Hi @jtamames I've made the changes and squeezemeta has progressed further, I'll update how the run goes with the external RNA assembly and singletons flag
Hi @jtamames, this run seemed to stop at step 2 with the following error:
Input and filter stats:
Input sequences: 29,702,807
Input bases: 2,838,146,302
Input mean length: 95.55
Good sequences: 527,704 (1.78%)
Good bases: 234,500,912
Good mean length: 444.38
Bad sequences: 29,175,103 (98.22%)
Bad bases: 2,603,645,390
Bad mean length: 89.24
Sequences filtered by specified parameters:
min_len: 29175103
Counting length of contigs
Contigs stored in /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/results/01.spades_extass.fasta
Number of contigs: 527704
rm: cannot remove '/scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/data/spades_extass.bowtie.bwt': No such file or directory
[6 hours, 59 minutes, 48 seconds]: STEP2 -> RNA PREDICTION: 02.rnas.pl
Running barrnap (Seeman 2014, Bioinformatics 30, 2068-9) for predicting RNAs: Bacteria[14:28:54] Can't find database: /media/disk7/fer/SqueezeMeta/db/bac.hmm
Error running command: /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/bin/barrnap --quiet --threads 20 --kingdom bac --reject 0.1 /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/intermediate/02.spades_extass.maskedrna.fasta --dbdir /media/disk7/fer/SqueezeMeta/db > /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/temp/bac.gff at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/02.rnas.pl line 54.
Stopping in STEP2 -> 02.rnas.pl. Program finished abnormally
sh: 1: tree: not found
If you don't know what went wrong or want further advice, please look for similar issues in https://github.com/jtamames/SqueezeMeta/issues
Feel free to open a new issue if you don't find the answer there. Please add a brief description of the problem and upload the /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/syslog file (zip it first)
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/SqueezeMeta.pl line 1359.
This points to a bad installation of the databases. Did you move them to a different folder?
Try running configure_nodb.pl
and if that does not solve the problem, download and install the databases again.
Best,
J
Yep I figured as much. I've decided just to do a clean install of the latest squeezemeta version and databases
If that doesn´t work or need further advice, please open a new issue. We must be spamming the poor @seppedm with alert messages not related to his original issue. Best,
Here a final update, for future reference. It seems that when doing transcriptomics, a very large number of unassigned reads is very common if I see other posts. Below two graphs. The first one showing the results for my run (rnaSPAdes), the second one a sqm_reads.pl run (compare T12 and T117). Indeed, more reads are assigned taxonomically in the sqm_reads.pl, but still the large majority is unassigned. It thus seems that the lack of assignment is mainly a database issue when doing (soil) metatranscriptomics, not something related to the assemblies. I can only assume this means there is a lot of science left to be done!!!
When looking at functional assignment, this is significantly different for sqm_reads.pl vs assemblies, mostly in the COG assignment, where there is an overabundance of hypothetical proteins. Here the assembly method seems to have the advantage for more accurate function predictions.
Hello! Thank you very much for this insigth. The improvement of sqm_reads seems to be related to the percentage of unmapped reads in the assembly. I guess that it is due to the small size of some transcripts that perphaps were thrown away either by rnaSPAdes or SqueezeMeta (did you keep the default min contig length of 200?). The percentage of unclassified will probably be reduced by taking advantage of the corresponding metagenome, to which the transcripts could be mapped. My preferred strategy could be: 1) Assembly metagenome; 2) Map transcripts to metagenome; 3) Recover unmapped transcripts and analyze them with sqm_reads.
Best, J
Dear developers,
Our group is working on a soil metatranscriptomic project with a quite large amount of samples. We plan to use the merged mode. Specifically for this kind of project, is there one of the assemblers that you would recommend over the other (first of all with regards to quality) and why?
Second question: It seems it is not possible to run SPAdes in the rnaSPAdes modus, is that correct? The rnaSPAdes manual specifically recommends using rnaSPAdes for metatranscriptomic analyses. Do you expect this will have a big impact on the analysis?
Thanks in advance!!