jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
380 stars 80 forks source link

Metatranscriptomics #200

Closed seppedm closed 3 years ago

seppedm commented 3 years ago

Dear developers,

Our group is working on a soil metatranscriptomic project with a quite large amount of samples. We plan to use the merged mode. Specifically for this kind of project, is there one of the assemblers that you would recommend over the other (first of all with regards to quality) and why?

Second question: It seems it is not possible to run SPAdes in the rnaSPAdes modus, is that correct? The rnaSPAdes manual specifically recommends using rnaSPAdes for metatranscriptomic analyses. Do you expect this will have a big impact on the analysis?

Thanks in advance!!

fpusan commented 3 years ago

is there one of the assemblers that you would recommend over the other (first of all with regards to quality) and why?

Not really, we routinely use megahit due to its lower memory consumption, but we've gotten good results with SPAdes too. Personally, I don't have a formed opinion on which of the two might be better (it seems that the rnaSPAdes mode can be interesting in your case but I have no experience using it).

Second question: It seems it is not possible to run SPAdes in the rnaSPAdes modus, is that correct?

Maybe you can. The -assembly_options when calling SqueezeMeta allows you to pass extra arguments to the assembler. So I think that something along the lines of -a spades -assembly_options "--rna" should make it run RNAspades.

seppedm commented 3 years ago

Thank you for your quick response! I've tried the option you proposed, but since in the 01.run_assembly_merged.pl script is specified that the --meta option will be used (as shown in the line below, line 128) at any time, there is a conflict.

    if(-e $par2name) { $command="$spades_soft --meta --pe1-1 $par1name --pe1-2 $par2name -m 400 $assembler_options -t $numthreads -o $datapath/spades >> $syslogfile  2>&1"; }

This leads to the following error.

== Error == you cannot simultaneously use more than one mode out of Metagenomic, Large genome, Illumina TruSeq, RNA-Seq, Plasmid, and Single-cell!

I assume this can be solved by manually adjusting the file to use --rna instead of --meta, but I don't have the required admin permissions for now.

seppedm commented 3 years ago

I have managed to adjust the SqueezeMeta.pl and 01.run_assembly_merged.pl in a way that I have a functioning pipeline which uses rnaSPAdes. If you would be interested, I asked about the specifics on the different SPAdes modes on their forum.

fpusan commented 3 years ago

Thanks for the heads up! Let us know if this works for you

seppedm commented 3 years ago

Hi all,

Since it's been a long time and I see multiple topics pointing to this one, I will give an update on the results so far. Manually changing the --meta to --rna works and assuming that there is enough RAM to run the more demanding SPAdes, this works. Take in mind, in the newer versions of SPAdes, the algorithm for rnaSPAdes has been changed in the newer versions, which makes it more memory-efficient. Discussion on the SPAdes forum. Using this newer version was necessary in my case, since for some samples it was even impossible to assemble even one sample because of memory issues.

rnaSPAdes seems to give a coverage percentage quite similar to the normal SPAdes, I am not sure yet what the advantage of using this compared to SPAdes would be, although it is the recommended option by the developers themselves.

That said, it was impossible to merge too many samples at once, even in seqmerged mode my estimates were that only the merging step would take 3 months. I therefore chose to cut my samples in three groups and run three 12 sample runs. One could argue that the (yet unknown) advantage that (rna)SPAdes might give in coverage, does not weigh against the better use of resources and time by MEGAHIT, and more correct statistics one can derive from that.

That's it for now, I will let you know any further updates.

jtamames commented 3 years ago

Thank you @seppedm! We will include the new version of SPAdes in SqueezeMeta by default. Regarding the merging, we have a bottleneck in the usage of minimus2, and we haven't found a suitable alternative yet.

Best, Javier

mradz19 commented 3 years ago

Hi @jtamames do you know when you are likely to release an updated squeezemeta with the new version of Spades?

fpusan commented 3 years ago

Hi, it's definitely on the cards. I'd have to find some time to integrate/test, which won't happen for at least the next few weeks.

However i'd say that if you edit the /path/to/SqueezeMeta/SqueezeMeta_conf.pl file, search for $spades_soft and change the value so it points to the executable of the newer version, it _should_work.

mradz19 commented 3 years ago

I wouldn't need to make any alterations to the new spades.py?

fpusan commented 3 years ago

I don't think so. Spades has only minor alterations so it can find the library location within the SqueezeMeta directory structure. A different version should find its own libraries just fine.

mradz19 commented 3 years ago

Hi @fpusan @jtamames. I ran my assembly using spades with the --rna tag. As you suggested I changed the path to the latest version and the assembly seems to have worked, however the run failed transitioning to step 2.

I've attached the syslog and spadeslog file below:

spades.log

syslog.txt

When I try to restart at step 2 I get this error:

restart.pl pluro_all_spades -step 2
[0 seconds]: STEP2 -> RNA PREDICTION: 02.rnas.pl
cp: cannot stat '/scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/01.pluro_all_spades.fasta': No such file or directory
  Running barrnap (Seeman 2014, Bioinformatics 30, 2068-9) for predicting RNAs:  Bacteria[14:56:48] Usage: barrnap <file.fasta>
Error running command:    /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/bin/barrnap --quiet --threads 16 --kingdom bac --reject 0.1 /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/intermediate/02.pluro_all_spades.maskedrna.fasta --dbdir /scratch_mollison/users/30041036/squeezemeta/databases/db > /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/temp/bac.gff at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/02.rnas.pl line 54.
Stopping in STEP2 -> 02.rnas.pl
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/restart.pl line 162.
jtamames commented 3 years ago

Hello What is the result when running this?

/mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/bin/prinseq-lite.pl -fasta /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/data/spades/contigs.fasta -min_len 200 -out_good /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/prinseq; mv /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/prinseq.fasta /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/01.pluro_all_spades.fasta

Looks like the contigs file is missing. Another user has the same error, I will check it asap.

mradz19 commented 3 years ago

Hi, this is the output:

ERROR: could not find input file "/scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/data/spades/contigs.fasta".

Try 'perl prinseq-lite.pl -h' for more information.
Exit program.
mradz19 commented 3 years ago

Also if it helps, this is the contents of the data/spades directory:

assembly_graph_after_simplification.gfa  before_rr.fasta                  input_dataset.yaml  misc            run_spades.sh                    spades.log         transcripts.paths
assembly_graph.fastg                     dataset.info                     K33                 params.txt      run_spades.yaml                  tmp
assembly_graph_with_scaffolds.gfa        hard_filtered_transcripts.fasta  K49                 pipeline_state  soft_filtered_transcripts.fasta  transcripts.fasta

The results/ directory is empty

jtamames commented 3 years ago

Ok, I get it now...

When assemblying rna, spades names the resulting assembly as transcripts.fasta, instead of contigs.fasta as it is the case for DNA, and that is the name SqueezeMeta is looking for.

Just copy the transcripts.fasta file to the results directory, renaming it to 01.<project>.fasta. That is:

cp /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/data/spades/transcripts.fasta /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_spades/results/01.pluro_all_spades.fasta

and restart. That should do it.

Best, J

mradz19 commented 3 years ago

Thanks @jtamames I'll give it a go. Also, when would you use the - - singletons tag when running squeezemeta? I'm not sure I understand correctly, does that essentially include unmapped reads in the annotation steps?

jtamames commented 3 years ago

Hello again

I think it is better if, instead of restarting, you create a new project and provide this transcript.fasta assembly as external assembly using the -extassembly option. The reason is that step 01 also generates an auxiliary file 01.project.lon containing the length of the contigs. If you just copy the file and restart, that file will be missing and will crash the run later on.

Best, J

jtamames commented 3 years ago

Regarding --singletons, I would use it when: 1) You want to be sure not to lose anything in the assembly, or 2) When the mapping percentage is too low and most reads did not make it into the assembly. But this will come at the cost of having myriads of short contigs that will probably burden your analysis

mradz19 commented 3 years ago

OK I will try the external assembly option. Can I include the singletons flag when using an external assembly?

mradz19 commented 3 years ago

And sorry one more question. What is the advantage/disadvantage of running an assembly with --singletons vs running squeezemeta in the reads_only mode?

jtamames commented 3 years ago

Singletons flag: yes reads vs singletons: In the coassembly+singletons you will benefit of the virtues of the assemblies: longer contigs and possibility of binning

mradz19 commented 3 years ago

Thanks for that. I tried re running as you suggested with the external assembly and received this error:

Now merging files
[5 minutes, 13 seconds]: STEP1 -> RUNNING CO-ASSEMBLY: 01.run_assembly.pl (spades)
  External assembly provided: pluro_all_spades/data/spades/transcripts.fasta. Overriding assembly
  Renaming contigs
  Counting length of contigs
  Contigs stored in /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/results/01.spades_extass.fasta
  Number of contigs: 945152
[5 minutes, 31 seconds]: STEP1 ->  ADDING SINGLETONS: 01.remap.pl (spades)
Global symbol "$outsam" requires explicit package name (did you forget to declare "my $outsam"?) at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl line 247.
Execution of /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl aborted due to compilation errors.
Stopping in STEP1 -> 01.remap.pl (megahit)
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/SqueezeMeta.pl line 790.
mradz19 commented 3 years ago

I tried starting another run using megahit with singletons but I am getting the same error:

SqueezeMeta v1.3.1 - (c) J. Tamames, F. Puente-Sánchez CNB-CSIC, Madrid, SPAIN

Please cite: Tamames & Puente-Sanchez, Frontiers in Microbiology 9, 3349 (2019). doi: https://doi.org/10.3389/fmicb.2018.03349

Run started Tue Feb 16 08:32:44 2021 in coassembly mode
Now creating directories
Reading configuration from /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_singletons/SqueezeMeta_conf.pl
Reading samples from /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_singletons/data/00.pluro_all_singletons.samples
26 samples found: P10_E P6_E P9_M P9_MI P10_B P3_E P1_M P6_B P5_M P8_E P5_B P9_SI P5_E P8_M P1_B C1 P9_B P8_B P4_M P3_B P10_M P4_B C2 P4_E P1_E P9_E

Now merging files
[4 minutes, 18 seconds]: STEP1 -> RUNNING CO-ASSEMBLY: 01.run_assembly.pl (megahit)
  Running assembly with megahit
  Running prinseq (Schmieder et al 2011, Bioinformatics 27(6):863-4) for selecting contigs longer than 200 
Input and filter stats:
        Input sequences: 203,018
        Input bases: 129,674,408
        Input mean length: 638.73
        Good sequences: 203,018 (100.00%)
        Good bases: 129,674,408
        Good mean length: 638.73
        Bad sequences: 0 (0.00%)
        Sequences filtered by specified parameters:
        none
  Renaming contigs
  Counting length of contigs
  Contigs stored in /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/pluro_all_singletons/results/01.pluro_all_singletons.fasta
  Number of contigs: 203018
[5 hours, 17 minutes, 26 seconds]: STEP1 ->  ADDING SINGLETONS: 01.remap.pl (megahit)
Global symbol "$outsam" requires explicit package name (did you forget to declare "my $outsam"?) at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl line 247.
Execution of /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/01.remap.pl aborted due to compilation errors.
Stopping in STEP1 -> 01.remap.pl (megahit)
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/SqueezeMeta.pl line 790.
jtamames commented 3 years ago

Hello

Silly mistake, I am sorry. Change the following lines in 01.remap.pl

Line 96: my($formatseq,$command,$outsam,$formatoption); change to: my($formatseq,$command,$formatoption);

And after line 28: my $samdir="$datapath/sam"; add: my $outsam;

That should do it. Best,

mradz19 commented 3 years ago

Hi @jtamames I've made the changes and squeezemeta has progressed further, I'll update how the run goes with the external RNA assembly and singletons flag

mradz19 commented 3 years ago

Hi @jtamames, this run seemed to stop at step 2 with the following error:

Input and filter stats:
        Input sequences: 29,702,807
        Input bases: 2,838,146,302
        Input mean length: 95.55
        Good sequences: 527,704 (1.78%)
        Good bases: 234,500,912
        Good mean length: 444.38
        Bad sequences: 29,175,103 (98.22%)
        Bad bases: 2,603,645,390
        Bad mean length: 89.24
        Sequences filtered by specified parameters:
        min_len: 29175103
  Counting length of contigs
  Contigs stored in /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/results/01.spades_extass.fasta
  Number of contigs: 527704
rm: cannot remove '/scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/data/spades_extass.bowtie.bwt': No such file or directory
[6 hours, 59 minutes, 48 seconds]: STEP2 -> RNA PREDICTION: 02.rnas.pl
  Running barrnap (Seeman 2014, Bioinformatics 30, 2068-9) for predicting RNAs:  Bacteria[14:28:54] Can't find database: /media/disk7/fer/SqueezeMeta/db/bac.hmm
Error running command:    /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/bin/barrnap --quiet --threads 20 --kingdom bac --reject 0.1 /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/intermediate/02.spades_extass.maskedrna.fasta --dbdir /media/disk7/fer/SqueezeMeta/db > /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/temp/bac.gff at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/SqueezeMeta/scripts/02.rnas.pl line 54.
Stopping in STEP2 -> 02.rnas.pl. Program finished abnormally
sh: 1: tree: not found

  If you don't know what went wrong or want further advice, please look for similar issues in https://github.com/jtamames/SqueezeMeta/issues
  Feel free to open a new issue if you don't find the answer there. Please add a brief description of the problem and upload the /scratch_mollison/users/30041036/squeezemeta/databases/plurogel/spades_extass/syslog file (zip it first)
Died at /mnt/nfs/home/30041036/.conda/envs/SqueezeMeta/bin/SqueezeMeta.pl line 1359.
jtamames commented 3 years ago

This points to a bad installation of the databases. Did you move them to a different folder? Try running configure_nodb.pl and if that does not solve the problem, download and install the databases again. Best, J

mradz19 commented 3 years ago

Yep I figured as much. I've decided just to do a clean install of the latest squeezemeta version and databases

jtamames commented 3 years ago

If that doesn´t work or need further advice, please open a new issue. We must be spamming the poor @seppedm with alert messages not related to his original issue. Best,

seppedm commented 3 years ago

Here a final update, for future reference. It seems that when doing transcriptomics, a very large number of unassigned reads is very common if I see other posts. Below two graphs. The first one showing the results for my run (rnaSPAdes), the second one a sqm_reads.pl run (compare T12 and T117). Indeed, more reads are assigned taxonomically in the sqm_reads.pl, but still the large majority is unassigned. It thus seems that the lack of assignment is mainly a database issue when doing (soil) metatranscriptomics, not something related to the assemblies. I can only assume this means there is a lot of science left to be done!!!

When looking at functional assignment, this is significantly different for sqm_reads.pl vs assemblies, mostly in the COG assignment, where there is an overabundance of hypothetical proteins. Here the assembly method seems to have the advantage for more accurate function predictions.

afbeelding afbeelding

jtamames commented 3 years ago

Hello! Thank you very much for this insigth. The improvement of sqm_reads seems to be related to the percentage of unmapped reads in the assembly. I guess that it is due to the small size of some transcripts that perphaps were thrown away either by rnaSPAdes or SqueezeMeta (did you keep the default min contig length of 200?). The percentage of unclassified will probably be reduced by taking advantage of the corresponding metagenome, to which the transcripts could be mapped. My preferred strategy could be: 1) Assembly metagenome; 2) Map transcripts to metagenome; 3) Recover unmapped transcripts and analyze them with sqm_reads.

Best, J