Closed aduvermy closed 3 years ago
You identified the problem!
According to the log, you did get the SPAdes output, didn't you?
/mnt/mydatalocal/atlas/metaT/SRR13523374-interleaved/assembly/transcripts.fasta
To be honest, I kept the RNA workflow in Atlas when I took over the project, but I didn't test it thoroughly. If you can describe your use case, I can think about adapting the Atlas pipeline to fit it.
What would you like to do with these transcripts? I don't think the rest of the Atlas pipeline, e.g. binning, is appropriate. Do you also have metagenomes for the same samples?
Thanks for your reply!
I understand there are a lot of options. It might take some time to test all of them.
My work is part of a metagenome and metatranscriptome exploration of an aquatic environment.
Currently, I'm using independent public metagenomic and metatranscriptomic datasets from NCBI for this aquatic environment. New metagenomic and metatranscriptomic libraries from the same sample will follow soon.
I'm trying to choose the best way to analyze this kind of data. Because the SPAdes documentation recommends rnaSPAdes "to assemble metatranscriptomic data", I tried rnaSPAdes.
I'm also using SqueezeMeta, another "metapipeline", to characterize my environment. SqueezeMeta offers an option for a combined analysis of metagenome and metatranscriptome data. I guess it will improve gene inference and give access to comparative transcriptomics analysis.
By using Atlas with --data-type Metatranscriptome, I aimed to find another way to process this kind of data and highlight results.
I fixed the MissingOutputException in line 375 (code below), but as you expected, the Atlas pipeline does not seem to be adapted to this approach based on transcript assembly. It leads to another issue, linked to the CheckM filtering: CheckM expects full genomes, whereas I give it transcripts, so the completeness and contamination values are too poor to pass the filter.
logger.log
pre_dereplication.log
quality.csv
About binning: binning of my transcript assembly seems to succeed. I only had to lower the DASTool threshold to 0 to avoid an issue, but that does not worry me. In fact, I had already lowered this parameter when I ran Atlas on a metagenome dataset (the default Atlas approach), and Atlas recovered a bacterium I was expecting. So this parameter does not seem to be essential. For now, however, it's hard to evaluate the binning quality for transcripts. In your opinion, will the binning results always be of poor quality?
The structure of a protein is three to ten times more conserved during evolution than its amino acid sequence (Illergård et al. 2009). Based on the selection that acts on genes, I wonder if metatranscriptomic data could be used to characterize a metagenome with high accuracy.
For anyone who wants to obtain the rnaSPAdes output (transcripts.fasta):
The MissingOutputException in line 375 seems to be fixed with the trick below.
I modified the file assemble.snakefile on the lines marked with a *.
### update ###
* spades_output_used = {'rna': 'transcripts', 'meta': 'contigs', 'normal': 'contigs'}
rule run_spades:
    input:
        expand("{{sample}}/assembly/reads/{assembly_preprocessing_steps}_{fraction}.fastq.gz",
               fraction=ASSEMBLY_FRACTIONS,
               assembly_preprocessing_steps=assembly_preprocessing_steps)
    output:
        ### update ###
*       "{{sample}}/assembly/{sequences}.fasta".format(sequences='scaffolds' if config['spades_use_scaffolds'] else spades_output_used[config['spades_preset']])
        ### old version ###
        #"{sample}/assembly/contigs.fasta",
        #"{sample}/assembly/scaffolds.fasta"
    benchmark:
        "logs/benchmarks/assembly/spades/{sample}.txt"
    params:
        p = lambda wc, input: spades_parameters(wc, input),
        k = config.get("spades_k", SPADES_K),
    log:
        "{sample}/logs/assembly/spades.log"
    conda:
        "%s/assembly.yaml" % CONDAENV
    threads:
        config["assembly_threads"]
    resources:
        mem = config["assembly_memory"],
        time = config["runtime"]["assembly"]
    shell:
        "spades.py "
        " --threads {threads} "
        " --memory {resources.mem} "
        " -o {params.p[outdir]} "
        " -k {params.k}"
        " {params.p[preset]} "
        " {params.p[extra]} "
        " {params.p[inputs]} "
        " {params.p[longreads]} "
        " {params.p[skip_error_correction]} "
        " > {log} 2>&1 "

localrules: rename_spades_output
### update ###
* spades_output_used = {'rna': 'transcripts', 'meta': 'contigs', 'normal': 'contigs'}
rule rename_spades_output:
    input:
        ### update ###
*       "{{sample}}/assembly/{sequences}.fasta".format(sequences='scaffolds' if config['spades_use_scaffolds'] else spades_output_used[config['spades_preset']])
        ### old version ###
        #"{{sample}}/assembly/{sequences}.fasta".format(sequences= 'scaffolds' if config['spades_use_scaffolds'] else 'contigs' )
    output:
        temp("{sample}/assembly/{sample}_raw_contigs.fasta")
    shell:
        "cp {input} {output}"
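The lookup on the starred lines can be tried outside Snakemake. This is only a minimal sketch: a plain dict stands in for the Atlas config, and the helper names spades_output_name and spades_output_path are hypothetical, not part of Atlas.

```python
# Hypothetical helpers, not part of Atlas: they mirror the starred lines above.
SPADES_OUTPUT_USED = {"rna": "transcripts", "meta": "contigs", "normal": "contigs"}


def spades_output_name(config):
    """Return the base name of the SPAdes FASTA file the rule should expect."""
    if config["spades_use_scaffolds"]:
        # Same precedence as the patch: scaffolds win over the preset lookup.
        return "scaffolds"
    return SPADES_OUTPUT_USED[config["spades_preset"]]


def spades_output_path(config):
    """Build the Snakemake output pattern; {sample} stays a wildcard."""
    return "{{sample}}/assembly/{sequences}.fasta".format(
        sequences=spades_output_name(config)
    )


# Stand-in for the Atlas config (assumption: only these two keys matter here).
print(spades_output_path({"spades_use_scaffolds": False, "spades_preset": "rna"}))
# prints {sample}/assembly/transcripts.fasta
```

Because the scaffolds branch takes precedence, the rna preset only selects transcripts.fasta when spades_use_scaffolds is false.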
Good fix, thank you.
I don't think binning is a good idea, because neither the abundance nor the tetranucleotide frequencies are reliable in transcriptome data.
I hope you can predict the genes from the transcripts by running atlas run genecatalog.
Then you can follow #276
I think this is the best you can do with a metatranscriptome alone.
With a paired metagenome/metatranscriptome you could also assemble the metagenome and then map the RNA reads to the contigs or MAGs to quantify the expression of each gene. I think this is what SqueezeMeta does.
For this, I would need to implement an option so that the transcriptome data is not binned but only used for the mapping.
You said the SPAdes developers recommend rnaSPAdes for metatranscriptomes, not only for transcriptomes?
Also have a look at https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1116-8. This is a somewhat older pipeline and I find it difficult to install, but it also analyzes paired metagenome and metatranscriptome data, which you could use for tests.
If I understand correctly, it makes sense that not all genes of a species follow the same evolutionary process, so they will not follow the global tetranucleotide frequencies of the species. (Correct me if I'm wrong.)
https://cab.spbu.ru/files/release3.14.1/rnaspades_manual.html
You described exactly the approach of squeezeMeta.
Thanks for the advice! IMP seems to be a good alternative.
The transcripts might follow the same tetranucleotide frequencies, but I assume they are quite short (≤200 nt), which makes the 4-mer frequencies less reliable.
I think I'll close the issue, or do you have other questions or suggestions?
Here is some more advice from my friends: https://twitter.com/SilasKieser/status/1389573614517866496
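To make the length argument concrete, here is a small sketch that counts 4-mers in a sequence. The function names are illustrative only, and the sketch counts each strand's k-mers as they appear without merging reverse complements, so it is not how any particular binner computes its profile.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each of the 256 possible 4-mers."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing N or other symbols are not in `kmers` and are ignored.
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


def observed_fraction(seq):
    """Fraction of the 256 possible 4-mers seen at least once in seq."""
    freqs = tetranucleotide_frequencies(seq)
    return sum(1 for f in freqs.values() if f > 0) / len(freqs)


# A 200 nt sequence has only 197 overlapping windows,
# fewer than the 256 possible 4-mers.
n_windows = 200 - 4 + 1
print(n_windows, 4 ** 4)  # prints 197 256
```

With fewer windows than possible 4-mers, at least 59 of the 256 4-mers must have a count of zero in any 200 nt transcript, so the profile is dominated by sampling noise.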
Hi,
Currently, I'm using ATLAS on metatranscriptomic data. I specified --data-type Metatranscriptome as mentioned in the documentation. First, I launched it with the default SPAdes options. Atlas ran perfectly! spades_meta.log
Then I spotted the SPAdes preset option, which offers an "RNA mode" that seems better suited to my datasets. Once this option was set, Snakemake reported MissingOutputException in line 375 of atlas/rules/assemble.snakefile. 2021-04-26T162305.751610.snakemake.log
As suggested by Snakemake, I increased latency-wait to 3600 s to be sure the output files had finished being written, but without success.
spades_rna.log
I wonder whether the rnaSPAdes output file names are consistent with those of metaSPAdes. https://cab.spbu.ru/files/release3.14.1/rnaspades_manual.html (section 2.3, rnaSPAdes output)
Regards,
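One quick way to check the file-name question is to list which FASTA files SPAdes actually wrote. This is a rough sketch: the candidate names are only the ones mentioned in this thread, the function name is made up for illustration, and the directory in the usage line is an example path.

```python
from pathlib import Path

# File names mentioned in this thread; other SPAdes modes may write others.
CANDIDATES = ("transcripts.fasta", "contigs.fasta", "scaffolds.fasta")


def existing_assembly_outputs(assembly_dir):
    """Return the candidate FASTA names that exist and are non-empty."""
    directory = Path(assembly_dir)
    return [
        name
        for name in CANDIDATES
        if (directory / name).is_file() and (directory / name).stat().st_size > 0
    ]


# Example path (assumption): replace with your own <sample>/assembly directory.
print(existing_assembly_outputs("SRR13523374-interleaved/assembly"))
```

If only transcripts.fasta shows up, a rule that declares contigs.fasta or scaffolds.fasta as output will fail with MissingOutputException no matter how long the latency wait is.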