metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Atlas for metatranscriptome. #385

Closed aduvermy closed 3 years ago

aduvermy commented 3 years ago

Hi,

Currently, I'm using ATLAS on metatranscriptomic data. I specified --data-type Metatranscriptome as mentioned in the documentation. First, I launched it with the default SPAdes options. Atlas ran perfectly! spade_dft_parms spades_meta.log

Then I spotted the SPAdes preset option, which offers an "RNA mode" that seems better suited to my datasets. Once I used this option, Snakemake reported a MissingOutputException in line 375 of atlas/rules/assemble.snakefile. 2021-04-26T162305.751610.snakemake.log

As suggested by Snakemake, I increased latency-wait to 3600 s to make sure the output files had finished being written, but without success... :/

spades_rna.log

I wonder whether the rnaSPAdes output file names are consistent with those of metaSPAdes. https://cab.spbu.ru/files/release3.14.1/rnaspades_manual.html (part 2.3 rnaSPAdes output)

Regards,

SilasK commented 3 years ago

You identified the problem!

According to the log, you did get the SPAdes output, didn't you? /mnt/mydatalocal/atlas/metaT/SRR13523374-interleaved/assembly/transcripts.fasta

To be honest, I kept the RNA workflow in Atlas when I took over the project, but I didn't test it thoroughly. If you can describe your use case, I can think about adapting the Atlas pipeline to fit it.

What would you like to do with these transcripts? I don't think the rest of the Atlas pipeline, e.g. binning, is appropriate. Do you also have metagenomes for the same samples?

aduvermy commented 3 years ago

Thanks for your reply!

I understand there are a lot of options. It might take some time to test all of them.

My work is part of a metagenome and metatranscriptome exploration of an aquatic environment.

Currently, I'm using independent public metagenomic and metatranscriptomic datasets from NCBI for this aquatic environment. New metagenomic and metatranscriptomic libraries from the same sample will follow soon.

I'm trying to choose the best way to analyze this kind of data. Because the SPAdes documentation recommends using it "to assemble metatranscriptomic data", I tried rnaSPAdes.

I'm also using SqueezeMeta, another "metapipeline", to characterize my environment. SqueezeMeta offers an option for a combined analysis of metagenome and metatranscriptome data. I guess it will improve gene inference and give access to comparative transcriptomic analysis.

By using Atlas with --data-type Metatranscriptome, I aimed to find another way to process this kind of data and highlight results. I fixed the MissingOutputException in line 375 (code below), but as you expected, the Atlas pipeline does not seem to be suited to this approach based on transcript assembly. It leads to another issue linked to CheckM filtering: CheckM expects full genomes, whereas I give it transcripts, so the completeness and contamination values are too poor for CheckM. logger.log pre_dereplication.log quality.csv

About binning: binning of my transcript assembly seems to succeed. I only had to lower the DAS Tool threshold to 0 to avoid an issue, but that does not worry me. In fact, I had already lowered this parameter when I ran Atlas on a metagenome dataset (the default Atlas approach), and Atlas found a bacterium I was expecting, so this parameter does not seem to be essential. Dastools_threshold However, for now, it's hard to evaluate the binning quality of transcripts. In your opinion, will binning results always be of poor quality?

The structure of a protein is three to ten times more conserved during evolution than its amino acid sequence (Illergård et al. 2009). Based on the selection that acts on genes, I wonder whether metatranscriptomic data could be used to characterize a metagenome with high accuracy.

For anyone who wants to obtain the rnaSPAdes output (transcripts.fasta): the MissingOutputException in line 375 seems to be fixed by the trick below. I modified the file assemble.snakefile on the lines marked with a *

     ### update ###
*   spades_output_used = {'rna':'transcripts', 'meta':'contigs', 'normal':'contigs'}
    rule run_spades:
        input:
            expand("{{sample}}/assembly/reads/{assembly_preprocessing_steps}_{fraction}.fastq.gz",
                fraction=ASSEMBLY_FRACTIONS,
                assembly_preprocessing_steps=assembly_preprocessing_steps)
        output:
            ### update ###
*           "{{sample}}/assembly/{sequences}.fasta".format(sequences= 'scaffolds' if config['spades_use_scaffolds'] else spades_output_used[config['spades_preset']] )
            ### old version ###
            #"{sample}/assembly/contigs.fasta",
            #"{sample}/assembly/scaffolds.fasta"
        benchmark:
            "logs/benchmarks/assembly/spades/{sample}.txt"
        params:
            p= lambda wc,input: spades_parameters(wc,input),
            k = config.get("spades_k", SPADES_K),
        log:
            "{sample}/logs/assembly/spades.log"
        conda:
            "%s/assembly.yaml" % CONDAENV
        threads:
            config["assembly_threads"]
        resources:
            mem = config["assembly_memory"],
            time= config["runtime"]["assembly"]
        shell:
            "spades.py "
            " --threads {threads} "
            " --memory {resources.mem} "
            " -o {params.p[outdir]} "
            " -k {params.k}"
            " {params.p[preset]} "
            " {params.p[extra]} "
            " {params.p[inputs]} "
            " {params.p[longreads]} "
            " {params.p[skip_error_correction]} "
            " > {log} 2>&1 "

    localrules: rename_spades_output
    ### update ###
*    spades_output_used = {'rna':'transcripts', 'meta':'contigs', 'normal':'contigs'}
    rule rename_spades_output:
        input:
            ### update ###
*           "{{sample}}/assembly/{sequences}.fasta".format(sequences= 'scaffolds' if config['spades_use_scaffolds'] else spades_output_used[config['spades_preset']] )
            ### old version ###
            #"{{sample}}/assembly/{sequences}.fasta".format(sequences= 'scaffolds' if config['spades_use_scaffolds'] else 'contigs' )

        output:
            temp("{sample}/assembly/{sample}_raw_contigs.fasta")
        shell:
            "cp {input} {output}"
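
The patch above defines the same preset-to-file mapping twice. As a minimal sketch only (the helper name `spades_output_name` and the constant `SPADES_OUTPUT_BY_PRESET` are hypothetical and not part of ATLAS), the same selection logic could live in one plain Python function that both rules call. It mirrors the logic of the patch as posted and makes no claim about which files rnaSPAdes writes when scaffolds are requested.

```python
# Hypothetical helper, NOT part of ATLAS: one place for the preset -> file mapping
# that the patch above repeats in both rules.
SPADES_OUTPUT_BY_PRESET = {
    "rna": "transcripts",   # rnaSPAdes reports transcripts.fasta
    "meta": "contigs",
    "normal": "contigs",
}


def spades_output_name(preset, use_scaffolds=False):
    """Return the FASTA file name to expect in the SPAdes output directory.

    Same rule as the patch: scaffolds if requested, otherwise the
    preset-specific file.
    """
    if preset not in SPADES_OUTPUT_BY_PRESET:
        raise ValueError(f"unknown spades_preset: {preset!r}")
    stem = "scaffolds" if use_scaffolds else SPADES_OUTPUT_BY_PRESET[preset]
    return f"{stem}.fasta"


if __name__ == "__main__":
    for preset in ("rna", "meta", "normal"):
        print(preset, "->", spades_output_name(preset))
```

In a Snakefile, the result could be formatted into the path once, for example `"{{sample}}/assembly/" + spades_output_name(config["spades_preset"], config["spades_use_scaffolds"])`, so an unknown preset fails early with a clear message instead of a KeyError.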

SilasK commented 3 years ago

Good fix, thank you.

I don't think binning is a good idea, because neither the abundance nor the tetranucleotide frequencies are reliable in transcriptome data.

I hope you can predict the genes from the transcripts by running atlas run genecatalog. Then you can follow #276. I think this is the best you can do with a metatranscriptome alone.

With a paired metagenome/metatranscriptome, you could also assemble the metagenome and then map the RNA reads to the contigs or MAGs to quantify the expression of each gene. I think this is what SqueezeMeta does.

For this, I would need to implement an option so that the transcriptome data are not binned but only used for the mapping.

You said the SPAdes developers recommend rnaSPAdes for metatranscriptomes, not only transcriptomes?
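
To make the quantification step concrete, here is a rough sketch of how mapped read counts per gene could be turned into an expression value. It uses TPM (transcripts per million), which is one common normalization and an assumption here; the thread does not say which measure ATLAS or SqueezeMeta would use. The gene names, counts, and lengths are made up for illustration.

```python
def tpm(read_counts, gene_lengths):
    """Transcripts per million from mapped read counts and gene lengths in bp.

    Both arguments are dicts keyed by gene ID.
    """
    # Length-normalize first: reads per base of gene.
    rates = {gene: read_counts[gene] / gene_lengths[gene] for gene in read_counts}
    total = sum(rates.values())
    if total == 0:
        # No reads mapped anywhere: report zero expression for every gene.
        return {gene: 0.0 for gene in rates}
    # Scale so that the values of one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical counts of RNA reads mapped to three predicted genes.
counts = {"geneA": 100, "geneB": 300, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 500}

values = tpm(counts, lengths)
for gene, value in values.items():
    print(gene, round(value, 1))
```

geneA and geneB end up with the same value even though geneB received three times as many reads, because it is also three times longer. That length correction is the reason raw read counts alone are not comparable between genes.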

SilasK commented 3 years ago

Also have a look at https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1116-8 This is a somewhat older pipeline and I find it difficult to install, but it also analyzes paired metagenome and metatranscriptome data, which you could use for tests.

aduvermy commented 3 years ago

If I understand correctly, it makes sense that not all genes of a species follow the same evolutionary process, so they will not follow the global tetranucleotide frequencies of the species. (Correct me if I'm wrong.)

https://cab.spbu.ru/files/release3.14.1/rnaspades_manual.html (screenshot: rnaSPAdes manual)

You described exactly the approach of SqueezeMeta.

Thanks for the advice! IMP seems to be a good alternative.

SilasK commented 3 years ago

The transcripts might follow the same tetranucleotide frequencies, but I assume they are quite short (≤200 nt), which makes the 4-mer frequencies less reliable.

I think I'll close the issue, or do you have other questions/suggestions?
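
The point about short sequences can be illustrated with a small sketch. A 200 nt sequence contains only 197 overlapping 4-mer windows, while there are 256 possible 4-mers, so at least 59 of them must have a frequency of zero no matter what the sequence is. The function below is a plain illustration, not the profile computation used by any particular binner, and the random test sequence is made up.

```python
import random
from collections import Counter
from itertools import product

ALL_4MERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 entries


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 4-mers in seq (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only count windows made of A, C, G, T; windows containing N are ignored.
    total = sum(counts[kmer] for kmer in ALL_4MERS)
    if total == 0:
        return {kmer: 0.0 for kmer in ALL_4MERS}
    return {kmer: counts[kmer] / total for kmer in ALL_4MERS}


random.seed(0)
# Hypothetical 200 nt "transcript" drawn uniformly at random.
short_seq = "".join(random.choice("ACGT") for _ in range(200))

freqs = tetranucleotide_frequencies(short_seq)
n_windows = len(short_seq) - 3
n_missing = sum(1 for value in freqs.values() if value == 0.0)

print("windows:", n_windows)
print("4-mers never observed:", n_missing, "of", len(ALL_4MERS))
```

With so few windows, each observed 4-mer moves the profile by about half a percent, so two short fragments from the same genome can easily look different by chance.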

SilasK commented 3 years ago

Here is some more advice from my friends: https://twitter.com/SilasKieser/status/1389573614517866496