jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

longread metagenome plus shortread metatranscriptome data #210

Closed bresyd closed 3 years ago

bresyd commented 3 years ago

Hi,

I would like to thank you for your continuous great work with the SQM pipeline. The latest release has some great additional features that are very relevant to my current work. I have a few questions and would like to get your advice on a couple of things.

I have a new dataset comprising some longread (PacBio) metagenome data plus Illumina shortread (single end) metatranscriptome data (DNA and RNA co-extracted from the same samples). What I would like to do is a coassembly of the longread metagenomes (for which I have already done a few trials using canu and flye), then use the coassembly plus the non-assembled reads in your pipeline, and also include the metatranscriptome samples (only for mapping, not binning).

Here are some of my questions/comments:

  1. great that you now also include flye as a longread assembler, I have really good experiences using flye for metagenomes
  2. --singletons: it is great that you have this option for including unassembled reads. From the longread assemblies that I have done so far I know that many of my genes of interest are present at the read level but do not get assembled into contigs. Hence, without the option of including unassembled reads I would have to use the longread analysis script only. I know that canu outputs a file containing the unassembled reads, but as far as I know flye does not have this option. I am just curious to know how you include the unassembled reads when flye is used as the assembler (are you mapping the reads back and then using the unmapped ones)?
  3. again regarding the --singletons: are they also used for the binning or are they not considered since they would not have proper differential coverage?
  4. last --singletons question: if I do the longread assembly externally, is there a way to provide the assembly plus the unassembled reads to the SQM pipeline?
  5. as I mentioned in the beginning, I have longread metagenome and shortread metatranscriptome data. In your wiki you suggest including the metatranscriptomes as additional samples, and I also think that it would be great to have all the samples (DNA plus RNA) within one SQM run. However, I am wondering about the mapping. Would it be possible to use minimap for the mapping of the longread metagenome samples and bowtie for the shortread metatranscriptomes within one single SQM run?
  6. If I can really use the metagenome and the metatranscriptome data in one run, is there an option to tell SQM which samples to use for the binning? Obviously I would like to only use the longread metagenome data for the binning.

Thanks in advance for your help.

Cheers

jtamames commented 3 years ago

Hello,

Nice to hear that you find our latest upgrade useful. Your opinions are very important for letting us know future directions for improvement. Regarding your particular questions:

  2. Yes, we do it the way you mention. The script 01.remap.pl maps the reads back to the assembly and adds the unmapped ones to the contig file.
  3. Unmapped reads are then treated as "normal" contigs, but are explicitly excluded from the binning steps.
  4. Yes. The remap step is independent of the assembly, so it will run in the same way when an external assembly is provided.
  5. Custom mapping should be possible, but it is not implemented yet. That is an interesting feature for upcoming versions.
  6. Yes, in the samples file add a fourth column and put "nobinning" for the samples you don't want to bin. This is explained in the "samples file" section of the manual.
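As a minimal sketch of point 6: assuming the samples file is tab-separated with the columns sample name, read file, pair label, and an optional fourth column (please check the "samples file" section of the manual for the exact layout), a file for this mixed setup could be generated as follows. The sample names, file names, and the use of "pair1" for single-end libraries are illustrative assumptions, not values taken from this thread.

```python
# Sketch: build samples-file text in which the metatranscriptome samples
# carry "nobinning" in the fourth column.
# Sample names, file names and the "pair1" label are hypothetical placeholders.

def build_samples_file(samples):
    """Return tab-separated samples-file text.

    samples: list of (sample_name, read_file, pair_label, use_for_binning)
    """
    lines = []
    for name, read_file, pair, use_for_binning in samples:
        fields = [name, read_file, pair]
        if not use_for_binning:
            # Fourth column marks samples that should be mapped but not binned.
            fields.append("nobinning")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


samples = [
    ("MG1", "MG1_pacbio.fastq.gz", "pair1", True),     # longread metagenome
    ("MG2", "MG2_pacbio.fastq.gz", "pair1", True),     # longread metagenome
    ("MT1", "MT1_illumina.fastq.gz", "pair1", False),  # shortread metatranscriptome
    ("MT2", "MT2_illumina.fastq.gz", "pair1", False),  # shortread metatranscriptome
]

samples_text = build_samples_file(samples)
print(samples_text, end="")
```

The printed text can be redirected to a file and passed to the pipeline as the samples file; only the metatranscriptome rows get the extra column.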

Hope it helps! Best, Javier

bresyd commented 3 years ago

Hi Javier,

Thanks for your prompt reply. All of what you said makes sense. Regarding point 5: do you have any suggestion/recommendation for a workaround through which I could already use two different mapping tools for the two different types of data within one SQM run? Or do you think I should make separate runs with the two data types and then combine them downstream using SQMtools? Any thoughts are much appreciated.

All the best, Benni

bresyd commented 3 years ago

I was too quick with my reply. I think I could actually just try using bwa mem for both data types, since it was designed to work for both short and long sequences (provided the long reads are of high quality, which my reads are).

Thanks again Benni

jtamames commented 3 years ago

Ok, let us know how it goes

Best,

bresyd commented 3 years ago

sorry to bother you again but there is one more thing I wanted to check: is it possible to tell SQM to only create and include the singletons from a subset of the samples? In my case I would like to include the singletons for the longread metagenome samples but not for the shortread metatranscriptome samples (so basically use the singletons from the same samples that are also used for the assembly but not from the remaining ones).

Thanks again

jtamames commented 3 years ago

Hello,

It will include all sequences. Would it be possible for you to write a simple script that removes short sequences from the resulting fasta? If so, run the project adding -step 1. That will make it stop after the assembly and remap. Then process the fasta file and restart the project.
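A simple script of this kind might look like the sketch below. It assumes that a plain length cutoff is enough to separate the unmapped short reads from the longread contigs and singletons; the 10-base cutoff and the record names are only there to keep the example small, and the location of the fasta file inside the project is not given here, so the sketch works on lines of text rather than on a specific path.

```python
# Sketch: drop sequences shorter than a minimum length from FASTA text.
# The cutoff and the record names below are illustrative assumptions.

def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(lines, min_length):
    """Return FASTA text keeping only records of at least min_length bases."""
    kept = []
    for header, seq in read_fasta(lines):
        if len(seq) >= min_length:
            kept.append(f">{header}\n{seq}\n")
    return "".join(kept)


example = [">contig_1\n", "ACGTACGTAC\n", "GT\n", ">read_7\n", "ACG\n"]
filtered = filter_by_length(example, min_length=10)
print(filtered, end="")
```

For a real run, the same function can be fed an open file handle instead of the example list, with a cutoff chosen above the short-read length and below the shortest longread sequence you want to keep.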

bresyd commented 3 years ago

Hi Javier,

great idea. I will give everything a try and let you know how it goes.

Cheers

jtamames commented 3 years ago

You could even use prinseq to remove these short reads.