nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
861 stars 689 forks

Check for unexpected species contaminants #271

Open cutsort opened 5 years ago

cutsort commented 5 years ago

Especially with low-input non-human samples, a useful QC step would be to screen for unexpected species (e.g. human, E. coli).

olgabot commented 5 years ago

Would using sourmash (https://github.com/dib-lab/sourmash) help reduce the database size, since it subsamples the k-mers from the database using MinHash?
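To make the MinHash subsampling idea concrete, here is a minimal toy sketch. It is not sourmash's implementation or API: the function names, the choice of BLAKE2 as the hash, and the bottom-k estimator are all assumptions made for illustration, and it skips things a real tool would do (such as canonicalising reverse-complement k-mers).

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a stable 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=21, num_hashes=500):
    """Keep only the num_hashes smallest distinct k-mer hashes (bottom-k sketch)."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:num_hashes]


def jaccard_estimate(sketch_a, sketch_b, num_hashes=500):
    """Estimate Jaccard similarity of the underlying k-mer sets from two sketches."""
    # Bottom-k sketch of the union, then count how many of those hashes
    # are present in both input sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:num_hashes]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

The point for database size is that each reference genome is stored as at most `num_hashes` integers instead of its full k-mer set, and similarity is estimated from those small sketches alone.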

drpatelh commented 5 years ago

I suspect a k-mer based approach is likely to be the most efficient way of screening for contaminants, since it lets you pass a large and relatively unbiased database of organisms.

@olgabot Have there been any comparisons between kraken2 and sourmash? They seem to have similar applications. kraken2 is written in C++ and is quite fast, whereas sourmash is written in Python, with some optimisation I'm assuming?

apeltzer commented 5 years ago

I use kraken2 in the nf-core/bacass pipeline to check for potential contamination before doing the assembly. That works quite nicely, though you still need to specify a database, of course.

olgabot commented 5 years ago

Some information about Sourmash: https://github.com/dib-lab/sourmash/issues/725

And their paper: https://f1000research.com/articles/8-1006

d4straub commented 4 years ago

If (in addition to rRNA removal as suggested in #227) an optional step were added that removed all reads from a particular species, e.g. human, then this pipeline might also be able to efficiently analyze metatranscriptomics from human samples.

olgabot commented 4 years ago

Very interesting! What would you describe as the best way to do host removal?

On Wed, Sep 18, 2019, 16:48 Daniel Straub notifications@github.com wrote:

If (in addition to rRNA removal as suggested in #227, https://github.com/nf-core/rnaseq/issues/227) an optional step were added that removed all reads from a particular species, e.g. human, then this pipeline might also be able to efficiently analyze human metatranscriptomics.

d4straub commented 4 years ago

I work on metagenomics of environmental samples rather than human samples, so my ideas should be taken with a little caution.

The simplest solution would be to map reads against the host genome (using the mapper of choice) and forward all unmapped reads for analysis. However, non-host sequences that are similar to the host could be lost in the process. This could be minimized by using strategies such as Kraken2 on relevant references (e.g. human + bacteria, removing all reads that are annotated as human) or DIAMOND (e.g. on the whole of Ensembl).

Here is an example where contaminant reads were removed by bowtie mapping (in the tool KneadData) to focus on the endogenous E. coli strain.
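The Kraken2-based strategy described above (classify against host plus other references, then drop whatever is annotated as host) could be sketched roughly as follows. This is only an illustration under stated assumptions: it expects Kraken2's default tab-separated per-read output (classified flag, read ID, taxonomy ID, ...) with numeric taxonomy IDs, it only drops reads assigned exactly to the given IDs (not to ancestor or descendant taxa), and the function names and toy reads are made up for the example.

```python
import io

HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def host_read_ids(kraken_lines, host_taxids=frozenset({HUMAN_TAXID})):
    """Collect IDs of reads that were classified to one of host_taxids."""
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in host_taxids:
            ids.add(read_id)
    return ids


def filter_fastq(fastq_in, fastq_out, drop_ids):
    """Copy 4-line FASTQ records whose read ID is not in drop_ids; return count kept."""
    kept = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0].strip():
            break  # end of file (or trailing blank line)
        read_id = record[0][1:].split()[0]  # drop '@' and any description
        if read_id not in drop_ids:
            fastq_out.writelines(record)
            kept += 1
    return kept


# Tiny in-memory demonstration with made-up reads and classifications.
kraken_output = [
    "C\tread1\t9606\t50\t9606:16\n",  # assigned to human, will be dropped
    "C\tread2\t562\t50\t562:16\n",    # assigned to E. coli, kept
    "U\tread3\t0\t50\t0:16\n",        # unclassified, kept
]
fastq_text = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
    "@read3\nGGCC\n+\nIIII\n"
)
filtered = io.StringIO()
kept = filter_fastq(io.StringIO(fastq_text), filtered, host_read_ids(kraken_output))
print(kept)
print(filtered.getvalue())
```

Keeping unclassified reads, as done here, is a design choice: it avoids discarding novel non-host sequences, at the cost of letting through host reads that the database failed to classify.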