Add reference-based short read removal to the anvi'o snakemake metagenomics workflow

ShaiberAlon commented 6 years ago

I chose the following approach: after quality filtering, each fastq file would be:

mapped using bowtie2 to the references in references-for-removal.txt
output sam file would be converted to a bam file, and will be either deleted at the end of the process or saved (depending on what the user specified in their config file).
get a list of reads using samtools view BAM_FILE | cut -f 1 and remove the reads from the fastq file (maybe using iu-remove-ids-from-fastq (if this is addressed: https://github.com/merenlab/illumina-utils/issues/19)).

Some notes on configurations and other internal debates between Alon and himself

By default the contamination will be removed using the human genome, and there will be a command to setup the human genome in a known (i.e. known to anvi'o) location (maybe anvio/data/misc/human_genome/?). Otherwise, the user could provide a references-for-removal.txt with one line and two columns: reference name, path to reference fasta file. That way if you use mouse metagenomes for example you could provide a mouse reference genome for contamination removal. Multiple contamination removal references would be allowed. We will add an example/examples to our tutorial of how to use this feature with the human reference genome GRCh38 ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz.

The results of the contamination removal (if it was done) would be added to the qc-report.txt. Currently, the final number of reads is in the column total pairs passed/total pairs passed (percent of all pairs). So now total pairs passed will still show the final number of reads (so after QC AND after contamination removal), and total pairs passed QC will show the number of reads that passed QC. In addition, there would be two more columns: number of pairs removed due to reference contamination removal, and number of pairs removed due to reference contamination removal (percent of all pairs)

The no-remove flag, if set to true in the config file, will make anvio count the number of reads that mapped to the references-for-removal, but it will not remove the short reads from the fastq file. The motivation for using this flag is if you want to use this information for a better relative abundance estimation, but you don't want to risk damaging your assembly, AND you don't mind paying the price of having to assemble reads from contamination.

xvazquezc commented 6 years ago

That's pretty much what DeconSeq does. It should be done prior assembly. I always use Anvi'o after assembly so not sure about its placement in the workflow

ppflrs commented 6 years ago

Hi! I would prefer that anvio read by default contamination_reference.txt but it's just my bias of working with non-human related metagenomes.

Another option to generate "cleaned" fastq files could be the use of the bowtie2 flags --un and --un-conc (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).

meren commented 6 years ago

I agree with @ppflrs. It is best if the genome(s) for contamination removal uses one or more user-defined FASTA files for mapping-based X removal.

I put an X there because I am not sure if whether that X should be short reads or assembled contigs. I am not 100% sure if removing short reads based on their similarity against a reference genome before the assembly is a good idea, and whether it would have a negative impact on the assembly. I may be wrong about this. On the other hand working with contigs would require great care so a long contig is not removed due to some short but significant alignment to a genome.

meren commented 6 years ago

We could change the title to "Add reference-based short read removal to the anvi'o snakemake metagenomics workflow". Because it doesn't have to be host, and it doesn't have to be contamination either :)

ShaiberAlon commented 6 years ago

Hi,

Thank you @ppflrs, and @xvazquezc for your responses! I accept the suggestion of always requiring a contamination_reference.txt (which from here on I will call references-for-removal.txt, until someone comes up with a better name). As for doing multiple references and not just one, no problem too.

using contigs instead of short reads

@meren, I see the advantage of doing this kind of thing after the assembly. The disadvantage is that we risk having the assembly step "waste" a lot of time assembling something that we might consider as contamination.

If we do this kind of filtering on the assembly then I think we should still include these contigs for the mapping/profiling, since part of the motivation to include this "reference-based short read removal" is so that we can have better estimates for relative abundance. What we could do is create a REFERENCES_FOR_REMOVAL collection, and in it we will have an automatically generated bin for each of the references. That way, we can later use the coverage information of these bins for relative abundance estimations.

Alon's bottom line

I'm currently inclined to implement the short read removal before assembly, and add another configuration so that the user could allow the mapping to be done just for the sake of estimation, and without actually removing anything from the fastq file (so that assembly is not hurt, but you can still improve relative abundance estimation).

I admit that a major part of my inclination to start with this, is because this is straight forward to implement, while the removal of contigs after assembly seems to me like something that requires more delicate handling, and hence maybe we should keep that for a future enhancement of the workflow.

meren commented 6 years ago

I'm currently inclined to implement the short read removal before assembly

I am OK with this, and I agree that it might be useful to simplify the assembly step (having this would also help us do experiments regarding to what extent this optional step impacts the assembly of target populations).

Keeping the actual number of original reads for better relative abundance estimations is a good idea. I perhaps wouldn't complicate this with additional configurations. if there is a file for references for read removal, the workflow would store the actual number of reads, and then remove short reads from the input data, and wouldn't look back ever again.

brymerr921 commented 6 years ago

I was looking at this link by Brian Bushnell recently. He describes some improvements over just mapping to a reference genome (in this case, human). I'm not sure this would fit within the workflow, but some of the arguments are compelling. Specifically, sequences conserved within humans and other eukaryotes may be filtered out by mapping to the human genome.

meren commented 6 years ago

Thank you, @brymerr921! Fair enough :)

ShaiberAlon commented 6 years ago

This is done, and the new details are in the tutorial: http://merenlab.org/2018/07/09/anvio-snakemake-workflows/#reference-based-short-read-removal

Thanks for your suggestions! If you have any additional remarks, please let me know.

alienzj commented 5 years ago

Hello, @ShaiberAlon , I have a question I would like to ask you.

bwa mem -t {threads} {params.prefix} {input.reads} |
tee >(samtools flagstat -@{threads} - > {output.flagstat}) |
tee >(samtools fastq -@{threads} -N -f 12 -F 256 -1 {output.reads[0]} -2 {output.reads[1]} -) |
samtools sort -@{threads} -o {output.bam} - 2>{log}

https://github.com/ohmeta/metapi/blob/dev/metapi/rules/rmhost.smk#L41 Is the above remove host sequence shell correct? Glad you can help me ：）

ShaiberAlon commented 5 years ago

Hi @alienzj ,

I am sorry for my delay in response. Unfortunately, I don't have enough familiarity with the specific samtools commands you are using to evaluate your command line.

In general what we do is map reads to the human genome and we then remove every read that maps to the human genome. If that's what your command does, then I it would be similar to what we do.

alienzj commented 5 years ago

@ShaiberAlon Thanks for your reply. Enjoy pipeline~

merenlab / anvio