Closed ShaiberAlon closed 6 years ago
That's pretty much what DeconSeq does. It should be done prior to assembly. I always use anvi'o after assembly, so I'm not sure about its placement in the workflow.
Hi! I would prefer that anvi'o read contamination_reference.txt by default, but that's just my bias from working with non-human-related metagenomes.
Another option to generate "cleaned" fastq files could be the use of the bowtie2 flags --un and --un-conc (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).
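As a rough illustration of the --un-conc idea, here is a minimal sketch that only builds the bowtie2 command line for paired-end input. The function name, the file names, and the choice to discard the alignments themselves are assumptions made for illustration, not anything anvi'o does. Note that --un-conc keeps pairs that failed to align concordantly, which is more permissive than dropping every pair with any hit to the reference.

```python
def bowtie2_unaligned_command(index_prefix, r1, r2, out_prefix, threads=4):
    """Build a bowtie2 call that writes non-concordantly-aligned pairs to FASTQ.

    All argument names are hypothetical. bowtie2 replaces '%' in the
    --un-conc path with 1 or 2 to name the per-mate output files.
    """
    return [
        "bowtie2",
        "-p", str(threads),          # number of alignment threads
        "-x", index_prefix,          # index built beforehand with bowtie2-build
        "-1", r1,                    # mate 1 reads
        "-2", r2,                    # mate 2 reads
        "--un-conc", f"{out_prefix}_R%.fastq",  # pairs that did not align concordantly
        "-S", "/dev/null",           # assumption: the SAM output is not needed here
    ]
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a machine where bowtie2 is installed; for single-end data the analogous flag would be --un.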
I agree with @ppflrs. It is best if contamination removal relies on one or more user-defined FASTA files for mapping-based X removal.
I put an X there because I am not sure whether that X should be short reads or assembled contigs. I am not 100% sure that removing short reads based on their similarity to a reference genome before the assembly is a good idea, or whether it would have a negative impact on the assembly. I may be wrong about this. On the other hand, working with contigs would require great care so that a long contig is not removed due to some short but significant alignment to a genome.
We could change the title to "Add reference-based short read removal to the anvi'o snakemake metagenomics workflow". Because it doesn't have to be host, and it doesn't have to be contamination either :)
Hi,
Thank you @ppflrs, and @xvazquezc for your responses!
I accept the suggestion of always requiring a contamination_reference.txt (which from here on I will call references-for-removal.txt, until someone comes up with a better name). As for doing multiple references and not just one, no problem either.
@meren, I see the advantage of doing this kind of thing after the assembly. The disadvantage is that we risk having the assembly step "waste" a lot of time assembling something that we might consider as contamination.
If we do this kind of filtering on the assembly then I think we should still include these contigs for the mapping/profiling, since part of the motivation to include this "reference-based short read removal" is so that we can have better estimates for relative abundance. What we could do is create a REFERENCES_FOR_REMOVAL collection, and in it we will have an automatically generated bin for each of the references. That way, we can later use the coverage information of these bins for relative abundance estimations.
I'm currently inclined to implement the short read removal before assembly, and add another configuration so that the user could allow the mapping to be done just for the sake of estimation, and without actually removing anything from the fastq file (so that assembly is not hurt, but you can still improve relative abundance estimation).
I admit that a major part of my inclination to start with this is that it is straightforward to implement, while the removal of contigs after assembly seems to me like something that requires more delicate handling, and hence maybe we should keep that for a future enhancement of the workflow.
I'm currently inclined to implement the short read removal before assembly
I am OK with this, and I agree that it might be useful to simplify the assembly step (having this would also help us do experiments regarding to what extent this optional step impacts the assembly of target populations).
Keeping the actual number of original reads for better relative abundance estimations is a good idea. I perhaps wouldn't complicate this with additional configurations. If there is a file of references for read removal, the workflow would store the actual number of reads, then remove short reads from the input data, and never look back.
I was looking at this link by Brian Bushnell recently. He describes some improvements over just mapping to a reference genome (in this case, human). I'm not sure this would fit within the workflow, but some of the arguments are compelling. Specifically, sequences conserved within humans and other eukaryotes may be filtered out by mapping to the human genome.
Thank you, @brymerr921! Fair enough :)
This is done, and the new details are in the tutorial: http://merenlab.org/2018/07/09/anvio-snakemake-workflows/#reference-based-short-read-removal
Thanks for your suggestions! If you have any additional remarks, please let me know.
Hello, @ShaiberAlon , I have a question I would like to ask you.
bwa mem -t {threads} {params.prefix} {input.reads} |
tee >(samtools flagstat -@{threads} - > {output.flagstat}) |
tee >(samtools fastq -@{threads} -N -f 12 -F 256 -1 {output.reads[0]} -2 {output.reads[1]} -) |
samtools sort -@{threads} -o {output.bam} - 2>{log}
https://github.com/ohmeta/metapi/blob/dev/metapi/rules/rmhost.smk#L41 Is the above shell command for removing host sequences correct? I would be glad for your help :)
Hi @alienzj ,
I am sorry for the delay in my response. Unfortunately, I don't have enough familiarity with the specific samtools commands you are using to evaluate your command line.
In general, what we do is map reads to the human genome and then remove every read that maps to it. If that's what your command does, then it would be similar to what we do.
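To make the "map, then remove every read that maps" idea concrete, here is a minimal Python sketch. It is not the anvi'o implementation; the function names are made up for illustration, and it assumes uncompressed SAM text lines plus four-line FASTQ records whose IDs match the SAM read names (optionally with a trailing /1 or /2).

```python
def mapped_read_ids(sam_lines):
    """Return the names of reads that mapped (SAM FLAG bit 0x4 not set)."""
    ids = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:               # skip malformed or empty lines
            continue
        if not int(fields[1]) & 0x4:      # 0x4 means "segment unmapped"
            ids.add(fields[0])
    return ids


def remove_reads(fastq_lines, ids_to_remove):
    """Return the four-line FASTQ records whose read ID is not in ids_to_remove."""
    kept, record = [], []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            read_id = record[0][1:].split()[0]   # drop '@' and any description
            if read_id.endswith(("/1", "/2")):   # assumption: old-style mate suffix
                read_id = read_id[:-2]
            if read_id not in ids_to_remove:
                kept.append(record)
            record = []
    return kept
```

For paired-end data the same ID set would be applied to both the R1 and R2 files, so that a pair is dropped whenever either mate maps.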
@ShaiberAlon Thanks for your reply. Enjoy pipeline~
I chose the following approach: after quality filtering, each fastq file would be mapped to the references listed in references-for-removal.txt. The IDs of the reads in the resulting BAM file would then be collected with
samtools view BAM_FILE | cut -f 1
and those reads removed from the fastq file (maybe using iu-remove-ids-from-fastq, if this is addressed: https://github.com/merenlab/illumina-utils/issues/19).
Some notes on configurations and other internal debates between Alon and himself
By default the contamination will be removed using the human genome, and there will be a command to set up the human genome in a known (i.e. known to anvi'o) location (maybe anvio/data/misc/human_genome/?). Otherwise, the user could provide a references-for-removal.txt with one line per reference and two columns: reference name, path to reference fasta file. That way, if you use mouse metagenomes, for example, you could provide a mouse reference genome for contamination removal. Multiple contamination removal references would be allowed. We will add one or more examples to our tutorial of how to use this feature with the human reference genome GRCh38 (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz).
The results of the contamination removal (if it was done) would be added to the qc-report.txt. Currently, the final number of reads is in the columns total pairs passed and total pairs passed (percent of all pairs). So now total pairs passed will still show the final number of reads (so after QC AND after contamination removal), and total pairs passed QC will show the number of reads that passed QC. In addition, there would be two more columns: number of pairs removed due to reference contamination removal, and number of pairs removed due to reference contamination removal (percent of all pairs).
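The two-column references file described above could be read along these lines. This is only a sketch: the function name is hypothetical, and it assumes the file is TAB-delimited with no header line, neither of which is specified in this issue.

```python
import csv


def parse_references_for_removal(lines):
    """Map reference name -> FASTA path from two-column, TAB-delimited lines.

    `lines` can be an open file handle or any iterable of strings.
    Assumption: no header line and exactly two columns per row.
    """
    references = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or all(not cell.strip() for cell in row):
            continue                                  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        name, path = (cell.strip() for cell in row)
        if name in references:
            raise ValueError(f"duplicate reference name: {name}")
        references[name] = path
    return references
```

With a file on disk this would be called as parse_references_for_removal(open("references-for-removal.txt")), giving one entry per reference to map against.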
The no-remove flag, if set to true in the config file, will make anvi'o count the number of reads that mapped to the references-for-removal, but it will not remove the short reads from the fastq file. The motivation for using this flag is if you want to use this information for a better relative abundance estimation, but you don't want to risk damaging your assembly, AND you don't mind paying the price of having to assemble reads from contamination.
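The arithmetic behind the proposed qc-report.txt columns, including the effect of no-remove, might look like the sketch below. The function and parameter names are made up for illustration. It also assumes that, with no-remove set, the "removed" columns still report the number of mapped pairs even though nothing is taken out of the fastq files; the issue does not spell this out.

```python
def qc_report_columns(total_pairs, pairs_passed_qc, pairs_mapped, no_remove=False):
    """Compute the proposed report columns for one sample (hypothetical helper)."""
    def percent(n):
        # Percent of all input pairs, guarding against an empty sample.
        return round(100.0 * n / total_pairs, 2) if total_pairs else 0.0

    # With no-remove, mapped pairs are counted but stay in the fastq files.
    final_pairs = pairs_passed_qc if no_remove else pairs_passed_qc - pairs_mapped
    removal = "number of pairs removed due to reference contamination removal"
    return {
        "total pairs passed": final_pairs,
        "total pairs passed (percent of all pairs)": percent(final_pairs),
        "total pairs passed QC": pairs_passed_qc,
        removal: pairs_mapped,
        removal + " (percent of all pairs)": percent(pairs_mapped),
    }
```

For example, 1000 input pairs with 900 passing QC and 90 mapping to the references would leave 810 pairs by default, or 900 pairs with no-remove set.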