linsalrob / ComputationalGenomicsManual

Rob's manual for the computational genomics and bioinformatics class.
https://linsalrob.github.io/ComputationalGenomicsManual/
MIT License

Filtering host reads #5

Open chrissy005 opened 2 months ago

chrissy005 commented 2 months ago

Hello, I was trying the following commands, as you described, to filter out host sequences:

host sequences:

    mkdir host not_host
    samtools fastq -F 3588 -f 65 output.bam | gzip -c > host/output_S_R1.fastq.gz
    echo "R2 matching host genome:"
    samtools fastq -F 3588 -f 129 output.bam | gzip -c > host/output_S_R2.fastq.gz

sequences that are not host:

    samtools fastq -F 3584 -f 77 output.bam | gzip -c > not_host/output_S_R1.fastq.gz
    samtools fastq -F 3584 -f 141 output.bam | gzip -c > not_host/output_S_R2.fastq.gz
    samtools fastq -f 4 -F 1 output.bam | gzip -c > not_host/output_S_Singletons.fastq.gz

I am new to samtools and do not understand the -F and -f flags as well as the integers that follow them. Do these determine which sequences are host and non-host?

bartns commented 1 week ago

From samtools fastq help:

  -f, --require-flags INT
               only include reads with all  of the FLAGs in INT present [0]
  -F, --excl[ude]-flags INT
               only include reads with none of the FLAGs in INT present [0x900]
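
In other words, each read in the BAM file carries an integer FLAG, and the two options are bitwise tests against it. A minimal Python sketch of that test, assuming the behaviour is exactly what the help text describes (`passes_filter` is a hypothetical helper name, not part of samtools):

```python
def passes_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Mimic the -f / -F logic described in the samtools help text.

    flag    : the FLAG value of one read in the BAM file
    require : the INT given to -f (all of these bits must be set)
    exclude : the INT given to -F (none of these bits may be set)
    """
    has_all_required = (flag & require) == require
    has_no_excluded = (flag & exclude) == 0
    return has_all_required and has_no_excluded


# FLAG 99 = paired (1) + proper pair (2) + mate reverse (32) + first in pair (64),
# i.e. a mapped R1 read. It passes "-f 65 -F 3588", so it lands in host/.
print(passes_filter(99, require=65, exclude=3588))

# FLAG 77 = paired (1) + unmapped (4) + mate unmapped (8) + first in pair (64).
# It fails "-f 65 -F 3588" because bit 4 (unmapped) is excluded,
# but it passes "-f 77 -F 3584", so it lands in not_host/.
print(passes_filter(77, require=65, exclude=3588))
print(passes_filter(77, require=77, exclude=3584))
```

So yes, in these commands the flags decide host versus non-host, but only because output.bam was made by aligning the reads to the host genome: mapped reads count as host, unmapped reads as not host.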

And this might help you out regarding the SAM flag values:

https://www.samformat.info/sam-format-flag

And I like to use this one: https://broadinstitute.github.io/picard/explain-flags.html
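
If you would rather decode the integers yourself, each one is just a sum of the bit values defined in the SAM specification. A small sketch (the `decode` helper is a hypothetical name, and the short labels are my own choice, following the names that `samtools flags` prints):

```python
# Bit values from the SAM specification, lowest bit first.
SAM_FLAGS = [
    (1, "PAIRED"),           # read is paired
    (2, "PROPER_PAIR"),      # read mapped in a proper pair
    (4, "UNMAP"),            # read unmapped
    (8, "MUNMAP"),           # mate unmapped
    (16, "REVERSE"),         # read on reverse strand
    (32, "MREVERSE"),        # mate on reverse strand
    (64, "READ1"),           # first in pair
    (128, "READ2"),          # second in pair
    (256, "SECONDARY"),      # secondary alignment
    (512, "QCFAIL"),         # fails quality checks
    (1024, "DUP"),           # PCR or optical duplicate
    (2048, "SUPPLEMENTARY"), # supplementary alignment
]


def decode(value: int) -> list[str]:
    """Return the names of all flag bits that are set in value."""
    return [name for bit, name in SAM_FLAGS if value & bit]


# The integers used in the commands from the question.
for value in (3588, 65, 129, 3584, 77, 141, 4, 1):
    print(value, ",".join(decode(value)))
```

For example, 3588 is 4 + 512 + 1024 + 2048, so `-F 3588` drops reads that are unmapped, QC-failed, duplicates, or supplementary alignments, while `-f 65` (1 + 64) keeps only paired reads that are first in the pair.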

For the manual's sake, it might be nicer to use the full option names (--require-flags, --exclude-flags).