mbhall88 / rasusa

Randomly subsample sequencing reads or alignments

MIT License

207 stars 17 forks source link

Generating only half of the target depth #64

Closed npdungca closed 9 months ago

npdungca commented 1 year ago

I used rasusa to subsample nanopore reads (200 to 600bp) by depth but I noticed that the observed average depths are usually half of the target depth. I used samtools depth to get the mean depth. Would you have an idea as to why this happens?

sample

for i in 19; do #depth for l in 5 10 20 30 100 400; do ./rasusa -i $datadir/barcode${i}duplex.fastq.gz \ --coverage ${l} --genome-size 243724 -s 100 \ -o $outdir/BC${i}${l}x.fq.gz

./rasusa -i $datadir/BC${i}_duplexCMVmapped.fastq.gz \ --coverage ${l} --genome-size 243724 -s 100 \ -o $outdir/BC${i}${l}x_mapped.fq.gz

Target depth	Observed mean depth
5	2.92247
10	5.05997
20	9.52989
30	14.0369
100	46.1916
400	183.38

Appreciate your help.

mbhall88 commented 1 year ago

How are you getting your observed mean depth? Is that from aligning reads to a reference genome?

If so, rasusa just subsamples the reads without any knowledge or care for whether those reads are from your genome of interest. So, for example, if you have 100 reads and only 50 map to your reference, it's quite probable that the resulting subsampled fastq will only have half the reads that align to the reference.

Does that make sense?

npdungca commented 1 year ago

I would use samtools depth -r [reference] to generate the observed mean depth.

The reads were subsampled from host-subtracted fastq.

mbhall88 commented 1 year ago

samtools depth requires a SAM/BAM/CRAM file as input and the -r option takes a BED file, not a reference genome?

My earlier comment still stands...

npdungca commented 1 year ago

-r Specify a region in chr or chr:from-to syntax So the identifier of the target reference (chr) was specified.

Thank you for your help.

mbhall88 commented 1 year ago

Ah yes. But again, it requires SAM as input. I'll refer back to me previous comment/question https://github.com/mbhall88/rasusa/issues/64#issuecomment-1738272793

How are you getting your observed mean depth? Is that from aligning reads to a reference genome?

If so, rasusa just subsamples the reads without any knowledge or care for whether those reads are from your genome of interest. So, for example, if you have 100 reads and only 50 map to your reference, it's quite probable that the resulting subsampled fastq will only have half the reads that align to the reference.

Does that make sense?

mbhall88 commented 9 months ago

Closing due to inactivity