Is estimating enrichment with playback possible by calculating remaining bases from unblocked reads?

Dear Lukas,

In an earlier issue you mentioned that actual enrichment with playback is not possible due to the current being immutable. However, I would still like to estimate the enrichment somehow with playback.

Regarding enrichment, you also mentioned that possible enrichment can be thought of as the time you save by not sequencing things that you don't want.

With this in mind, I came up with a method to estimate enrichment by calculating the remaining bases of unblocked reads from adaptive sampling. I am not certain that this is the correct way to estimate enrichment, so I would like your thoughts on it.

I will explain how I intend to calculate the remaining bases.

When running readfish stats, you can demultiplex the reads into `*_proceed.fastq.gz`, `*_stop_receiving.fastq.gz` and `*_unblock.fastq.gz` files, in my case for control and hum_test.

Since I could not find any documentation on how these files are made, I assume `hum_test_unblock.fastq.gz` contains all complete reads that were supposedly "unblocked" (complete, since I used playback).

Additionally, while readfish is running, the output of individual reads is written to `live_reads.fq`:

    # Fastq output for individual reads
    debug_log = "live_reads.fq"

I noticed that `live_reads.fq` contains multiple fragments with the same read IDs as the complete reads in `hum_test_unblock.fastq.gz`. So I assume that the complete "unblocked" reads were segmented into chunks in `live_reads.fq`, with the end of the first segment marking the point at which the rejection signal was sent.

If my assumptions are correct, I should be able to estimate enrichment by calculating the remaining bases of each "unblocked" read as:

remaining bases of the "unblocked" read = length of the complete "unblocked" read (from `hum_test_unblock.fastq.gz`) - length of the first segment of the corresponding read (from `live_reads.fq`)

Subsequently, the average remaining bases per "unblocked" read can be calculated by summing this value over all "unblocked" reads and dividing by the number of "unblocked" reads.
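
To make the calculation concrete, here is a minimal Python sketch of what I have in mind. It rests entirely on my assumptions above: that the first record for a given read ID in `live_reads.fq` is the first segment, and that both files are plain four-line FASTQ records. The function names `first_record_lengths`, `remaining_bases` and `mean_remaining` are my own and are not part of readfish.

```python
import gzip


def first_record_lengths(path):
    """Map read ID -> sequence length, keeping the FIRST record seen per ID.

    Assumes four lines per FASTQ record; transparently reads .gz files.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = {}
    with opener(path, "rt") as handle:
        for header in handle:
            if not header.strip():
                continue  # skip stray blank lines
            seq = next(handle, "").strip()
            next(handle, "")  # '+' separator line
            next(handle, "")  # quality line
            read_id = header[1:].split()[0]
            # setdefault keeps the first occurrence; for live_reads.fq this is
            # ASSUMED to be the segment seen when the unblock decision was made.
            lengths.setdefault(read_id, len(seq))
    return lengths


def remaining_bases(full_fastq, live_fastq):
    """Per-read remaining bases: complete read length minus first segment length.

    Reads missing from live_fastq are skipped. Values can be negative if the
    two files were basecalled differently; they are not clamped here.
    """
    full = first_record_lengths(full_fastq)
    first_segment = first_record_lengths(live_fastq)
    return {
        read_id: full[read_id] - first_segment[read_id]
        for read_id in full
        if read_id in first_segment
    }


def mean_remaining(remaining):
    """Average remaining bases per read, or 0.0 if there are no reads."""
    return sum(remaining.values()) / len(remaining) if remaining else 0.0


# Usage with the file names from this post:
# rem = remaining_bases("hum_test_unblock.fastq.gz", "live_reads.fq")
# print(mean_remaining(rem))
```

If the first record per read ID is not actually the segment that triggered the rejection, only the `setdefault` line would need to change.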

Since the average number of remaining bases could indicate the time saved during adaptive sampling, can it also be used as an estimate of enrichment?

If you notice that any of my assumptions are wrong, please let me know and, if possible, advise me on another way to estimate enrichment with playback.

Thanks in advance
