goldman-gp-ebi / BOSS-RUNS

Dynamic, adaptive sampling during nanopore sequencing
GNU General Public License v3.0
26 stars 6 forks source link

No improved enrichment for ch10 compared to control with BOSS-RUNS using prom bulk data #8

Open DanteV19 opened 3 days ago

DanteV19 commented 3 days ago

Dear developer,

I have been having some unexpected results from BOSS-RUNS with my own prom bulk file (5khz) for playback. I would like to ask some questions to make sense of it.

The bulk file was generated from Promethion during a seq run with expression data in the form of cDNA.

I tried to enrich for chr10 to test whether the BOSS-RUNS adaptive sampling performed well. See below for the toml file I used.

# Basecaller configuration
[caller_settings.dorado]
config = "dna_r10.4.1_e8.2_400bps_5khz_hac"
address = "ipc:///tmp/.guppy/5555"
debug_log = "live_reads.fq"

# Aligner Configuration
[mapper_settings.mappy_rs]
fn_idx_in = "/data/GCF_000001405.40/human_ref.mmi"
debug_log = "live_alignments.paf"
n_threads = 4

[[regions]]
name = "hum_test"
min_chunks = 1 
max_chunks = 4 
targets = ["chr10"] 
single_on = "stop_receiving" 
multi_on = "stop_receiving"   
single_off = "unblock"        
multi_off = "unblock"         
no_seq = "proceed"            
no_map = "proceed"           
above_max_chunks = "unblock"
below_min_chunks = "proceed" 

[[regions]]
name = "control"
control = true
min_chunks = 0
max_chunks = 2
targets = []
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "stop_receiving"
multi_off = "stop_receiving"
no_seq = "stop_receiving"
no_map = "stop_receiving"
above_max_chunks = "stop_receiving"
below_min_chunks = "stop_receiving"

However, the results from the readfish stats command showed less enrichment for ch10 for the _humtest run (Median read lengths=626b, N50=2.08Kb) than the control run (Median read lengths=1.12Kb, N50=2.49Kb). Besides, looking at the N50 of all other chromosomes for _humtest, the non-target reads do not seem to be unblocked.

For the complete readfish stats output from the BOSS-RUNS run see the file below. readfish_stats_22oct_boss_ch10_9q_own_bulk_2h.txt

I have also done an adaptive sampling run with Readfish itself using the same toml and prom bulk file and this time the readfish stats results did show the non-target reads being unblocked. readfish_stats_21oct_prom_ch10_9q_own_bulk_2h.txt But Readfish's results also showed less enrichment for ch10 in _humtest run (Median read lengths=835b, N50=1.68Kb) than in control (Median read lengths=1.13Kb, N50=2.53Kb). My expectation was that I should have had higher enrichment for ch10 in _humtest than the control condition for both BOSS-RUNS and Readfish and that BOSS-RUNS would have unblocked the non-target reads. Is there an explanation why there would be less enrichment for ch10 in _humtest than in control for both BOSS-RUNS and Readfish?

The readfish stats results also indicated more reads were being basecalled with Readfish (1473829 reads) than with BOSS-RUNS (1067543 reads) according to readfish stats results. Is the difference in sequencing data because of the stochastic nature of reads being reprocessed from the prom bulk file or are there other factors?

Additionally, I noticed the [[regions]] configurations from the readfish toml on github of BOSS-RUNS being slightly different from those of the toml on Readfish github. While BOSS-RUNS's toml configurations look like this:

[[regions]]
name = "bossruns"
min_chunks = 0
max_chunks = 2
targets = []
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "unblock"
no_map = "unblock"
above_max_chunks = "stop_receiving"
below_min_chunks = "proceed"

Readfish has the following differences in configuration:

no_seq = "proceed"
no_map = "proceed"
above_max_chunks = "unblock"

Why are the [[regions]] configurations from the toml different between Readfish and BOSS-RUNS?

Since BOSS-RUNS and Readfish were only tested with genomic data I have some questions regarding the appliance on expression data. Do you expect the adaptive sampling by either BOSS-RUNS or Readfish to be just as effective for expression data? If yes, is it sufficient to use just a bed file with the gene coordinates of the genes where the transcripts originate from? Should I make additional adjustments besides adding the bed file for the gene coordinates?

Please let me know if you need more information.

Thanks in advance

W-L commented 11 hours ago

Hi Dante,

Thanks for getting in touch. I'll try to address your questions and add some additional comments here and there.

The bulk file was generated from Promethion during a seq run with expression data in the form of cDNA.

Since your bulk file was generated from cDNA, read lengths will be relatively short (compared to genomic DNA). Recall that we always need to sequence some initial bases to map the read to the reference before a decision about unblocking can be made. That means for short reads the gains from adaptive sampling will be decreased, since reads might have finished going through the pore before they can be unblocked. We have not tested the minimum necessary read lengths, or whether RNA/cDNA sequencing will in general benefit from adaptive sampling if most transcripts are, say <2kb long.

However, the results from the readfish stats command showed less enrichment ... looking at the N50 of all other chromosomes for hum_test, the non-target reads do not seem to be unblocked.

For both the readfish and the bossruns experiments it looks like the median read lengths of off-target regions are shorter than for on-target, so reads are marked for unblocking. In the summary at the top it's 1.13 Kb | 527 b and 1.15 Kb | 756 b respectively. It does seem that the unblocked reads from bossruns are a bit longer than using readfish alone, which indicates some longer processing times during which more sequence passes through the pore. That probably indicates that the compute is not keeping up or bossruns is trying to use too many threads.

Is there an explanation why there would be less enrichment for ch10 in hum_test than in control for both BOSS-RUNS and Readfish?

If by enrichment you mean the yield given in the readfish stats, those numbers are not meaningful for playback sequencing, since unblocking is not actually performed. Instead a read that's intended to be unblocked during playback will continously be unblocked until it ends its translocation in the original recording, if that makes sense. So yield can not really be measured during playback.

The readfish stats results also indicated more reads were being basecalled with Readfish (1473829 reads) than with BOSS-RUNS (1067543 reads) according to readfish stats results. Is the difference in sequencing data because of the stochastic nature of reads being reprocessed from the prom bulk file or are there other factors?

The first part here might have to do with the previous point, i.e. readfish performed more or faster unblocking (possibly repeated unblocking of the same original read) and therefore has more reads to basecall than bossruns. The playback procedure is not stochastic, it just replays whatever was recorded originally, i.e. all recorded reads go through the pore exactly the same time after the start of the experiment as originally.

Why are the [[regions]] configurations from the toml different between Readfish and BOSS-RUNS?

Setting this configuration is up to the experimental goals. The individual options what to perform at each event are described in https://github.com/LooseLab/readfish/blob/main/docs/toml.md The differences you mention:

no_seq = "proceed"              -> if a read does not produce any sequence during basecalling we get a bit more sequence. This should not apply to many reads in practice.
no_map = "proceed"              -> if a read does not map we get a bit more sequence and try again. Setting 'unblock' is more drastic, but will save more time.
above_max_chunks = "unblock"    -> once we've seen the max number of chunks (and still don't have a decision) we unblock it in this case. Should hopefully also not apply to many reads.

Do you expect the adaptive sampling by either BOSS-RUNS or Readfish to be just as effective for expression data?

I partially addressed that above. To add, you can think of the possible enrichment as the time you save by not sequencing things that you don't want. If it takes, say 300-400nt, of sequence to figure out if we want a read, plus some time to send that decision to the sequencer, plus the (constant) time used to eject reads, then the entire process might take the equivalent time to (or in the worst case longer than) sequencing the read in the first place. I think in a scenario with long transcripts and a well tuned adaptive sampling system (minimal basecalling and processing times) there might be an advantage to it, but not in the same way as for long genomic DNA.

Hope this helps!