Making sense of simulation duration vs sequencing duration

DepledgeLab commented 1 year ago

Hello,

This is a really lovely tool and an elegant solution to the problem of sequencing short reads by DRS. We have been testing it in our lab by performing 18-24 hr sequencing runs, saving the bulk output, and then running the modified (short) analysis using the method you describe.

Initially, we set the simulation run time to match the original run (e.g. 24 hrs), but we then started testing longer simulation run times (e.g. 72 hrs) and observed a huge increase in the number of reads obtained (e.g. from 500k at 24 hrs to 1,500k at 72 hrs, even though the original run was only 18 hrs). Do you have any idea why this happens, and did you observe the same for your own datasets?

enovoa commented 1 year ago

Hi @DepledgeLab, thanks for sharing your thoughts and comments :)
Most of the benchmarking simulations that we performed were with tRNAs. However, if the simulation is done on samples that also include longer RNA reads, my guess is that RNA molecules might get partly degraded over time (as they are kept at 34°C or so), which could explain a higher proportion of shorter reads being captured as the run duration increases.

However, what you mention above is slightly different - are you referring to running a 72h simulation on an 18h sequencing run?

lpryszcz commented 1 year ago

Hi @DepledgeLab, to be honest I have never run simulations for longer than the original sequencing run.

I guess MinKNOW may restart the simulation from the beginning once it passes the end of the bulk file (otherwise it would have to throw some end-of-file error). Your numbers actually suggest this: you got 500k reads from a 24h simulation and 1,500k from a 72h one (thus precisely 3x more reads from a 3x longer simulation). Note that the additional reads are most likely redundant, although they may have different read IDs and even slightly different squiggles, because MinKNOW's definition of read start and end is somewhat stochastic: we always got slightly different results, even when starting the simulation from the same device on the same computer!
Thus, I'd strongly recommend running the simulation only for 18h (the runtime of the original sequencing).
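
To make the arithmetic behind this guess explicit, here is a minimal Python sketch. It only checks whether the reported read counts scale linearly with simulated time and how many passes over the bulk file each simulation would correspond to if the file were replayed from the start. The function names are hypothetical, and both the constant-throughput assumption and the replay behaviour are guesses, not documented MinKNOW behaviour.

```python
# Hypothetical helpers; nothing here calls MinKNOW or any ONT API.

def expected_reads(reads_observed, hours_observed, hours_target):
    """Scale a read count linearly with simulated time.

    Assumes constant throughput, which is only an approximation.
    """
    return reads_observed * hours_target / hours_observed


def bulk_file_passes(simulation_hours, original_run_hours):
    """Passes over the bulk file, ASSUMING the simulation restarts
    from the beginning whenever it reaches the end of the file."""
    return simulation_hours / original_run_hours


# Numbers reported in this thread.
original_run_hours = 18
reads_24h = 500_000
reads_72h = 1_500_000

# Linear scaling from the 24h simulation to 72h.
predicted_72h = expected_reads(reads_24h, 24, 72)

# Passes over the 18h bulk file under the replay assumption.
passes_24h = bulk_file_passes(24, original_run_hours)
passes_72h = bulk_file_passes(72, original_run_hours)

print(f"predicted reads at 72h: {predicted_72h:,.0f} (reported: {reads_72h:,})")
print(f"passes over bulk file: 24h -> {passes_24h:.2f}, 72h -> {passes_72h:.2f}")
```

If the replay guess is right, even the 24h simulation runs past the end of an 18h bulk file, so some of its 500k reads would already be repeats.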

Anyway, please reach out to ONT or community regarding this matter, since those are only guesses.

Hope it helps!

novoalab / Nano-tRNAseq

Making sense of simulation duration vs sequencing duration #6