hasindu2008 / squigulator

a tool for simulating nanopore raw signal data
https://hasindu2008.github.io/squigulator
MIT License
58 stars 3 forks source link

Keep simulating the first batch in "--full-contigs" mode #1

Closed youyupei closed 1 year ago

youyupei commented 1 year ago

Hi @hasindu2008,

This is Yupei. I am trying to simulate some reads with squigulator and I found it really useful and surprisingly easy to use. But I found something not super right when I do the simulation:

I have manually created ~20M perfect reads in fasta format (let's call it template.fa) and was trying to add some errors to them. I tried to run squigulator to simulator a squiggle for each read in template.fa using the command as follows:

squigulator template.fa  -o out_signal.blow5 --full-contigs

This gave me ~20M squiggles which match the number of input reads. But later I realised that squigulator was repeatedly simulating squiggles for the first 1000 recodes in the template.fa. I think this is related to the batch size (-K) and I tested to set -K 1 and it kept simulating the first read. Do you think this is something that should be fixed or have I done anything wrong?

Also, may I have another question: Do you have any suggestion if we want to tune the error rate (after basecalling) up/down by changing some parameters when running squigulator?

Regards, Yupei

hasindu2008 commented 1 year ago

Hi youyupei Thanks for finding this bug. You are right, just had a look through the code and I have forgotten to save the index of the generated read before going to the next batch. I will fix this tomorrow and let you know.

By the way, how did you generate the ~20M perfect reads? Did you randomly sample from a reference genome?

To get different error rates, you may vary the --dwell-std parameters, which is the amount of warping in the time domain. An increase in this warping increases the errors in the basecalls [see the analysis here https://github.com/hasindu2008/squigulator/blob/main/docs/dna.md#parameter-exploration].
I could also add an option to change the noise in the amplitude domain of the signal, if you want to try that.

hasindu2008 commented 1 year ago

@youyupei Check now, if it works as intended, please.

youyupei commented 1 year ago

Thanks very much! @hasindu2008 I will test it soon and let you know. Also, thanks for the suggestion about how to tune the quality of the simulated squiggles. I will definitely try to alter the --dwell-std. I guess it would also be helpful to have the option to change the noise in the amplitude domain of the signal as you mentioned.

Regarding your question about how did generate perfect reads. In my case, I have to add some specific barcodes to the reads, so I have to generate these perfect reads manually. I randomly sample from a reference transcript and add the barcodes.

hasindu2008 commented 1 year ago

I added the --amp-noise to the r10 branch https://github.com/hasindu2008/squigulator/tree/r10, but haven't yet tested if it works as intended (plan to test sometime this week). You could give it a try if you wish. Also, be a bit mindful of any anomalies when using this development branch as it has not been tested much.

Yes. that is the way to go. Squigulator can sample reads randomly from a given transcriptome but does not have a feature to attach barcodes. The --prefix option can only attach the adaptor and a random lengthed polyA. So manually making the reads is the easiest way. However, one limitation of simulating under the --full-contigs is that it loads the whole set of reads in your template.fa file to memory (I haven't yet implemented a separate code path, instead piggybacked into the already implemented reference loader). If template.fa is pretty large, it will need a huge amount of RAM. I intend to implement this properly at some point, but let me know if you hit a memory issue, so I can expedite this.

youyupei commented 1 year ago

Hi @hasindu2008, thanks so much for fixing the bug. I haven't tested the r10 branch, but the main branch is working well. Thanks very much!