gymreklab / chips

Simulation tool for ChIP- and other -seq experiments
GNU General Public License v3.0

How many reads? #66

Closed: ivargr closed this issue 1 year ago

ivargr commented 1 year ago

Hi!

I'm trying to use chips for simulating chip-seq reads. My input is a set of simulated peaks (simulated by other software, so that I know the peak position), and both the genome size and number of peaks vary greatly.

I'm having trouble finding parameters for chips that produce meaningful simulated reads. Specifically, I have no idea what to set --numreads to. I assume the number of reads should ideally be a function of genome size and number of peaks, and I had hoped that the software would be able to work out how many reads to simulate. Do you have any suggestions on how to set the number of reads?

I'm also unsure about --frac. Is there a way to compute what --frac should be based on my input peaks? I'm guessing maybe number of peaks x fragment size / genome size, but I'm not sure if that is correct.

Pandaman-Ryan commented 1 year ago

Hi @ivargr

Thank you for your interest in our work.

"--numreads" can be any number you would like to use; the default is 1,000,000. This parameter mirrors the fact that you need to specify how many reads you would like to generate when using a sequencer. If the reads are concentrated in the peak regions (i.e. --spot is high), you can estimate numreads as (total length of peak regions) * (expected read depth) / (read length). Otherwise, you can start with a smaller --numreads value, inspect the read depth at the peaks, and scale up the --numreads parameter accordingly.
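
As a rough illustration of the estimate above, here is a minimal Python sketch. The function name `estimate_numreads` and its arguments are hypothetical and not part of chips. Dividing by `spot` is an added assumption: it treats --spot as the fraction of reads that land in peaks, so that peaks still reach the target depth when --spot is below 1. With `spot=1.0` it reduces to the formula given above.

```python
import math


def estimate_numreads(total_peak_bp, target_depth, read_length, spot=1.0):
    """Rough --numreads estimate (hypothetical helper, not part of chips).

    total_peak_bp: summed length of all peak regions, in base pairs
    target_depth:  desired average read depth inside peaks
    read_length:   length of each simulated read, in base pairs
    spot:          assumed fraction of reads falling in peaks (0 < spot <= 1);
                   dividing by it is an assumption, not documented chips behaviour
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 < spot <= 1:
        raise ValueError("spot must be in (0, 1]")
    # Reads needed to cover the peak regions at the target depth.
    reads_in_peaks = total_peak_bp * target_depth / read_length
    # Scale up so that the in-peak share of all reads still meets that depth.
    return math.ceil(reads_in_peaks / spot)


if __name__ == "__main__":
    # Example: 1 Mb of peaks, 30x depth, 100 bp reads, all reads in peaks.
    print(estimate_numreads(1_000_000, 30, 100))
    # Same peaks, but assuming only half of the reads fall in peaks.
    print(estimate_numreads(1_000_000, 30, 100, spot=0.5))
```

The result is only a starting point; as noted above, it is still worth inspecting the read depth at the peaks and adjusting --numreads from there.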

"--frac" is the fraction of the genome that is bound. The length of peaks is computed with the input file specified by "-t" and the length of the genome is computed with the input file specified by "-f". If you are not sure what "--frac" to use but have example reads available, you can use our "chips learn" function to compute "--frac" and other parameters for you.
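
If no example reads are available for "chips learn", the definition above (bound length divided by genome length) can be approximated directly from the peak coordinates. The sketch below is one way to do that in Python, under stated assumptions: `bound_fraction` is a hypothetical helper, peaks are given as BED-style half-open `(chrom, start, end)` tuples, and overlapping peaks are merged so that no base is counted twice. Whether chips treats overlaps the same way is not confirmed here.

```python
def bound_fraction(peaks, genome_size):
    """Fraction of the genome covered by peaks (hypothetical helper).

    peaks:       iterable of (chrom, start, end), BED-style half-open intervals
    genome_size: total genome length in base pairs
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")

    # Group intervals by chromosome so overlaps are only merged within one.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))

    covered = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged interval.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current interval and start a new one.
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start

    return covered / genome_size


if __name__ == "__main__":
    example_peaks = [
        ("chr1", 100, 300),
        ("chr1", 200, 400),  # overlaps the previous peak
        ("chr2", 0, 100),
    ]
    # 300 bp on chr1 (after merging) + 100 bp on chr2, over a 10 kb genome.
    print(bound_fraction(example_peaks, 10_000))
```

This is close to the "number of peaks x fragment size / genome size" guess from the question, except that it uses the actual peak lengths and does not double-count overlapping peaks.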

Please let me know if you have any other questions.

Cheers, -A

ivargr commented 1 year ago

Thanks a lot for the explanation, things make much more sense now.

I think one of the problems I had was not specifying --spot. When I increased the number of input simulated peaks without changing --spot, the peaks got too few reads.