rrwick / Badread

a long read simulator that can imitate many types of read problems
GNU General Public License v3.0

Multi-threading #7

Open baraaorabi opened 4 years ago

baraaorabi commented 4 years ago

Is your feature request related to a problem? Please describe.

The simulator is very slow when it comes to

Both of these steps should have straightforward data parallelism.

Describe the solution you'd like

Multithreading of the two steps (and possibly others?)
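A minimal sketch of what that data parallelism could look like in Python, assuming each read can be generated independently from its own seed. `make_tasks`, `simulate_read` and `simulate_reads_parallel` are hypothetical names for illustration and are not Badread functions. Worker processes are used rather than threads because CPU-bound Python code generally does not get faster with threads.

```python
import random
from multiprocessing import Pool


def make_tasks(n_reads, length, base_seed=0):
    """Build one (seed, length) task per read so every task is independent."""
    return [(base_seed + i, length) for i in range(n_reads)]


def simulate_read(task):
    """Hypothetical stand-in for generating one read; not Badread's code."""
    seed, length = task
    rng = random.Random(seed)  # per-task generator, so workers share no state
    return "".join(rng.choice("ACGT") for _ in range(length))


def simulate_reads_parallel(n_reads, length, processes=2, base_seed=0):
    """Map the independent tasks over a pool of worker processes."""
    tasks = make_tasks(n_reads, length, base_seed)
    with Pool(processes=processes) as pool:
        return pool.map(simulate_read, tasks)  # results keep task order


if __name__ == "__main__":
    reads = simulate_reads_parallel(n_reads=8, length=50)
    print(len(reads), "reads generated")
```

Giving each task its own seed keeps the output reproducible regardless of how the tasks are split across workers.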

Describe alternatives you've considered

Adding a program command to prepare the reference contigs and pickle the results so rerunning won't be slow. That won't really resolve the read generation speed thu
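The pickling alternative could be sketched roughly as follows. This is only an illustration of the caching idea: `load_or_prepare` and the `prepare` callable are hypothetical and do not correspond to any existing Badread command or function.

```python
import pickle
from pathlib import Path


def load_or_prepare(cache_path, prepare):
    """Return the cached result if it exists, otherwise compute and cache it.

    `prepare` is a hypothetical zero-argument callable that does the slow
    reference preparation and returns a picklable object.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse the result of an earlier run.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    # Slow path: do the work once, then store it for later runs.
    result = prepare()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result
```

A cache like this only helps repeated runs on the same reference, and it should only be loaded from files you created yourself, since unpickling untrusted data is unsafe.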

Additional context

I am building a wrapper around Badread for transcriptomic reads. It's still in the design stage. I plan to code the multithreading described above on a separate branch and make PR

W-L commented 3 years ago

For others stumbling across this issue, here's a little Snakemake template that mimics multi-threading by running Badread multiple times and concatenating the FASTQ files at the end.

threads = list(range(10))
genome = "genome.fa"

# example to pass through parameters
rlen_mean = 15000
rlen_sd = 13000
sim_params = {"rlen": f"{rlen_mean},{rlen_sd}"}

rule all:
    input:
        expand("reads_{t}.fq", t=threads),
        "sim_reads.fq"

# run badread simulate multiple times on the same input genome
rule badread_sim:
    input: genome
    output: "reads_{t}.fq"
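    params:
        rlen = lambda wildcards: sim_params['rlen'],
        # badread simulate requires --quantity; "5x" per job is only an example value
        quantity = "5x"
    shell:
        "badread simulate --reference {input} --quantity {params.quantity} --length {params.rlen} >{output}"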

# afterwards simply concatenate all output read files
rule concat_sim:
    input: expand("reads_{t}.fq",t=threads)
    output: "sim_reads.fq"
    shell:
        "cat {input} > {output}"

Just saw that there is already a wiki entry for doing exactly the same thing in bash. Anyway, maybe this is still useful for someone.

jsgounot commented 2 years ago

Before anyone does what I did and blindly follows W-L's answer, note that doing so will on some occasions generate the same read name multiple times. This might affect your pipeline, especially if you clean your reads later, since minimap2 does not care whether multiple reads share a name and will just map them individually, leading to secondary / chimeric alignments.
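A quick way to check a concatenated file for this before mapping is to count the read names. The sketch below assumes a plain, uncompressed FASTQ with exactly four lines per record; `fastq_read_names` and `duplicated_names` are illustrative helpers, not part of Badread or minimap2.

```python
from collections import Counter


def fastq_read_names(handle):
    """Yield the read name (first token of the header) of each FASTQ record.

    Assumes exactly four lines per record, with no blank lines in between.
    """
    for index, line in enumerate(handle):
        if index % 4 != 0:
            continue  # skip sequence, '+' and quality lines
        if not line.strip():
            continue  # tolerate a trailing blank line
        if not line.startswith("@"):
            raise ValueError(f"expected a header on line {index + 1}")
        fields = line[1:].split()
        yield fields[0] if fields else ""


def duplicated_names(handle):
    """Return {name: count} for every read name seen more than once."""
    counts = Counter(fastq_read_names(handle))
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as fastq:
        for name, n in duplicated_names(fastq).items():
            print(f"{name}\t{n}")
```

Run it with the path to the concatenated FASTQ as the only argument; no output means no read name occurs more than once.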

mbhall88 commented 1 year ago

@jsgounot did you get the same read name multiple times? If so, you should buy a lottery ticket as the read names are generated with uuid https://github.com/rrwick/Badread/blob/09fb3082e5b2530c4e17e20e262ff227eb28ff13/badread/simulate.py#L77
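To illustrate the two cases: names drawn with `uuid.uuid4()` should essentially never collide, whereas UUID-style names built from a seeded random generator repeat exactly whenever two runs share a seed. The second function is a hypothetical sketch, and whether any run in this thread actually shared a seed is not established here.

```python
import random
import uuid


def uuid4_names(n):
    """Names from uuid4: random on every call, collisions are negligible."""
    return [str(uuid.uuid4()) for _ in range(n)]


def seeded_names(n, seed):
    """UUID-style names from a seeded generator (hypothetical illustration).

    Two calls with the same seed return exactly the same names.
    """
    rng = random.Random(seed)
    return [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n)]


if __name__ == "__main__":
    print(len(set(uuid4_names(1000))), "distinct uuid4 names out of 1000")
    print(seeded_names(3, seed=1) == seeded_names(3, seed=1))
```

If separate runs are seeded explicitly, giving each run a different seed avoids the second case.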

jsgounot commented 1 year ago

I know but I'm not as lucky with the lottery sadly ...