huishenlab / biscuit

BISulfite-seq CUI Toolkit
Other
16 stars 7 forks source link

Remove GNU parallel dependency #33

Closed semenko closed 1 year ago

semenko commented 1 year ago

Use bash builtins instead of GNU parallel.

This is slightly faster than the parallel implementation.

Output identical, compared with diff -bur ../biscuit-qc/ ../biscuit-qc-patch-1/

semenko commented 1 year ago

Hey @jamorrison is it useful for me to updated this — or should I close this PR?

(Don't see a good reason to use parallel at all here.)

jamorrison commented 1 year ago

@semenko I think this was a good update. I noticed this after I started the release process for BISCUIT v1.2.0 though, so it'll go into the next release of BISCUIT. Thanks for the PR!

semenko commented 1 year ago

Thanks! Does this latest version (1.2.0) auto-push to bioconda? (Exciting changes!)

jamorrison commented 1 year ago

I have a PR in testing for bioconda at the moment. Should be available in a day or two, review and approval dependent.

jamorrison commented 1 year ago

Hi @semenko The latest version of biscuit (1.2.0) is available on bioconda now.

bounlu commented 1 year ago

Hi,

I am curious to know why bash builtin background processing should be faster than GNU parallel?

If not slower, I would still go with GNU parallel as it has better capabilities and handling with parallel execution although this is not very much applicable in this case.

semenko commented 1 year ago

For a few reasons, parallel was forcing extra work. For example this:

https://github.com/huishenlab/biscuit/pull/33/files#diff-03f6544f78af96f7e16c293b67a23d876cd5a1d585d61658675c2b282a6c552dL136 image

parallel was running samtools view, then we running it again multiple times for quality filters (and other options) not implemented by bedtools commands.

Plus no need for extra dependencies :P

bounlu commented 1 year ago

Ahh I see, sorry that I missed the extra code. This is because parallel requires an input sequence to work with. However, this can be tricked like here to make it equivalent to background processing you implemented.

And I would not really call awk and parallel as extra dependencies as they are almost always built-in ;)