qqwang-berkeley / JUM

A tool for annotation-free differential analysis of tissue-specific pre-mRNA alternative splicing patterns
MIT License

Adding threads to JUM_2-3.sh job submission #15

Closed binasu closed 6 years ago

binasu commented 6 years ago

Hi,

Thank you so much for this great tool and the very intuitive instructions on how to use it. I'm on the JUM_2-3.sh running step and it's taking over 8 hours to run. I was wondering if there is any way to add threads to the JUM_2-3.sh step?

Thank you, Bina

qqwang-berkeley commented 6 years ago

JUM_2-3.sh is a Linux bash script, and bash doesn't really support multi-threading. However, bash does allow multi-processing (running several processes at the same time), which gives a similar speed-up.
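For readers unfamiliar with multi-processing in bash, here is a minimal sketch of the general idea: start one background process per input file with `&`, then `wait` for all of them. This is not the actual JUM code. The functions `process_sample` and `run_all_in_background` are hypothetical names, and the per-file work (counting alignment records) is only a placeholder.

```shell
# Sketch only: process_sample is a hypothetical stand-in for the per-file
# work done inside JUM_2-3.sh; it is not the actual JUM code.
process_sample() {
    local sam="$1"
    # Placeholder work: count alignment records (SAM header lines start with '@').
    awk '!/^@/ { n++ } END { print n + 0 }' "$sam" > "${sam%.sam}.count"
}

# Start one background process per input file, then wait for all of them.
run_all_in_background() {
    local pids="" status=0 sam pid
    for sam in "$@"; do
        process_sample "$sam" &   # '&' runs this file in its own process
        pids="$pids $!"           # remember the process ID of that job
    done
    for pid in $pids; do
        wait "$pid" || status=1   # report failure if any job failed
    done
    return "$status"
}
```

A call such as `run_all_in_background sample1.sam sample2.sam` (hypothetical file names) would process both files at the same time and return only once both have finished.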

I will adjust the script so that it does that. Will upload the newest *.sh scripts within a day or two.

You are right that JUM_2-3.sh is the script that takes the longest in the whole package. I am currently working on a big update for speed and a more user-friendly output format. The big update will be posted by mid-July for sure.

Qingqing

On Mon, Jun 25, 2018 at 11:05 AM, binasu notifications@github.com wrote:

Hi,

Thank you so much for this great tool and the very intuitive instructions on how to use it. I'm on the JUM_2-3.sh running step and it's taking over 8 hours to run. I was wondering if there is any way to add threads to the JUM_2-3.sh step?

Thank you, Bina

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/qqwang-berkeley/JUM/issues/15, or mute the thread https://github.com/notifications/unsubscribe-auth/AZPn2zvDAODiKweqTxKWW-MqON45cuOcks5uASZ6gaJpZM4U2mqf .

binasu commented 6 years ago

Hi Qingqing,

Thank you for your prompt reply! I look forward to the update, and thanks for all your hard work. Bina

qqwang-berkeley commented 6 years ago

Hi Bina,

I have the modified JUM_2-3.sh ready. I was thinking about emailing it to you directly for your current needs. I am adding more stuff to it (like reporting time elapsed etc.) before I post JUM version 1.3.13 with it officially.

So here it is. Just copy the attached JUM_2-3.sh into your current JUM script folder (alongside JUM_2-1.sh, JUM_3.sh, etc.), replacing the previous version, and it should be good to go.

It should be significantly faster, as it now processes all the sam files in parallel. For users with a lot of input files, it should decrease the running time significantly. Do you have a lot of input files, or do you have super deep sequencing data?

Let me know how it goes :) I am also going to add more parallel processing steps to it using GNU parallel. I am still testing part of that, but it should come out soon.

Feel free to ask any questions :)

Qingqing

On Thu, Jun 28, 2018 at 12:18 PM binasu notifications@github.com wrote:

Hi Qingqing,

Sorry to bother you, but has the script been updated by any chance? And, if so, where do I download it from, and do I just upload it into the JUM folder?

Thank you, Bina

binasu commented 6 years ago

Hi Qingqing,

Thank you so much for your quick work! We have super deep sequencing data but we have several samples so I’m sure this will help with the time the runs take cumulatively. I’ll keep you posted.

Thank you, Bina

qqwang-berkeley commented 6 years ago

If you have super deeply sequenced datasets, then hang in there. By the end of the day I will send you another version of JUM_2-3.sh that uses multi-threading to process the sam files, which should make that time-consuming step in JUM_2-3.sh significantly faster. You will be able to choose the number of threads and the memory allocated to each thread. I hope you have a super computer cluster :)

I just need a little bit more time to make sure that the results do not change from utilizing threading. Almost there...
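One way to run this kind of check is sketched below, under stated assumptions: the old and new runs each write a plain-text result file, and the order of lines may differ between runs. The helper names `sorted_checksum` and `outputs_match` are hypothetical and not part of JUM.

```shell
# Sketch only: hypothetical helpers, not part of JUM.
# Checksum of a text file's lines after sorting, so line order is ignored.
sorted_checksum() {
    LC_ALL=C sort "$1" | cksum
}

# Succeeds when two result files hold the same lines, in any order.
# Matching checksums are strong evidence of equality, not a formal proof.
outputs_match() {
    [ "$(sorted_checksum "$1")" = "$(sorted_checksum "$2")" ]
}
```

For example, `outputs_match old_run/result.txt new_run/result.txt && echo "results unchanged"` (hypothetical paths) would print the message only when both runs produced the same set of lines.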

binasu commented 6 years ago

Great! Thank you so much. We are using a super computer cluster ;)

qqwang-berkeley commented 6 years ago

I am still testing one running-time optimization step in the JUM_2-3.sh code. I will get back to you tomorrow morning. Once validated, it should cut the running time to less than a quarter of the previous time. Stay tuned...

On Thu, Jun 28, 2018 at 1:25 PM binasu notifications@github.com wrote:

My email is bina.sugumar@mail.utoronto.ca. I'll delete this post when you're done

qqwang-berkeley commented 6 years ago

Hi Bina,

Attached is the newly upgraded JUM_2-3.sh script. Three major changes:

1) All sam files from your input are now processed in parallel.
2) Multi-threading is applied to all sam/bam file processing, with a user-chosen thread number.
3) The step that counts and quantifies intron retention events is significantly optimized in running time.

In my hands, this script cuts the original >10 hr run on eight samples to ~1 hr, using two threads. I hope you will be pleasantly surprised by the speed in your hands too :)

Note, when you run JUM_2-3.sh, you need to provide another parameter at the very end, the thread number. For example:

bash /user/home/JUM_1.3.12/JUM_2-3.sh /user/home/JUM_1.3.12 5 2 5 100 2

This parameter sets how many threads are used to process each sam file. I would suggest choosing it based on how many cores your CPUs have. Keep in mind that all your input sam files are processed simultaneously, so if you set the thread parameter to 2 and you have 8 input samples, you are in fact using 2*8=16 threads on your computer cluster. Choose the parameter so that the total number of threads stays below the total number of cores you have in your CPUs.
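The arithmetic above can be sketched as two small shell helpers. These are illustrative only: `total_threads` and `max_threads_per_sample` are hypothetical names, not part of JUM, and the number of available cores is something you have to supply yourself.

```shell
# Sketch only: hypothetical helpers, not part of JUM.
# Total threads in use = thread parameter * number of input samples.
total_threads() {
    echo $(( $1 * $2 ))
}

# Largest thread parameter that keeps the total within the available cores.
# Assumes at least one sample. Never returns less than 1, so with more
# samples than cores the total will still exceed the core count.
max_threads_per_sample() {
    local cores="$1" samples="$2" threads
    threads=$(( cores / samples ))
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}
```

With the numbers from the example, `total_threads 2 8` prints 16, and `max_threads_per_sample 16 8` prints 2. On many systems `getconf _NPROCESSORS_ONLN` reports the number of online cores, but on a shared cluster the number of cores your job was actually allocated may be smaller.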

Let me know how it goes. Thank you so much for the feedback and the suggestion to make JUM faster! I will later update the other .sh scripts the same way and post version 1.3.13 online.

Qingqing
