Threads or processor options for slide_variants function

kwonej0617 commented 1 year ago

Hi, thank you for providing such a useful tool. I ran EpiNano on my dataset and found that the Slide_Variants step took a very long time to finish. Is there a threads or processor option that would reduce the processing time and produce the output faster?

Thank you!

Huanle commented 1 year ago

Hi @kwonej0617 ,

As far as I can remember, this step is quite fast. Can you tell me the size of your input file and the command line you used?

Best, Huanle

kwonej0617 commented 1 year ago

@Huanle Thank you for your response.

The BAM files for which Slide_Variants took a long time are 937 MB and 1.4 GB. They took about 96 hours.

Thank you.

Huanle commented 1 year ago

Thanks @kwonej0617. Would you mind sharing the input file to slide_variants with me?

kwonej0617 commented 1 year ago

Sure. You can use the following link to download the data. Please let me know if you are unable to access it. https://drive.google.com/drive/folders/1PlxuD0YLRN-U6tU4mHqagUGNyJ3OrBkM?usp=drive_link

The file is gzipped, but I used the decompressed file when I ran slide_variants.

Thank you so much for your help.

Huanle commented 1 year ago

I have requested access to the data, but the request has not been approved yet. Could you approve it so that I can move forward?

Cheers - Huanle

kwonej0617 commented 1 year ago

@Huanle I changed the access permission. Could you please try again with the link below? https://drive.google.com/file/d/1jwl56Q1WhhXUuRFvhO8x3Apmf9fHGa0D/view?usp=sharing

Thank you so much for your help!

doshirLV commented 1 year ago

Hello EpiNano developers,

I am also wondering about this. Processing the BAM file with $EPINANO_HOME/Epinano_Variants.py is fairly fast (within a day), but running Slide_Variants afterwards to get k-mers from the plus- and minus-strand sample_strand.per.site.csv files takes about 2 days each.
My command is: python /path/to/EpiNano/misc/Slide_Variants.py sample.minus_strand.per.site.csv 5
The input CSV files are about 500-600 MB each.
Is there any way to speed this up by providing more cores?
Please let me know at your earliest convenience.

Thank you for the assistance, Raj
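
Nothing in this thread indicates that Slide_Variants.py accepts a threads option. Because the plus- and minus-strand files are handled by separate invocations, one possible workaround is to launch both at the same time. The sketch below assumes only the command shape quoted above (script path, input CSV, window size); the helper names `build_command` and `run_all` are hypothetical and not part of EpiNano. It does not make a single run faster, it only overlaps independent runs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(script, per_site_csv, window=5):
    """Build the command line in the form used in this thread:
    python Slide_Variants.py <per.site.csv> <window size>."""
    return [sys.executable, script, per_site_csv, str(window)]


def run_all(script, csv_files, window=5, max_workers=2):
    """Run one independent process per input file, at most max_workers at a time.

    Returns a dict mapping each input file to its process exit code.
    Threads are sufficient here because the work happens in child processes.
    """
    def run_one(path):
        return subprocess.run(build_command(script, path, window), check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(csv_files, pool.map(run_one, csv_files)))


# Example call (paths are placeholders, as in the command above):
# run_all("/path/to/EpiNano/misc/Slide_Variants.py",
#         ["sample.plus_strand.per.site.csv", "sample.minus_strand.per.site.csv"])
```

A non-zero exit code in the returned dict means that run failed and its output should not be trusted.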

Huanle commented 1 year ago

Hi Raj,

Sorry for the late reply. I have been occupied with other tasks, but I am working on it now.

Best regards, Huanle

Huanle commented 1 year ago

@kwonej0617 I have committed a new version of slide_variants.py. Once it is accepted by the owner of this repo, you can give it a go.

enovoa commented 1 year ago

Can you please test the new slide_variants script from EpiNano 1.2.3? Thanks

doshirLV commented 12 months ago

Hello Huanle,

Thank you for your help. The new version has also been running for 2 days now. I started Slide_Variants.py on Wednesday morning and it is still running as of Friday afternoon (PT). I used the same command as before. Let me know if there is an option to specify the number of threads/cores, or if there is anything else I can do to speed this up.

Much appreciated, Raj

Huanle commented 12 months ago

Hi @doshirLV ,

I do not think the script was successfully committed to GitHub, so I have attached it here. Please rename it to a Python script (.py) before using it. Let me know if you encounter any issues.

Best, Huanle

Slide_Variants.txt

doshirLV commented 12 months ago

Dear Huanle,

This new version is much faster. It completes in less than an hour. The new Slide_Variants.py script should be committed to the EpiNano GitHub repository, since it greatly improves the tool.

One note about the script, though:

On line 10, it mentions that the script was created to replace Epinano_Variants.py but it should say "slide_variants" instead, if I am not mistaken.

Also a question about this version:

There is an additional output file with this new script called sample.plus_strand.per.site.csv.non-consecutive-sites. What is this file for? Is it used in any downstream steps (i.e. Epinano_Predict.py)? Do I need any additional information from this file to do my analysis?

With gratitude, Raj

Huanle commented 12 months ago

Hi @doshirLV ,

Your note is right. The script replaces Slide_Variants, not Epinano_Variants.

Regarding the non-consecutive-sites file: it contains sites that cannot be used to construct windows/k-mers, because there is no information for the positions right next to them. This means you can ignore or delete the file, or comment out the part of the script that generates it.
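
To illustrate why such sites cannot form windows, here is a minimal sketch. It is not the code in Slide_Variants.py; the function name `split_windows` and the exact rule (a window needs every position around its center to be present, and any site covered by no complete window is reported separately) are assumptions made for illustration.

```python
def split_windows(positions, window=5):
    """Split the positions of one reference/strand into complete windows
    and leftover sites that fit into no complete window.

    A window centered on p is complete only if every position from
    p - window//2 to p + window//2 is present in the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    present = set(positions)
    windows = []
    covered = set()
    for center in sorted(present):
        span = range(center - half, center + half + 1)
        if all(p in present for p in span):
            windows.append(tuple(span))
            covered.update(span)
    # Sites that never appear in a complete window lack adjacent neighbours.
    leftovers = sorted(present - covered)
    return windows, leftovers


windows, leftovers = split_windows([10, 11, 12, 13, 14, 15, 20, 22], window=5)
print(windows)    # [(10, 11, 12, 13, 14), (11, 12, 13, 14, 15)]
print(leftovers)  # [20, 22]
```

In this toy input, positions 20 and 22 have no adjacent sites, so no 5-mer can be built around them. Under the assumptions above, they are the kind of rows that would end up in a separate file.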

I will ask Eva to commit the relevant script.

Best, Huanle

novoalab / EpiNano

Threads or processor options for slide_variants function #136
