jade-hpc-gpu / jade-hpc-gpu.github.io

Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning (and this is the repo that powers its website)
http://www.jade.ac.uk/

Very late job progress regarding Cuda/9.0 in JADE #119

Open ece7048 opened 5 years ago

ece7048 commented 5 years ago

I have an issue with batch jobs on JADE. I think there was some kind of update: to load the CUDA 9 module I now have to run "module load cuda/9.0/" instead of "module load cuda/9.0/bin/".

The main issue is that before, when I ran my script for a specific job, it finished in 20 hours at worst. Now the same script reaches only 0.1% of its final progress in 20 hours.

@twinkarma

twinkarma commented 5 years ago

JADE has been quite busy lately during the upgrade, but I agree that the speed seems far too slow. Could you provide more information on the job you're trying to run? Are you able to share your batch script?
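
As a rough illustration of the kind of information that could be collected from inside a job, the sketch below prints which CUDA toolkit and GPU the job environment sees. The helper name `report_cuda_env` is hypothetical and is not part of any JADE tooling; it only assumes that `nvcc` and `nvidia-smi`, if installed, are on `PATH`.

```shell
# Hypothetical helper: print basic facts about the CUDA environment a job sees.
report_cuda_env() {
  echo "node: $(uname -n)"
  # Typically set by the scheduler when GPUs are allocated; "unset" otherwise.
  echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-unset}"
  if command -v nvcc >/dev/null 2>&1; then
    echo "nvcc: $(command -v nvcc)"
  else
    echo "nvcc: not found on PATH"
  fi
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Query only the GPU name and driver version, one line per GPU.
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
      || echo "nvidia-smi: failed to query GPU"
  else
    echo "nvidia-smi: not found on PATH"
  fi
}
```

Calling `report_cuda_env` near the top of a batch script, right after the `module load` lines, would put this information in the job output so that an old and a new run can be compared.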

ece7048 commented 5 years ago

@twinkarma Please find attached the batch script.

Thank you. ES_TL_endo_da.txt

twinkarma commented 5 years ago

There's nothing in the batch script that jumps out at me as to the cause of the slowdown.

Did you change anything between these two jobs? Have you tried running other jobs and have they also been slow?

Do you also have a job ID for the one that was running slowly? If you have an id for the one that ran normally as well that'd be helpful.
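
If the scheduler is Slurm (an assumption here; the thread does not name it), accounting details for a given job ID could be looked up with `sacct`. The wrapper name `job_summary` is hypothetical; the `sacct` options and field names are standard Slurm ones.

```shell
# Hypothetical wrapper around Slurm's sacct: summarise one job by ID.
job_summary() {
  if [ -z "${1:-}" ]; then
    echo "usage: job_summary <job-id>" >&2
    return 2
  fi
  # State, elapsed time and node list help when comparing a slow and a fast run.
  sacct -j "$1" --format=JobID,JobName,State,Elapsed,Start,End,NodeList
}
```

For example, `job_summary <job-id>` for the slow run and for the normal run would show whether both ran on the same nodes and for how long.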

ece7048 commented 5 years ago

Dear @twinkarma, the issue is that the job ran very quickly, as I said, with the previous module, Cuda/9.0/bin. I ran exactly the same batch script with Cuda/9.0/ and it took far too long.

I did change the Anaconda environment, because I had some issues with python3. The error message was:

WARNING: python3/3.6.3 cannot be loaded due to a conflict. HINT: Might try "module unload python3" first.
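
The hint in that warning could be followed roughly as in the sketch below. The helper name `reload_python3` is hypothetical, and the sketch assumes the Environment Modules `module` command is defined in the current shell; it says nothing about whether this conflict is related to the slowdown.

```shell
# Hypothetical helper: unload any loaded python3 module, then load the requested one.
# Usage: reload_python3 python3/3.6.3
reload_python3() {
  if ! type module >/dev/null 2>&1; then
    echo "module command not available in this shell" >&2
    return 1
  fi
  module unload python3   # clear the conflicting module first, as the hint suggests
  module load "$1"        # then load the requested version
  module list             # show what is loaded now, for the job log
}
```

Running `reload_python3 python3/3.6.3` in the batch script before activating the Anaconda environment would also record the final module list in the job output.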

I also tried other jobs. All of them took too long.

Unfortunately I do not have the old job ID. The new one is 418011. However, I have the outputs. Please find attached the old and new job outputs after 20 hours of running: old_job.txt new_job.txt

twinkarma commented 5 years ago

@ece7048 I've reported your issue through the Hartree service portal. You should be able to see it if you log in at https://stfc.service-now.com/hartreecentre and check "Opened issues". Let me know if you can't see it.