jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems

Speed up spawning and Python on HPC #190

Open statiksof opened 4 years ago

statiksof commented 4 years ago

Dear Batchspawner developers/users,

This is not an issue, just a request for best practices/advice. Technically, our JupyterHub (1.1.0) deployment on HPC using batchspawner (1.0.0) works fine. However, there are still some drawbacks that limit how much the service gets used.

  1. Spawning notebooks takes a long time, even when compute nodes have already been allocated to the Slurm job
  2. Loading/importing Python packages is very slow

Is anyone else facing this problem? Is there a way to speed up these processes on HPC?

For information: we use environment modules for Python packages, and jupyterhub-singleuser itself lives in a conda env.

Thank you in advance.

rcthomas commented 4 years ago

What kind of file system is your conda environment installed on, and how is it mounted from your compute nodes?

statiksof commented 4 years ago

@rcthomas We have a Lustre file system, plus CVMFS for software.

rcthomas commented 4 years ago

Some more questions:

statiksof commented 4 years ago
  • Is the conda environment you're testing with on Lustre, or is it on the CVMFS filesystem? And if you have it on both, do you notice a startup performance difference between them, or is it the same story on both?

The module used for jupyterhub-singleuser and the system kernels (Python 3) are both on CVMFS.
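For what it's worth, here is a quick way to double-check which filesystem a given environment actually resolves to (the `df` call is just for illustration):

```python
import os
import subprocess
import sys

# Where does the active interpreter (and therefore the env) actually live?
env_path = os.path.realpath(sys.prefix)
print("sys.prefix resolves to:", env_path)

# `df` reports the filesystem backing that path (Lustre, CVMFS, local disk, ...).
subprocess.run(["df", "-h", env_path], check=True)
```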

  • Are you using JupyterLab or the classic notebook, and have you compared whether you get a different response from one or the other? If JupyterLab, does the progress bar complete but then you are left sitting there while nothing happens for a while, until suddenly the Jupyter moons show and the JupyterLab UI slowly comes to life?

Both; users can select between classic and Lab at start. The response is almost the same, but I noticed that switching between kernels might be faster in JupyterLab. The JupyterLab UI starts normally; the problem is starting the kernel, which is what's annoying.

  • Can you quantify the wait time you're talking about? 30 seconds, or more like 300? Have you compared the timestamp from the Slurm job start (e.g. with sacct) and the first timestamp from jupyterhub-singleuser to get that figure?
    • For your problem 2, can you again quantify how slow "very slow" is, and is this just one Python process, or is it e.g. a multiprocess kind of job, like with mpi4py?

I'll get back to you with screenshots to give you an idea. The main problem is when importing a library for the first time.
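To put a number on the spawn delay, something along these lines should do it (the job ID and single-user log path below are placeholders, not values from our deployment):

```python
import re
import subprocess
from datetime import datetime

# Placeholders -- substitute the real Slurm job ID and the single-user server's log file.
JOB_ID = "1234567"
SINGLEUSER_LOG = "/path/to/jupyterhub-singleuser.log"

# Start time Slurm recorded for the allocation, e.g. "2020-05-01T09:00:03".
out = subprocess.run(
    ["sacct", "-j", JOB_ID, "-X", "-n", "-o", "Start"],
    capture_output=True, text=True, check=True,
)
job_start = datetime.fromisoformat(out.stdout.strip())

# First timestamp that jupyterhub-singleuser wrote to its log.
first_log = None
with open(SINGLEUSER_LOG) as log:
    for line in log:
        match = re.search(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}", line)
        if match:
            first_log = datetime.fromisoformat(match.group(0).replace(" ", "T"))
            break

if first_log is not None:
    print("spawn delay:", first_log - job_start)
```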

rcthomas commented 4 years ago

OK, so, it sounds like you don't have a problem with the time between getting your job allocation from Slurm and jupyterhub-singleuser start up, it's just when you select a kernel that you see things be painfully slow. (Your first issue was about spawning notebooks, which I took to mean spawning jupyterhub-singleuser). It sounds like really there's no Jupyter problem here per se, just Python being slow about loading dependencies from CVMFS? Do you see the same behavior when you try to import packages from your CVMFS-based conda env from just a regular script?
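Something as simple as the sketch below, run with the CVMFS-based environment's interpreter, should tell you whether the slowness has anything to do with Jupyter at all (the package list is just an example):

```python
import importlib
import time

# Example packages -- substitute whatever your users actually import.
for name in ["numpy", "pandas", "xarray"]:
    start = time.perf_counter()
    importlib.import_module(name)
    print(f"import {name}: {time.perf_counter() - start:.1f} s")
```

Running the same imports under `python -X importtime` would also break the cost down per module.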

statiksof commented 4 years ago

OK, so, it sounds like you don't have a problem with the time between getting your job allocation from Slurm and jupyterhub-singleuser start up, it's just when you select a kernel that you see things be painfully slow. (Your first issue was about spawning notebooks, which I took to mean spawning jupyterhub-singleuser).

Yes, this is what I meant by compute nodes already being allocated to the Slurm job. It's more a Python issue; I thought other data centers might be facing similar problems. Here, for example, is a simple import of xarray:

[screenshot: first `import xarray` in a fresh session]

rcthomas commented 4 years ago

Thanks for clarifying. There is a widely-known issue with serving Python software stacks from parallel file systems but it's more of a problem at scale or when the file system metadata servers have a lot of traffic. There are several mitigations like caching metadata on the compute node, moving the software stack to the compute node ram disk or local node storage, etc. The best MD caching-based performance we've seen at our center is from Cray DVS read-only mounting GPFS with client-side caching turned on, but this doesn't match the node-based solutions (including containers).
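As a rough sketch of the node-local staging idea (this assumes the environment was packed ahead of time, e.g. with conda-pack; the tarball path and target directory below are made up), a launcher wrapper could do something like:

```python
import os
import subprocess
import tarfile

# Hypothetical locations: a pre-packed environment (e.g. built with `conda pack`)
# sitting on the parallel file system, staged into node-local ram disk at job start.
PACKED_ENV = "/lustre/envs/jupyter-env.tar.gz"
NODE_LOCAL = "/dev/shm/jupyter-env"

if not os.path.isdir(NODE_LOCAL):
    os.makedirs(NODE_LOCAL)
    with tarfile.open(PACKED_ENV) as tar:
        tar.extractall(NODE_LOCAL)
    # conda-pack ships a conda-unpack script that fixes hard-coded prefixes after extraction.
    subprocess.run([os.path.join(NODE_LOCAL, "bin", "conda-unpack")], check=True)

# Put the staged environment first on PATH and hand off to the single-user server.
os.environ["PATH"] = os.path.join(NODE_LOCAL, "bin") + os.pathsep + os.environ["PATH"]
os.execvp("jupyterhub-singleuser", ["jupyterhub-singleuser"])
```

In a batchspawner setup, staging like this would typically go into the submitted batch script before jupyterhub-singleuser starts.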

I don't know enough about CVMFS to help you, but there are probably other contributors here who can recommend tunings.

statiksof commented 4 years ago

Thanks @rcthomas for your input. We can leave this open for now; maybe you can label it as a discussion.