statiksof opened this issue 4 years ago (status: Open)
What kind of file system is your conda environment installed on, and how is it mounted from your compute nodes?
@rcthomas We have a Lustre file system. We also use CVMFS for software.
Some more questions:
- Is the conda environment you're testing with on Lustre or on the CVMFS filesystem? If you have the conda environment on both, do you notice a startup performance difference between them, or is it the same story on both?
The module used for jupyterhub-singleuser and the system kernels (Python 3) are both on CVMFS.
- Are you using JupyterLab or the classic notebook, and have you compared whether you get a different response from one or the other? In JupyterLab, does the progress bar complete but then you're left sitting there while nothing happens for a while, until suddenly the Jupyter moons show and the JupyterLab UI slowly comes to life?
Both; users can select between classic and Lab at startup. The response is almost the same, but I noticed that switching between kernels might be faster when using JupyterLab. The JupyterLab UI starts normally; the problem is when starting the kernel, which is annoying.
- Can you quantify the wait time you're talking about? 30 seconds, or more like 300? Have you compared the timestamp of the Slurm job start (e.g. with sacct) against the first timestamp from jupyterhub-singleuser to get that figure? (A sketch of that comparison follows after this exchange.)
- For your problem 2, can you again quantify how slow "very slow" is? Is this just one Python process, or is it a multi-process kind of job, e.g. with mpi4py?
I'll get back to you with screenshots to give an idea. The main problem is when importing a library for the first time.
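A minimal sketch of the sacct-versus-log comparison mentioned above (the job ID and the singleuser log location below are assumptions and depend on how the spawner is configured at your site):

```bash
#!/bin/bash
# Hypothetical job ID and log path -- adjust for your site.
JOBID=1234567
LOG="$HOME/jupyterhub_slurmspawner_${JOBID}.log"   # wherever your batch script sends the job's stdout/stderr

# When did Slurm actually start the job?
sacct -j "$JOBID" --format=JobID,Start,Elapsed --noheader

# First timestamps emitted by jupyterhub-singleuser in the job log.
head -n 5 "$LOG"

# The gap between the Slurm start time and the first singleuser log line is the
# spawn overhead, separate from any kernel-start or import time seen in the notebook.
```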
OK, so it sounds like you don't have a problem with the time between getting your job allocation from Slurm and jupyterhub-singleuser starting up; it's just when you select a kernel that things are painfully slow. (Your first issue was about spawning notebooks, which I took to mean spawning jupyterhub-singleuser.) It sounds like there's really no Jupyter problem here per se, just Python being slow to load dependencies from CVMFS? Do you see the same behavior when you try to import packages from your CVMFS-based conda env from a regular script?
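One quick way to check that outside of Jupyter (a sketch, assuming the CVMFS-based env is already activated and that xarray is one of the slow imports) is Python's built-in import profiling:

```bash
# Time the bare import with a cold cache -- no Jupyter, no kernel, just Python.
time python -c "import xarray"

# Break the time down per imported module; the second column is cumulative microseconds.
python -X importtime -c "import xarray" 2> importtime.log
sort -t'|' -k2 -n importtime.log | tail -n 20
```

If the numbers here match what you see at kernel startup, the file system rather than Jupyter is the bottleneck.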
> OK, so it sounds like you don't have a problem with the time between getting your job allocation from Slurm and jupyterhub-singleuser starting up; it's just when you select a kernel that things are painfully slow. (Your first issue was about spawning notebooks, which I took to mean spawning jupyterhub-singleuser.)
Yes, that's what I meant by the compute nodes allocated to the Slurm job. It's more of a Python issue; I thought other data centers were facing similar problems. Here, for example, is a simple import of xarray:
Thanks for clarifying. There is a widely known issue with serving Python software stacks from parallel file systems, but it's more of a problem at scale or when the file system's metadata servers see a lot of traffic. There are several mitigations, like caching metadata on the compute node, moving the software stack to the compute node's RAM disk or local node storage, etc. The best metadata-caching-based performance we've seen at our center is from Cray DVS read-only mounting GPFS with client-side caching turned on, but this doesn't match the node-based solutions (including containers).
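As a rough illustration of the node-local staging approach (not a drop-in recipe; conda-pack, the env name, the tarball path, and the /dev/shm target are all assumptions), one could pack the environment once, unpack it into the node's RAM disk at job start, and register a kernelspec that points at the local copy:

```bash
# One time, on a login node: pack the conda env into a relocatable tarball.
conda pack -n my-env -o /lustre/$USER/my-env.tar.gz

# In the batch script, before any kernel starts (must run on every allocated node):
LOCAL_ENV=/dev/shm/$USER/my-env
mkdir -p "$LOCAL_ENV"
tar -xzf /lustre/$USER/my-env.tar.gz -C "$LOCAL_ENV"
source "$LOCAL_ENV/bin/activate"
conda-unpack    # rewrites prefix paths inside the unpacked env

# Register a kernelspec that uses the node-local python, so first imports hit RAM
# instead of Lustre/CVMFS metadata servers.
"$LOCAL_ENV/bin/python" -m ipykernel install --user \
    --name my-env-local --display-name "Python 3 (node-local)"
```

The trade-off is a one-time unpack cost at job start instead of many slow metadata lookups at import time.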
I don't know enough about CVMFS to help you, but there are probably other contributors here who can recommend tunings.
Thanks @rcthomas for your contribution. We can leave this open for now; maybe you can label it as a discussion.
Dear Batchspawner developers/users,
this is not an issue, just a request for best practices/advice. Technically, our JupyterHub (1.1.0) deployment on HPC using batchspawner (1.0.0) works fine. However, there are still some drawbacks that limit the use of the server.
Is anyone facing this problem? Is there a way to speed up these processes on HPC?
For information, we use modules for Python packages, and jupyterhub-singleuser is located in a conda env.
Thank you in advance.