mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

dask autolaunching issue with conda package #531

Open pavlis opened 2 months ago

pavlis commented 2 months ago

I'm running into some odd behavior trying to use dask's new "LocalCluster" functionality with the new conda package. The "bottom line" is that when I create an instance of LocalCluster in a jupyter notebook running locally (not in the docker container but locally) I get this error message:

/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34081 instead
  warnings.warn(

I can connect to the standard dask dashboard after that message on port 34081, but when I try to do any dask operation it fails with a long exception chain ending with this:

RuntimeError: Error during deserialization of the task graph. This frequently
occurs if the Scheduler and Client have different environments.
For more information, see
https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments

The message clearly shows that my notebook instance is trying to connect to the scheduler on the wrong port. Thus, even though I could connect to the diagnostic dashboard, dask will not run in this context.

I do not know where this is coming from. Note:

  1. Stopping and restarting the jupyter notebook server does not fix the problem. I get exactly the same message, but with a different port number listed.
  2. I used ps to see if I could find something running that might explain this; nothing I saw did.
  3. The smoking gun appears to be the docker container. The rest of this post discusses what I was doing and how I reached that conclusion.

I was running with the new conda package, using the technique I posted in the wiki a couple weeks ago. That is, I was using the container only to run MongoDB. The launch line was this:

docker run --env MSPASS_ROLE=db  -p 27017:27017 --mount src=`pwd`,target=/home,type=bind mspass/mspass

It was definitely running Mongo correctly: everything I ran that used MongoDB worked fine. The problem surfaced only when I tried to use dask. I conclude the docker container is the "smoking gun" for causing this problem because as soon as I stopped the container I could instantiate an instance of LocalCluster without getting the error shown above.

I looked through the startup script and I cannot see how this is happening. When "MSPASS_ROLE" is set to "db" as above, nothing I can see references dask. Hence, it is possible that even though the gun is smoking the docker container is not the killer, and this is happening some other way. Do any of you have any idea how we can sort this out?

wangyinz commented 2 months ago

The only possibility is that you have a dask scheduler running for some reason that you are not aware of. I am pretty sure that the particular mongodb container will not be the issue because it does not map port 8787 at all. Have you tried the docker ps command to list all currently running containers? Other than that, I can't think of anything else. Well, maybe my question is: have you ever gotten LocalCluster running without it complaining about the port being in use? If so, how?
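
If it helps, here is a container-independent way to check whether anything is actually listening on 8787 (a minimal sketch; it assumes everything runs on localhost):

import socket

# connect_ex returns 0 only if something accepted a connection on the port
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("127.0.0.1", 8787)) == 0
print("port 8787 is in use" if in_use else "port 8787 is free")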

pavlis commented 2 months ago

In a meeting a short time ago @wangyinz gave some suggestions that helped solve this problem. The cause and solution are explained well here. The problem was that I was following suggestions I found online that were apparently wrong. I did this:

from dask.distributed import LocalCluster, Client  # imports needed by this snippet

cluster = LocalCluster()
client = Client(cluster)

The second line was what was throwing the error. It appears to not be necessary.

The source referenced above shows what seems to be the right way to handle this: you more or less create the scheduler and worker as a separate process, much as we do in the container, and the processing job just connects to that externally created instance of "LocalCluster". There seems to be a lot of misinformation on this topic on the web, much of it created by the implicit launches of dask that can occur in some contexts, such as creating and using a bag.
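
In code, that pattern looks roughly like this (a sketch, assuming a scheduler was started externally with the dask-scheduler command on its default port 8786; 8787 is only the dashboard):

from dask.distributed import Client

# connect to the externally launched scheduler instead of creating one here
client = Client("tcp://127.0.0.1:8786")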

I am going to mark this issue closed.

pavlis commented 2 months ago

I acted prematurely in closing this issue. The potential solution I noted above did not fix the problem, and actually made it somewhat worse. What happened, however, suggests what the fundamental problem causing this behavior is. What I observed is that if I launch dask-scheduler and a dask-worker in separate windows, which makes them run as an external abstraction of a cluster definition, then when I do our stock incantation to access a database:

from mspasspy.db.database import Database
import mspasspy.client as msc
dbclient=msc.Client()
db = dbclient.get_database('scoped2024')

It immediately issues the complaint above about port 8787 already being in use. In this case that is expected, since I launched dask independently. What this shows, however, is that instantiating an instance of Database causes dask to be invoked somewhere. When I look at the client.py file I see that all get_database really does is call the constructor of the Database class. The constructor does not reference dask as far as I can see. However, database.py has some odd constructs I don't understand in the import lines:


try:
    import dask.bag as daskbag

    _mspasspy_has_dask = True
except ImportError:
    _mspasspy_has_dask = False

try:
    import dask.dataframe as daskdf
except ImportError:
    _mspasspy_has_dask = False

try:
    import pyspark

    _mspasspy_has_pyspark = True
except ImportError:
    _mspasspy_has_pyspark = False

Why are those necessary? I haven't dug into the monstrosity of the modules that are referenced there, but from the behavior it looks like they don't just load code but do some kind of initialization. I don't know how or even if you could do that in a python module, but that seems to be what this is doing. dask IS referenced in the Database class in the DataFrame sections, but none of that is referenced in the constructor. If someone in the group can't enlighten me on this I'll need to step into the code with an interactive debugger to try to sort it out. I realize I lack a fundamental understanding of what happens in an "import" command.
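
If I understand correctly, a module's top-level statements execute the first time it is imported, so an import alone can perform initialization. A toy illustration (hypothetical file, not MsPASS code):

# demo_mod.py -- a hypothetical module
print("demo_mod: this line runs at import time")

# some other script
import demo_mod  # executes the print above on the first import only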

This behavior, by the way, is troublesome for our near-term plans because it could break the cloud implementation with "coiled" that depends on the conda package. The concern is that something we have here will break whenever an instance of the abstraction of a dask distributed cluster (i.e. LocalCluster, SLURMCluster, etc.) is used with our Database class. That is the pessimistic view. I think it is more likely only an issue with LocalCluster, since port contention only happens if you are running the components of mspass on a single host. When we run with the container on a cluster none of this can happen, because each "role" is isolated to a container. When running on a desktop, however, that isolation does not occur. That, at least, is my working hypothesis to explain this behavior.

wangyinz commented 2 months ago

Those import lines are irrelevant. They are used by the code to figure out whether dask or pyspark is installed, so that the default API of the corresponding one can be used. This is now obsolete inside the database module because of the move of the read_distributed_data function, and we actually need to move those lines there: I just checked, and the new distributed module does not have them.

With that being said, I think you found exactly the problem, which is inside the constructor of our client: https://github.com/mspass-team/mspass/blob/33e74d86cbadd50fce05c77e51083dc4ffb52bb1/python/mspasspy/client.py#L206-L235 As you can see at line 229, if a dask scheduler is not detected, it actually tries to create one. That actually makes sense, but it means users are not expected to create their own scheduler.
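
In other words, the behavior there is roughly the following (my paraphrase, not the actual mspasspy source; the function name is invented for illustration):

from dask.distributed import Client, LocalCluster

def _connect_or_launch_scheduler(address=None):
    # hypothetical sketch of the logic described above
    if address is not None:
        return Client(address)  # attach to a scheduler we detected
    # no scheduler detected: silently launch our own LocalCluster,
    # which is what surprises users who started a scheduler themselves
    return Client(LocalCluster())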

pavlis commented 2 months ago

I think two things are needed to address this issue, and @wangyinz is the only one in the group who can effectively accomplish them:

  1. Seems the constructor for mspasspy.client.Client needs options that allow it to be created without launching dask or spark. Both situations are common when running a serial job. I think the solution for a serial job is to not instantiate an instance of Client but just create a Database handle directly (see the sketch after this list). Then you never hit this issue. That detail could be handled by documentation.
  2. I hadn't quite realized the scale and importance of the documentation gap we have on the Client you reference here. I hadn't looked at this closely before, but I remember discussion in the past about the need for this beast to allow a Database handle to be serialized. After perusing client.py I conclude: (1) you need to improve the docstring, as this is a core class for MsPASS, and (2) there should be a user manual page on this topic. This is way too important and complex to assume someone can figure it out without more guidance. A User Manual page is needed because there are some strange concepts here that few if any of our users could sort through without help.
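
A sketch of the serial pattern suggested in item 1 (this assumes the lower-level DBClient wrapper in mspasspy.db.client; the API should be verified against the installed version):

from mspasspy.db.client import DBClient
from mspasspy.db.database import Database

# build the Database handle directly; no mspasspy.client.Client is created,
# so no dask or spark scheduler is launched as a side effect
dbclient = DBClient()
db = Database(dbclient, "scoped2024")
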
pavlis commented 2 months ago

One small correction/addition to the previous comment. The docstring for the Database constructor actually mentions that it should be used only for serial jobs. So, we have at least an oblique reference to this problem in our documentation. It still screams to have the docstring for mspasspy.client.Client improved and a User Manual section written on this topic.

pavlis commented 2 months ago

Also, I confirm that that approach does address the problem when using the conda package. BUT there is a big caveat: the Database constructor fails UNLESS the environment variable MSPASS_HOME is defined AND the directory it points to contains the default mspass.yaml schema file in its data/yaml subdirectory.

I do not know if that can be handled in the package definition. A lot of packages have some kind of data directory to hold various required data files. If that cannot be done automatically it absolutely will require clear directions in the User Manual.

wangyinz commented 2 months ago

I am a little confused. I thought we don't need MSPASS_HOME because we have the following lines that read the copy installed within mspasspy: https://github.com/mspass-team/mspass/blob/33e74d86cbadd50fce05c77e51083dc4ffb52bb1/python/mspasspy/db/schema.py#L23-L26
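
For context, those lines resolve the packaged yaml file relative to the module's own location, along these lines (a paraphrase, not the exact schema.py source):

import os

# locate data/yaml/mspass.yaml relative to the installed schema.py,
# with data/ sitting at the root of the mspasspy package
default_schema_file = os.path.abspath(
    os.path.join(os.path.dirname(__file__), "..", "data", "yaml", "mspass.yaml")
)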

I guess the error you saw may be from somewhere other than this schema class. Do you have a backtrace of it? We only need to add the same handling to whatever code is throwing the error.

pavlis commented 2 months ago

That works when running with the container but not for use with the conda package. In that environment pwd can be anything. Then again, maybe I misunderstand what __file__ resolves to in that code. All I know is that it failed, but when I defined MSPASS_HOME the same code worked.

wangyinz commented 2 months ago

__file__ resolves to the path of the current module file, which in this context is where schema.py is installed. Because we also install the data dir into the root dir of mspasspy, it should always work as long as the installation is correct. What you ran into might suggest that the conda install is incorrect. Maybe this is something @Aristoeu should address - whether the data dir is installed to the right place.

wangyinz commented 2 months ago

hmmm.... weird, I just checked and it seems the conda install does include the data dir at the correct location...

wangyinz commented 2 months ago

I just tried with a clean conda install and the following code works fine:

from mspasspy.db.database import Database
import mspasspy.client as msc
dbclient=msc.Client()
db = dbclient.get_database('getting_started')

I also tried calling the SchemaBase constructor explicitly and it also worked fine. Note that I was following the exact steps in https://github.com/mspass-team/mspass/wiki/Desktop-Use-with-Anaconda, and this was done in a clean CentOS 7 container with no mongodb or dask pre-installed. I also made sure that no MSPASS_HOME was set.

BTW, I checked and it shows that the above code correctly started a local dask cluster, so it works as expected:

>>> dbclient.get_scheduler()
<Client: 'tcp://127.0.0.1:39235' processes=4 threads=16, memory=50.09 GiB>

pavlis commented 2 months ago

After your experience I think I understand what is happening here. I have the conda package superimposed on a local build I installed via

pip install --user .

run in the top level of the mspass repository. Indeed when I look I see that the local library is overriding the conda package.
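
For anyone who hits the same thing, the shadowing is easy to confirm from python itself:

import mspasspy

# the path shows which copy is actually imported; a user-site path
# (from pip install --user) means it is overriding the conda package
print(mspasspy.__file__)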

This is a good lesson because I can see two uses for the conda package:

  1. The reason we built it, which is to facilitate cloud computing with coiled.
  2. A simplified way for users to run mspass on a desktop where they have full control over their conda environment. With the conda package one could develop pure python code without the baggage of building the compiled C++ code.

The solution to this issue is thus in the documentation. I suggest we proceed in two stages:

  1. I think @wangyinz needs to update the wiki page I created a while back and revised (incorrectly) yesterday, found here. What needs to be added there are clear instructions on how to set up the python path to ensure the conda package overrides any potentially conflicting packages installed on the system.
  2. Once you educate me I promise to write a User Manual section on using MsPASS with Anaconda.

This issue may also be the key to resolving the problem we encountered building the arm64 version of the conda package. Recall that the obspy dependency of mspass presented a problem for the package and seems to require some other solution, like using pip to install obspy on arm64 machines. Installing obspy via pip is effectively the same issue that caused me to open this one: I was mixing conflicting packages installed with pip and conda. This is, in fact, a case in point about the biggest single problem with python, and the one that makes the container such an important solution for MsPASS: the nearly inevitable package conflicts that happen when mixing package managers.

pavlis commented 2 months ago

You all can kick me, but much of the confusion in the later part of this thread was created by a blunder on my part: the problem I had with MSPASS_HOME was due to activating the wrong anaconda environment.

The original reason for posting this issue, however, remains important. That is, the MsPASS client can launch a second instance of dask in some situations. I don't think this issue should be closed until we fill the documentation gaps noted above. That means:

  1. Improve the docstring for mspasspy.client.Client
  2. When the dust settles we absolutely must include a fairly comprehensive user manual page on how to use the conda package in different environments. My recent experience confirms that the way you should use it for a local install is not the same as the way you would use it with coiled, which is the actual reason we pushed that development anyway. I think your perspective, @wangyinz, that use in a cloud system with something like coiled or on an HPC system with something like SLURMCluster is a more advanced usage is true. On the other hand, I know from previous short courses that the conda package is likely to be the most common way mspass is used on desktops. For all its weaknesses, conda is what people know and where most are likely to first get their feet wet with MsPASS.

wangyinz commented 2 months ago

hmmm.... I am still a bit confused. Even if you are using the mspasspy installed by pip install --user ., it should still have the data dir installed to the user path, so the __file__ trick should still work. Actually, I can't think of a way to make it fail: as long as a data dir is installed at the expected relative path to schema.py, the trick works fine. Could you please check your installations for a data dir? If there isn't one, I want to be able to reproduce that, since it would mean our setup script has issues and needs a fix.
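
A quick way to run that check (a sketch; it relies only on the data dir being installed at the root of the mspasspy package, as described above):

import os
import mspasspy

pkg_dir = os.path.dirname(mspasspy.__file__)
print(pkg_dir)  # which installation is actually in use
print(os.path.isdir(os.path.join(pkg_dir, "data")))  # is the data dir there?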

pavlis commented 2 months ago

Well, hmmm back. I'm unable to recreate the error I had that I thought required MSPASS_HOME. For now presume I did something else wrong and jumped to the wrong conclusion.

The problem of dask being relaunched is still there, and it is something to document for running serial jobs with the conda package. Sorry to make you chase this.