mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Documentation need: conda package and debugging a workflow #544

Open pavlis opened 1 month ago

pavlis commented 1 month ago

Now that we have a fully functional conda package for both Intel and ARM64 architectures, there is a hole in our documentation. @wangyinz you are the one to fill this hole, as I am not sure exactly how best to handle it.

This quickly gets into inconsistencies between pip and conda installs and how they interact with a local Python environment. There are multiple ways, I think, for any of us to get this wrong, because pip and conda are two package managers that are not always compatible. Ways I know of that cause confusion are:

  1. Our wiki page on building MsPASS from source code advises us to run pip install --user ./ from the top of the source tree after compiling the C++ code. That puts a version of MsPASS in the ~/.local directory (at least on Linux - I am not so sure about macOS). If one does that and also has the conda package installed in some environment, which takes precedence and how do you know which one you are running? (A sketch of one way to check appears after this list.)
  2. Few if any users are likely to want or need to dig into the C++ code base other than to read the code to better understand what it does. Therefore, I suspect we should advise people that they can use their favorite IDE locally to develop workflow code. If they are using the conda package and are set up correctly, they should be able to step into any of our Python code base to debug a failure that requires a debugger to solve. That is a need people will ALWAYS have, and most people prefer a GUI-based debugger to gdb or pdb. Our documentation needs to explain how this should be done. I think I posted on this issue either in a discussion or on the issues page, and we decided then that the problem was too difficult to deal with when running the container version of MsPASS. With the conda package that is not at all true. The Spyder IDE, in fact, comes automatically with a full conda install.
  3. The documentation really should try to concisely explain pip and conda and how they can and cannot be mixed. I find internet sources on this topic very confusing. I suspect the reason is that a large fraction of Python users are confused by the topic.
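
For the precedence question in item 1, one quick check (a suggestion, not established MsPASS documentation) is to ask Python which copy of mspasspy it actually resolves. A path under ~/.local means the pip --user install is shadowing the conda package:

    # Sketch of a precedence check; assumes mspasspy is importable in the
    # active environment.
    import mspasspy
    print(mspasspy.__file__)  # a path under ~/.local means pip --user wins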

The bottom line is that @wangyinz needs to start a new documentation page on this general topic. I promise to extend it if you can start the process - I'm too confused myself to write anything that isn't potential misinformation.

pavlis commented 1 month ago

A case in point I just encountered also shows something we need to document. We have a problem in that our DBClient class has an intimate connection to dask/pyspark. I'm trying to debug a problem with a notebook using Spyder, and I converted the notebook to a Python script to do that. However, it won't run in local mode because of this infamous error thrown by dask when I instantiate DBClient:

2024-06-06 09:03:24,359 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/site-packages/distributed/nanny.py", line 448, in instantiate
    result = await self.process.start()
  File "/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/site-packages/distributed/nanny.py", line 748, in start
    await self.process.start()
  File "/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/site-packages/distributed/process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
  File "/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/site-packages/distributed/process.py", line 215, in _start
    process.start()
  File "/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/home/pavlis/anaconda3/envs/mspass_py310/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)

-- many more error lines --

I've seen this many times and know it is a dask collision problem. Note I'm not using our container but what I think is a pure local environment. I know a workaround for this particular problem: I will just modify my script to not instantiate DBClient but use the raw MongoDB interface (see the sketch below). This points to yet another documentation gap that you, @wangyinz, need to fix. In the parallel processing section of the manual we need a section explaining what DBClient is and why it is structured the way it is. I know approximately: it is a necessary evil to allow a Database object to serialize, which is why this topic belongs in the parallel processing section. We also need a more detailed description of the concepts and of why one should always use get_database and get_database_client rather than instantiating Database and MongoClient (or something like that) directly.
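
A minimal sketch of that workaround, bypassing DBClient and talking to MongoDB directly through pymongo (the connection settings and database name here are placeholders):

    # Workaround sketch: use pymongo directly instead of DBClient.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)  # placeholder connection settings
    db = client["mydb"]                       # placeholder database name
    doc = db["wf_TimeSeries"].find_one()      # inspect one waveform document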

wangyinz commented 1 month ago

hmmm... I think you might be using the Client class instead of DBClient. DBClient should have nothing to do with the scheduler, but Client does: it detects the available scheduler and tries to create a local cluster if none is detected. I just added another else branch to that logic so that it also works when neither dask nor spark is installed. However, I think in your case you already have dask installed, so it is trying to connect to the scheduler.
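
A minimal sketch of the distinction (the import path and the "mydb" name are illustrative; check the package layout of your installed version):

    # DBClient is a thin wrapper around pymongo's MongoClient with no
    # scheduler logic, so it is safe to instantiate in a debugger session.
    from mspasspy.db.client import DBClient

    dbclient = DBClient()               # connects to MongoDB only
    db = dbclient.get_database("mydb")  # illustrative database name
    # mspasspy.client.Client, by contrast, also probes for a dask/spark
    # scheduler and spins up a local cluster when none is found.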

wangyinz commented 1 month ago

I just added the conda document. For the debugging-related topic, maybe that belongs in a developer guide. I need to think about how to write it. This is actually pretty complicated once you consider debugging the C++ code: the C++ code in the conda package does not include debug symbols, so we will need a local build anyway in that case.