msk-mind / luna

Scripts for data processing
https://msk-mind.github.io/luna
Apache License 2.0
41 stars 10 forks source link

Tutorial update 2023 07 #378

Closed tomp closed 1 year ago

tomp commented 1 year ago

These should be all of the changes I made to the luna/docker/luna_tutorial folder, to get the first tutorial notebook to work for me. I tried to explain the reasons for these changes in the commit messages.

I should mention that the first notebook does not actually work in this rewriting without some manual intervention to tell luna where the dask scheduler is running. It should be straightforward to make that happen automatically, but I wanted to get this PR up, first.

Fixes:

  1. Replaced the dockerfile for girder with one based on their "official" dockerfile. I was unable to successfully build with the original dockerfile because of a couple of issues with the nodejs install, and an incompatibility with recent versions of openssl, neither of which was an issue with the official dockerfile.
  2. Use python3 to execute python scripts, instead of python, since the latter may still be python 2.7 on some systems. This was just an issue with one command I copied to the new girder dockerfile.
  3. Execute provision_dremio.py using conda_python wrapper, which activates the conda venv before executing the script. This is necessary when using docker exec, because .bashrc isn't loaded when docker exec is used. The need for this could be avoided if we just used poetry install with a standard python venv in the dockerfile, instead of conda.
  4. The WORKDIR was defined differently in dockerfiles for girder and for jupyter. I changed it to /home/$USER_NAME for both.
  5. In the dsa_annotation_etl.py code, I cast two pathnames (constructed with pathlib) to strings before passing them to other modules. I was getting errors from those modules complaining that that didn't recognize PosixPath types, which is what pathlib was giving us.
  6. I modified dockerfile_luna_tutorial to start up a dask scheduler and worker, because dask was throwing a huge number of errors when the dask client was closed. Those errors didn't seem to affect the result, but it was still ugly, and at least once this prevented the script from exiting at all.
  7. I modified the commands to have internal port numbers used everywhere one service needed to contact another. The main one I remember was for the make provision_dremio target, but I think at least one other interaction was broken because a command was using the wrong port.

Other changes:

  1. I split out a make summary target from make run, in the Makefile. The make summary target just prints the instructions for connecting to the various service spun up for the tutorial.
  2. I reformatted the output for make summary to something I thought was a little easier to read.
  3. I reworded some of the text in the first tutorial notebook, and removed text that seemed to be obsolete.
  4. I modified some of the code in the first notebook to use variables for some of the really long pathnames.

Remaining work:

armaank commented 1 year ago

Tom, this looks great to me, thanks for all of the changes. @raylim - can you try running the now updated tutorials?

One item that I know needs to be addressed is that some of the tutorials have out-dated function calls that need to be updated due to recent changes. Putting aside the dask issue (I think the function we have get_or_create_dask_client() might be what you're looking for), if you could go through the tutorials and verify that you can run them all w/ the latest version of luna, that would be great!

darinmoore commented 1 year ago

Not sure what the previous behavior was, but I'm unable to build this locally (probably due to M1 using ARM), but I can build it on the servers.

image

raylim commented 1 year ago

Looks like the arguments to the CLIs still have to be updated to in the jupyter notebooks.

armaank commented 1 year ago

Great! One more thing, see https://msk-mind.github.io/luna/contributing/ for the process we follow to merge into dev. Remember to squash and merge, and update you git commit messages to use the appropriate style.