danielsvedberg opened 2 years ago
Not sure if it finally worked just because I kept running it over and over again or if the env was actually good, but anyways, here's pip freeze for a successful run:
```
(blechpy) dsvedberg@dsvedberg-Z370M-DS3H:~$ pip freeze
appdirs==1.4.4
backcall==0.1.0
blechpy==2.1.26
bokeh==1.4.0
certifi==2022.5.18.1
cffi==1.14.5
chardet==4.0.0
Click==7.0
cloudpickle==1.3.0
colorama==0.4.4
colorcet==2.0.2
cryptography==3.4.7
cycler==0.10.0
dask==2.11.0
datashader==0.10.0
datashape==0.5.2
debugpy==1.5.1
decorator==4.4.2
dill==0.3.1.1
distributed==2.11.0
easygui==0.98.1
entrypoints==0.3
feather-format==0.4.1
fonttools==4.33.3
fsspec==0.6.2
h5py==2.10.0
HeapDict==1.0.1
holoviews==1.12.7
idna==2.10
imageio==2.8.0
importlib-metadata==4.6.0
ipykernel==6.9.1
ipython==7.32.0
ipython-genutils==0.2.0
jedi==0.16.0
jeepney==0.6.0
Jinja2==2.11.1
joblib==1.1.0
jupyter-client==6.1.12
jupyter-core==4.6.3
keyring==23.0.1
keyrings.alt==4.0.2
kiwisolver==1.1.0
littleutils==0.2.2
llvmlite==0.31.0
locket==0.2.0
Mako==1.1.3
Markdown==3.3.3
MarkupSafe==1.1.1
matplotlib==3.5.2
matplotlib-inline==0.1.3
mistune==0.8.4
mkl-fft==1.3.1
mkl-random==1.2.2
mkl-service==2.4.0
msgpack==1.0.0
multipledispatch==0.6.0
nest-asyncio==1.5.4
networkx==2.4
numba==0.48.0
numexpr==2.7.1
numpy==1.19.5
outdated==0.2.0
packaging==20.3
pandas==1.3.5
pandas-flavor==0.2.0
param==1.9.3
parso==0.6.2
partd==1.1.0
patsy==0.5.1
pdoc3==0.9.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.0.0
pingouin==0.3.8
pkginfo==1.8.2
prompt-toolkit==3.0.3
psutil==5.9.0
ptyprocess==0.6.0
pushbullet-tools==0.0.7
pyarrow==0.17.0
pycparser==2.20
pyct==0.4.6
Pygments==2.9.0
pynndescent==0.4.8
pyparsing==2.4.6
python-dateutil==2.8.1
pytz==2019.3
pyviz-comms==0.7.3
PyWavelets==1.1.1
PyYAML==5.4
pyzmq==22.3.0
readme-renderer==32.0
requests==2.25.1
requests-toolbelt==0.9.1
rfc3986==2.0.0
scikit-image==0.16.2
scikit-learn==1.0.2
scipy==1.3.3
seaborn==0.11.1
SecretStorage==3.3.1
six==1.16.0
sortedcontainers==2.1.0
spyder-kernels==2.1.3
statsmodels==0.11.1
tables==3.6.1
tabulate==0.8.7
tblib==1.6.0
threadpoolctl==3.1.0
toolz==0.10.0
tornado==6.0.4
tqdm==4.61.1
traitlets==5.1.1
twine==3.8.0
typing-extensions==3.10.0.0
umap-learn==0.4.6
urllib3==1.26.6
wcwidth==0.1.8
wurlitzer==3.0.2
xarray==0.15.0
zict==2.0.0
zipp==3.4.1
```
I believe this was helped by upgrading joblib and matplotlib to their most recent versions. Scikit-learn was also upgraded in this env.
Update after continued testing: this mix of packages has not solved the problem per se. With my latest datasets, I have been getting the error over and over again and simply restarting dat.blech_clust_run(umap=True) until it eventually finishes. I did have an interesting moment, though: I realized that I had a handful of electrodes whose spike arrays terminated early because of a noise cutoff, but which weren't marked dead. That dataset had been throwing these errors during blech_clust_run, but when I reprocessed it after marking those channels dead, I got a clean (and quick) blech_clust_run. I'm not sure how this explains the bug, but I would make it standard practice to check all of your electrodes using dat.electrode_mapping and verify that all live channels have the same cutoff time before executing blech_clust_run(). If you have channels that get cut off early, go back, start over from dat.extract_data(), and mark those channels as dead during dat.mark_dead_channels().
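For anyone who wants to script this sanity check, here is a minimal sketch of the idea. Note the column names ('electrode', 'dead', 'cutoff_time') are assumptions for illustration, not confirmed field names of blechpy's electrode_mapping table; substitute whatever your actual dat.electrode_mapping DataFrame uses:

```python
import pandas as pd

# Hypothetical stand-in for dat.electrode_mapping; column names are
# assumed for illustration and may differ in your blechpy version.
electrode_mapping = pd.DataFrame({
    "electrode": [0, 1, 2, 3],
    "dead": [False, False, True, False],
    "cutoff_time": [3600.0, 2150.0, 900.0, 3600.0],  # seconds
})

# Only live channels matter; dead channels are already excluded from clustering.
live = electrode_mapping[~electrode_mapping["dead"]]

# All live channels should share the same (full-recording) cutoff time.
full_length = live["cutoff_time"].max()
early_cutoff = live[live["cutoff_time"] < full_length]

if not early_cutoff.empty:
    print("Channels cut off early; consider marking them dead and "
          "re-running from dat.extract_data():")
    print(early_cutoff["electrode"].tolist())
```

In this toy table, electrode 1 would be flagged because its cutoff time (2150 s) falls short of the full 3600 s recording.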
After researching this issue for about the 5th time, I am now almost certain it's rooted in a known scikit-learn issue where joblib and scikit-learn end up double-parallelizing various jobs, making programs that use them more liable to exhaust system memory. The latest versions of joblib have some automated memory management to prevent this (when blech_clust_run hits a series of channels with many detected spikes, it prints messages indicating that the number of concurrent jobs has been decreased to save memory), but this memory management clearly isn't perfect, and I believe it may fail to predict when scikit-learn multiplies memory usage during its own parallelization of jobs. Scikit-learn is supposed to coordinate with joblib to prevent this, but maybe something specific in the program or environment is misconfigured to cause it. It's clear from the scikit-learn issue thread that some people suffer from this and others do not, and that the bug is sensitive to things like: joblib version, conda updates, versions of unrelated packages like matplotlib, python version, the OS/version being used, C++ versioning, and n_jobs settings.
My main advice for anyone facing this issue is to make sure your base conda, pandas, numpy, scipy, and matplotlib are as up to date as your requirements allow, and to use joblib==1.0.2.
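A quick way to check what your env actually has installed (without diffing a full pip freeze) is the stdlib importlib.metadata; this snippet just reports the versions of the packages mentioned above:

```python
from importlib.metadata import version, PackageNotFoundError

# Packages the advice above says to keep current; joblib is pinned to 1.0.2.
PACKAGES = ["joblib", "pandas", "numpy", "scipy", "matplotlib"]

for pkg in PACKAGES:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```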
Since I'm running through a few full pipelines, I will be able to test the precise versioning and release a blechpy version with a stable set of env requirements. Updating blechpy to python>=3.8 may be a more permanent solution further down the road, though.
This thread suggests that the issue only applies to python < 3.8, so I'm testing blechpy on python 3.8.10 on Christina's computer. (Upgrading python from 3.7.0 to 3.7.13 caused the computer to crash when running blech_clust_run(umap=True).)
I've had this specific issue a few times, so I'm opening an issue to log my progress on it. It's probably related to issue #35 (at least it comes and goes with different joblib versions), but I definitely want a log for this specific error message.
Below is the error message:
Below is the debugging trace:
Below is the pip freeze output for the environment that was used: