lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

Questions about problems in ingest related to pynndescent #133

Open HelloWorldLTY opened 2 years ago

HelloWorldLTY commented 2 years ago

Sorry to bother you, but the ingest method in scanpy seems to run into a problem caused by pynndescent and numba. Here are the details:

KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/numba/core/caching.py in save(self, key, data)
    486 # If key already exists, we will overwrite the file
--> 487 data_name = overloads[key]
    488 except KeyError:

KeyError: ((array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C), array(float32, 2d, C), type(CPUDispatcher(<function squared_euclidean at 0x7fdefdb13830>)), array(int64, 1d, C), float64), ('x86_64-unknown-linux-gnu', 'broadwell', '+64bit,+adx,+aes,+avx,+avx2,-avx512bf16,-avx512bitalg,-avx512bw,-avx512cd,-avx512dq,-avx512er,-avx512f,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,-avx512vl,-avx512vnni,-avx512vpopcntdq,+bmi,+bmi2,-cldemote,-clflushopt,-clwb,-clzero,+cmov,+cx16,+cx8,-enqcmd,+f16c,+fma,-fma4,+fsgsbase,+fxsr,-gfni,+invpcid,-lwp,+lzcnt,+mmx,+movbe,-movdir64b,-movdiri,-mwaitx,+pclmul,-pconfig,-pku,+popcnt,-prefetchwt1,+prfchw,-ptwrite,-rdpid,+rdrnd,+rdseed,+rtm,+sahf,-sgx,-sha,-shstk,+sse,+sse2,+sse3,+sse4.1,+sse4.2,-sse4a,+ssse3,-tbm,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop,+xsave,-xsavec,+xsaveopt,-xsaves'), ('308c49885ad3c35a475c360e21af1359caa88c78eb495fa0f5e8c6676ae5019e', 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'))

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
in ()
----> 1 sc.tl.ingest(adata, adata_ref, obs='louvain') #ingest

/usr/local/lib/python3.7/dist-packages/scanpy/tools/_ingest.py in ingest(adata, adata_ref, obs, embedding_method, labeling_method, neighbors_key, inplace, **kwargs)
    131
    132 if obs is not None:
--> 133 ing.neighbors(**kwargs)
    134 for i, col in enumerate(obs):
    135 ing.map_labels(col, labeling_method[i])

/usr/local/lib/python3.7/dist-packages/scanpy/tools/_ingest.py in neighbors(self, k, queue_size, epsilon, random_state)
    469 self._nnd_idx.search_rng_state = rng_state
    470
--> 471 self._indices, self._distances = self._nnd_idx.query(test, k, epsilon)
    472
    473 else:

/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
   1564 """
   1565 if not hasattr(self, "_search_graph"):
-> 1566 self._init_search_graph()
   1567
   1568 if not self._is_sparse:

/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py in _init_search_graph(self)
   1061 self._distance_func,
   1062 self.rng_state,
-> 1063 self.diversify_prob,
   1064 )
   1065 reverse_graph.eliminate_zeros()

/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    432 e.patch_message('\n'.join((str(e).rstrip(), help_msg)))
    433 # ignore the FULL_TRACEBACKS config, this needs reporting!
--> 434 raise e
    435
    436 def inspect_llvm(self, signature=None):

/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    365 argtypes.append(self.typeof_pyval(a))
    366 try:
--> 367 return self.compile(tuple(argtypes))
    368 except errors.ForceLiteralArg as e:
    369 # Received request for compiler re-entry with the list of arguments

/usr/local/lib/python3.7/dist-packages/numba/core/compiler_lock.py in _acquire_compile_lock(*args, **kwargs)
     30 def _acquire_compile_lock(*args, **kwargs):
     31 with self:
---> 32 return func(*args, **kwargs)
     33 return _acquire_compile_lock
     34

/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py in compile(self, sig)
    823 raise e.bind_fold_arguments(folded)
    824 self.add_overload(cres)
--> 825 self._cache.save_overload(sig, cres)
    826 return cres.entry_point
    827

/usr/local/lib/python3.7/dist-packages/numba/core/caching.py in save_overload(self, sig, data)
    669 """
    670 with self._guard_against_spurious_io_errors():
--> 671 self._save_overload(sig, data)
    672
    673 def _save_overload(self, sig, data):

/usr/local/lib/python3.7/dist-packages/numba/core/caching.py in _save_overload(self, sig, data)
    679 key = self._index_key(sig, _get_codegen(data))
    680 data = self._impl.reduce(data)
--> 681 self._cache_file.save(key, data)
    682
    683 @contextlib.contextmanager

/usr/local/lib/python3.7/dist-packages/numba/core/caching.py in save(self, key, data)
    494 break
    495 overloads[key] = data_name
--> 496 self._save_index(overloads)
    497 self._save_data(data_name, data)
    498

/usr/local/lib/python3.7/dist-packages/numba/core/caching.py in _save_index(self, overloads)
    540 def _save_index(self, overloads):
    541 data = self._source_stamp, overloads
--> 542 data = self._dump(data)
    543 with self._open_for_write(self._index_path) as f:
    544 pickle.dump(self._version, f, protocol=-1)

/usr/local/lib/python3.7/dist-packages/numba/core/caching.py in _dump(self, obj)
    568
    569 def _dump(self, obj):
--> 570 return pickle.dumps(obj, protocol=-1)
    571
    572 @contextlib.contextmanager

TypeError: can't pickle weakref objects

How can I solve this problem? Thanks. The code comes from: https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html

HelloWorldLTY commented 2 years ago

For more information, you can take a look at https://github.com/theislab/scanpy/issues/1951

lmcinnes commented 2 years ago

This is an issue with numba's caching failing to save or load compiled versions of functions. Are you running on Colab by any chance? I cannot reproduce this issue except on Colab, so if you have a specific system where you can reproduce it, that would be good to know. This can be worked around (albeit strangely); see issue #131.
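For readers following the traceback: the final TypeError is raised by pickle.dumps inside numba's cache-saving code, not by the nearest-neighbour search itself. The sketch below uses only the standard library to show that pickling a weak reference fails with the same exception class. The Target class is a hypothetical stand-in; which object numba actually held a weak reference to is not established in this thread.

```python
import pickle
import weakref


class Target:
    """Hypothetical stand-in for whatever object was weakly referenced."""


obj = Target()          # keep a strong reference so the weakref stays alive
ref = weakref.ref(obj)

try:
    # Same call shape as numba's _dump() in the traceback above.
    pickle.dumps(ref, protocol=-1)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The exact message wording differs between Python versions, but the exception type is TypeError in each case.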

HelloWorldLTY commented 2 years ago

This happens on Colab. Here is my environment (package, version):


absl-py 0.12.0 alabaster 0.7.12 albumentations 0.1.12 altair 4.1.0 anndata 0.7.6 anndata2ri 1.0.6 annoy 1.17.0 appdirs 1.4.4 argon2-cffi 20.1.0 arviz 0.11.2 astor 0.8.1 astropy 4.2.1 astunparse 1.6.3 async-generator 1.10 atari-py 0.2.9 atomicwrites 1.4.0 attrs 21.2.0 audioread 2.1.9 autograd 1.3 Babel 2.9.1 backcall 0.2.0 beautifulsoup4 4.6.3 bleach 3.3.0 blis 0.4.1 bokeh 2.3.3 Bottleneck 1.3.2 branca 0.4.2 bs4 0.0.1 CacheControl 0.12.6 cached-property 1.5.2 cachetools 4.2.2 catalogue 1.0.0 certifi 2021.5.30 cffi 1.14.6 cftime 1.5.0 chardet 3.0.4 charset-normalizer 2.0.2 click 7.1.2 cloudpickle 1.3.0 cmake 3.12.0 cmdstanpy 0.9.5 colorcet 2.0.6 colorlover 0.3.0 community 1.0.0b1 contextlib2 0.5.5 convertdate 2.3.2 coverage 3.7.1 coveralls 0.5 crcmod 1.7 cufflinks 0.17.3 cupy-cuda101 9.1.0 cvxopt 1.2.6 cvxpy 1.0.31 cycler 0.10.0 cymem 2.0.5 Cython 0.29.23 daft 0.0.4 dask 2.12.0 datascience 0.10.6 debugpy 1.0.0 decorator 4.4.2 defusedxml 0.7.1 Deprecated 1.2.12 descartes 1.1.0 dill 0.3.4 distributed 1.25.3 dlib 19.18.0 dm-tree 0.1.6 docopt 0.6.2 docutils 0.17.1 dopamine-rl 1.0.5 dunamai 1.5.5 earthengine-api 0.1.272 easydict 1.9 ecos 2.0.7.post1 editdistance 0.5.3 en-core-web-sm 2.2.5 entrypoints 0.3 ephem 4.0.0.2 et-xmlfile 1.1.0 fa2 0.3.5 fastai 1.0.61 fastdtw 0.3.4 fastprogress 1.0.0 fastrlock 0.6 fbprophet 0.7.1 feather-format 0.4.1 filelock 3.0.12 firebase-admin 4.4.0 fix-yahoo-finance 0.0.22 Flask 1.1.4 flatbuffers 1.12 folium 0.8.3 future 0.16.0 gast 0.4.0 GDAL 2.2.2 gdown 3.6.4 gensim 3.6.0 geographiclib 1.52 geopy 1.17.0 get-version 3.5 gin-config 0.4.0 glob2 0.7 google 2.0.3 google-api-core 1.26.3 google-api-python-client 1.12.8 google-auth 1.32.1 google-auth-httplib2 0.0.4 google-auth-oauthlib 0.4.4 google-cloud-bigquery 1.21.0 google-cloud-bigquery-storage 1.1.0 google-cloud-core 1.0.3 google-cloud-datastore 1.8.0 google-cloud-firestore 1.7.0 google-cloud-language 1.2.0 google-cloud-storage 1.18.1 google-cloud-translate 1.5.0 google-colab 1.0.0 
google-pasta 0.2.0 google-resumable-media 0.4.1 googleapis-common-protos 1.53.0 googledrivedownloader 0.4 graphtools 1.5.2 graphviz 0.10.1 greenlet 1.1.0 grpcio 1.34.1 gspread 3.0.1 gspread-dataframe 3.0.8 gym 0.17.3 h5py 2.10.0 HeapDict 1.0.1 hijri-converter 2.1.3 holidays 0.10.5.2 holoviews 1.14.4 html5lib 1.0.1 httpimport 0.5.18 httplib2 0.17.4 httplib2shim 0.0.3 humanize 0.5.1 hyperopt 0.1.2 ideep4py 2.0.0.post3 idna 2.10 imageio 2.4.1 imagesize 1.2.0 imap 1.0.0 imbalanced-learn 0.4.3 imblearn 0.0 imgaug 0.2.9 importlib-metadata 4.6.1 importlib-resources 5.2.0 imutils 0.5.4 inflect 2.1.0 iniconfig 1.1.1 install 1.3.4 intel-openmp 2021.3.0 intervaltree 2.1.0 ipykernel 4.10.1 ipython 5.5.0 ipython-genutils 0.2.0 ipython-sql 0.3.9 ipywidgets 7.6.3 itsdangerous 1.1.0 jax 0.2.17 jaxlib 0.1.69+cuda110 jdcal 1.4.1 jedi 0.18.0 jieba 0.42.1 Jinja2 2.11.3 joblib 1.0.1 jpeg4py 0.1.4 jsonschema 2.6.0 jupyter 1.0.0 jupyter-client 5.3.5 jupyter-console 5.2.0 jupyter-core 4.7.1 jupyterlab-pygments 0.1.2 jupyterlab-widgets 1.0.0 kaggle 1.5.12 kapre 0.3.5 Keras 2.4.3 keras-nightly 2.5.0.dev2021032900 Keras-Preprocessing 1.1.2 keras-vis 0.4.1 kiwisolver 1.3.1 korean-lunar-calendar 0.2.1 librosa 0.8.1 lightgbm 2.2.3 llvmlite 0.34.0 lmdb 0.99 loompy 3.0.6 louvain 0.7.0 LunarCalendar 0.0.9 lxml 4.2.6 magic-impute 3.0.0 Markdown 3.3.4 MarkupSafe 2.0.1 matplotlib 3.2.2 matplotlib-inline 0.1.2 matplotlib-venn 0.11.6 memory-profiler 0.58.0 missingno 0.5.0 mistune 0.8.4 mizani 0.6.0 mkl 2019.0 mlxtend 0.14.0 mnnpy 0.1.9.5 more-itertools 8.8.0 moviepy 0.2.3.5 mpmath 1.2.1 msgpack 1.0.2 multiprocess 0.70.12.2 multitasking 0.0.9 murmurhash 1.0.5 music21 5.5.0 natsort 5.5.0 nbclient 0.5.3 nbconvert 5.6.1 nbformat 5.1.3 nest-asyncio 1.5.1 netCDF4 1.5.7 networkx 2.5.1 nibabel 3.0.2 nltk 3.2.5 notebook 5.3.1 numba 0.51.2 numexpr 2.7.3 numpy 1.18.1 numpy-groupies 0.9.13 nvidia-ml-py3 7.352.0 oauth2client 4.1.3 oauthlib 3.1.1 okgrade 0.4.3 opencv-contrib-python 4.1.2.30 opencv-python 4.1.2.30 
openpyxl 2.5.9 opt-einsum 3.3.0 osqp 0.6.2.post0 packaging 21.0 palettable 3.3.0 pandas 1.1.5 pandas-datareader 0.9.0 pandas-gbq 0.13.3 pandas-profiling 1.4.1 pandocfilters 1.4.3 panel 0.11.3 param 1.11.1 parso 0.8.2 pathlib 1.0.1 patsy 0.5.1 pexpect 4.8.0 phate 1.0.7 pickleshare 0.7.5 Pillow 7.1.2 pip 21.1.3 pip-tools 4.5.1 plac 1.1.3 plotly 4.4.1 plotnine 0.6.0 pluggy 0.7.1 pooch 1.4.0 portpicker 1.3.9 prefetch-generator 1.0.1 preshed 3.0.5 prettytable 2.1.0 progressbar2 3.38.0 prometheus-client 0.11.0 promise 2.3 prompt-toolkit 1.0.18 protobuf 3.17.3 psutil 5.4.8 psycopg2 2.7.6.1 ptyprocess 0.7.0 py 1.10.0 pyarrow 3.0.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycocotools 2.0.2 pycparser 2.20 pyct 0.4.8 pydata-google-auth 1.2.0 pydot 1.3.0 pydot-ng 2.0.0 pydotplus 2.0.2 PyDrive 1.3.1 pyemd 0.5.1 pyerfa 2.0.0 pyglet 1.5.0 Pygments 2.6.1 pygobject 3.26.1 PyGSP 0.5.1 pymc3 3.11.2 PyMeeus 0.5.11 pymongo 3.11.4 pymystem3 0.2.0 pynndescent 0.5.4 PyOpenGL 3.1.5 pyparsing 2.4.7 pyrsistent 0.18.0 pysndfile 1.3.8 PySocks 1.7.1 pystan 2.19.1.1 pytest 3.6.4 python-apt 0.0.0 python-chess 0.23.11 python-dateutil 2.8.1 python-igraph 0.9.6 python-louvain 0.15 python-slugify 5.0.2 python-utils 2.5.6 pytz 2018.9 pyviz-comms 2.1.0 PyWavelets 1.1.1 PyYAML 3.13 pyzmq 22.1.0 qdldl 0.1.5.post0 qtconsole 5.1.1 QtPy 1.9.0 regex 2019.12.20 requests 2.23.0 requests-oauthlib 1.3.0 resampy 0.2.2 retrying 1.3.3 rpy2 3.4.5 rsa 4.7.2 s-gd2 1.8 scanpy 1.8.1 scIB 0.1.1 scikit-image 0.16.2 scikit-learn 0.22.2.post1 scikit-misc 0.1.4 scipy 1.4.1 scprep 1.1.0 screen-resolution-extra 0.0.0 scs 2.1.4 seaborn 0.11.1 semver 2.13.0 Send2Trash 1.7.1 setuptools 57.2.0 setuptools-git 1.2 Shapely 1.7.1 simplegeneric 0.8.1 sinfo 0.3.4 six 1.15.0 sklearn 0.0 sklearn-pandas 1.8.0 smart-open 5.1.0 snowballstemmer 2.1.0 sortedcontainers 2.4.0 SoundFile 0.10.3.post1 spacy 2.2.4 Sphinx 1.8.5 sphinxcontrib-serializinghtml 1.1.5 sphinxcontrib-websupport 1.2.4 SQLAlchemy 1.4.20 sqlparse 0.4.1 srsly 1.0.5 statsmodels 0.10.2 
stdlib-list 0.8.0 sympy 1.7.1 tables 3.4.4 tabulate 0.8.9 tasklogger 1.1.0 tblib 1.7.0 tensorboard 2.5.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.0 tensorflow 2.5.0 tensorflow-datasets 4.0.1 tensorflow-estimator 2.5.0 tensorflow-gcs-config 2.5.0 tensorflow-hub 0.12.0 tensorflow-metadata 1.1.0 tensorflow-probability 0.13.0 termcolor 1.1.0 terminado 0.10.1 testpath 0.5.0 text-unidecode 1.3 textblob 0.15.3 texttable 1.6.4 Theano-PyMC 1.1.2 thinc 7.4.0 tifffile 2021.7.2 toml 0.10.2 toolz 0.11.1 torch 1.9.0+cu102 torchsummary 1.5.1 torchtext 0.10.0 torchvision 0.10.0+cu102 tornado 5.1.1 tqdm 4.41.1 traitlets 5.0.5 tweepy 3.10.0 typeguard 2.7.1 typing-extensions 3.7.4.3 tzlocal 1.5.1 umap-learn 0.5.1 uritemplate 3.0.1 urllib3 1.24.3 vega-datasets 0.9.0 wasabi 0.8.2 wcwidth 0.2.5 webencodings 0.5.1 Werkzeug 1.0.1 wheel 0.36.2 widgetsnbextension 3.5.1 wordcloud 1.5.0 wrapt 1.12.1 xarray 0.18.2 xgboost 0.90 xkit 0.0.0 xlrd 1.1.0 xlwt 1.3.0 yellowbrick 0.9.1 zict 2.0.0 zipp 3.5.0

keherri commented 2 years ago

Running into the same issue in a Jupyter notebook on AWS SageMaker.

lmcinnes commented 2 years ago

I wish I had better answers, but this very much seems to be an issue with cloud services and how they actually back the "local" storage that is used for caching numba-compiled functions. I would strongly suggest you take this a little further upstream: either with the cloud providers, or with the numba team, or both, since this is definitely beyond my expertise.

stuartarchibald commented 2 years ago

If you make a directory under /tmp, for example /tmp/numba_cache, and then set the environment variable NUMBA_CACHE_DIR to point to it (i.e. export NUMBA_CACHE_DIR=/tmp/numba_cache), does that help?

HelloWorldLTY commented 2 years ago

I used this statement to change the cache dir: IPython.paths.set_ipython_cache_dir = '/content/tmp/numba_cache'. But I got the same errors again, so it seems this method does not solve the problem.

Could you please be more specific? How exactly should I set NUMBA_CACHE_DIR=/tmp/numba_cache? Thanks.

HPLegion commented 2 years ago

IPython.paths.set_ipython_cache_dir = '/content/tmp/numba_cache'

This issue is not about the IPython cache but about numba's cache. Messing with the IPython cache is probably something you want to avoid.

NUMBA_CACHE_DIR is meant to be a system environment variable that numba reads while it sets itself up. On POSIX systems you can usually set such variables with export NUMBA_CACHE_DIR=... (I don't know if Colab allows this through shell escapes), or you can set it using Python's os.environ. The important thing is to change it BEFORE you import numba. Then numba should try to use the given directory as its cache directory.
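A minimal sketch of the os.environ route described above, assuming a writable temporary directory is acceptable as the cache location. The directory name numba_cache is only an illustration, and whether this actually resolves the Colab error is not confirmed in this thread.

```python
import os
import tempfile

# Hypothetical cache location; any writable directory should do.
cache_dir = os.path.join(tempfile.gettempdir(), "numba_cache")
os.makedirs(cache_dir, exist_ok=True)

# This must run before the first `import numba`, and before importing
# anything that imports numba itself (pynndescent, umap, scanpy, ...).
os.environ["NUMBA_CACHE_DIR"] = cache_dir

# import scanpy as sc   # import numba-dependent packages only after this point

print(os.environ["NUMBA_CACHE_DIR"])
```

In a notebook where numba has already been imported, the kernel would presumably need to be restarted so that this cell runs first.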

HelloWorldLTY commented 2 years ago

If I use os.environ = '/content/tmp/numba_cache', it seems to take more time to load the normal packages. Is there a better solution?

HPLegion commented 2 years ago

Emm if I use os.environ = '/content/tmp/numba_cache', it seems that it takes more time for me to load the normal packages. Is there any better solution?

Is this the code you use on Colab, or did you make a typo when copying it here? It should be:

import os

os.environ["NUMBA_CACHE_DIR"] = "/..."

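If you did actually write os.environ = ..., then I would expect the Python interpreter session to crash, or at least to be badly compromised, because that assignment replaces the whole environment mapping with a string instead of setting one variable.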
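The difference between the two forms can be checked with a small standard-library sketch. The path is the one quoted in this thread; the try/finally only exists so that the demonstration restores the real mapping afterwards.

```python
import os

original_environ = os.environ  # keep a handle so the demo can undo the damage

# Mistaken form: rebinds the name `os.environ` to a plain string.
os.environ = "/content/tmp/numba_cache"
try:
    os.environ.get("NUMBA_CACHE_DIR")
    lookup_failed = False
except AttributeError:
    lookup_failed = True  # a str has no .get(), so environment lookups break
finally:
    os.environ = original_environ  # restore the real mapping

# Intended form: set a single key in the real environment mapping.
os.environ["NUMBA_CACHE_DIR"] = "/content/tmp/numba_cache"

print(lookup_failed, os.environ.get("NUMBA_CACHE_DIR"))
```

With the mistaken form, NUMBA_CACHE_DIR is never set at all, so numba would keep using its default cache location.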

HelloWorldLTY commented 2 years ago

Ok, thanks.


HelloWorldLTY commented 2 years ago
"NUMBA_CACHE_DIR"

Sorry to bother you again, but this method still does not seem to work. (Screenshot attached.)