pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

[Windows] saving/loading TSNEEmbedding objects to pickle, Directory error #210

Open tomcsojn opened 2 years ago

tomcsojn commented 2 years ago
Expected behaviour

On Windows, I tried to save the TSNEEmbedding object with pickle.dump(embeddings,open(os.path.join(self.models_path,"tsne_global_embeddings.sav"),"wb")). I also tried saving the affinities as an array with numpy.save("file.npy",affinities) so that I could reconstruct the object later.

Both lines work fine on the Linux distributions I tried. On Windows, both fail with the error below, and loading the saved files back fails with the same error.

Actual behaviour

Windows can't find or create the temporary directory or files when trying to touch the file. Unfortunately, I haven't had time to look deeper into what could cause this behaviour.

*** NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\tomcs\\AppData\\Local\\Temp\\tmp7biujwdz\\tmp.ann'

Steps to reproduce the behavior

opentsne==0.6.2

I think this would be the same with most settings, although I am using the following settings to train before trying to save.

affinities = openTSNE.affinity.PerplexityBasedNN(
    X,
    perplexity=500,
    n_jobs=32,
    random_state=0,
)
init = openTSNE.initialization.pca(X, n_components=3, random_state=42)
tsne = openTSNE.TSNE(
    3,
    exaggeration=None,
    n_jobs=16,
    verbose=True,
    negative_gradient_method="bh",
)
embeddings = tsne.fit(affinities=affinities, initialization=init)
pickle.dump(embeddings, open("tsne_global_embeddings.sav", "wb"))

gaardhus commented 2 years ago

I've been having the same issue. After downgrading to openTSNE==0.6.0 I was able to both save and load the fitted model.

The example data from the README, however, works with 0.6.0, 0.6.1, and 0.6.2 for me. Could this have something to do with the number of observations and/or the resulting model size?

This is the error message I get. I tried deleting the temp folder before re-running and also tried restarting my computer, without any success.

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\shutil.py", line 625, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\tobia\\AppData\\Local\\Temp\\tmpvvo0p0t8\\tmp.ann'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 805, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\tobia\\AppData\\Local\\Temp\\tmpvvo0p0t8\\tmp.ann'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\tobia\Programming\generate_example_data.py", line 126, in <module>
    pickle.dump(dv_model, f)
  File "C:\Users\tobia\Programming\venv\lib\site-packages\openTSNE\nearest_neighbors.py", line 353, in __getstate__
    b64_index = base64.b64encode(f.read())
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 830, in __exit__
    self.cleanup()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 834, in cleanup
    self._rmtree(self.name)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 816, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\shutil.py", line 757, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\shutil.py", line 627, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 808, in onerror
    cls._rmtree(path)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 816, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\shutil.py", line 757, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\shutil.py", line 608, in _rmtree_unsafe
    onerror(os.scandir, path, sys.exc_info())
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\shutil.py", line 605, in _rmtree_unsafe
    with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\tobia\\AppData\\Local\\Temp\\tmpvvo0p0t8\\tmp.ann'

From the example which runs without any errors:

import pickle
import openTSNE
from openTSNE import TSNE
from sklearn import datasets
print(openTSNE.__version__)

iris = datasets.load_iris()
x, y = iris["data"], iris["target"]

embedding = TSNE().fit(x)

with open('iris.pkl', 'wb') as f:
    pickle.dump(embedding, f)

with open('iris.pkl', 'rb') as f:
    embedding = pickle.load(f)

embedding.transform(x)

pavlin-policar commented 1 year ago

In v0.6.2, I fixed the pickling behaviour so it works as one would expect. Annoy, which is used to find nearest neighbors, is not picklable and instead uses its own internal file format. In previous versions, the Annoy nearest neighbor index was therefore saved to a separate file, so pickling a TSNEEmbedding object actually produced two files: an Annoy file and a pickle file. This was very fragile and caused a ton of problems if you wanted to move files around.

In v0.6.2, I got rid of this and just included everything in the one pickle file. What happens now is that the annoy file is still saved to disk, but in a temporary directory (this is the reason for the C:\\Users\\tobia\\AppData\\Local\\Temp\\tmpvvo0p0t8\\tmp.ann directory). This file is then serialized into a binary string and saved to the pickle. When unpickling, we then write this binary string into another temporary file, which annoy can then read.
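For readers following along, the round trip described above can be sketched roughly as follows. This is a minimal illustration, not the openTSNE implementation: FileOnlyIndex is a hypothetical stand-in for an index that, like Annoy, can only save to and load from a path on disk, and PicklableIndex is a hypothetical wrapper. The stand-in closes its files after every call, so this sketch does not reproduce the Windows error by itself.

```python
import base64
import os
import pickle
import tempfile


class FileOnlyIndex:
    """Hypothetical stand-in for an index that only supports file-based save/load."""

    def __init__(self, items=None):
        self.items = list(items or [])

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(str(item) for item in self.items))

    def load(self, path):
        with open(path, encoding="utf-8") as f:
            self.items = [int(line) for line in f.read().splitlines()]


class PicklableIndex:
    """Hypothetical wrapper that embeds the on-disk index into the pickle."""

    def __init__(self, index):
        self.index = index

    def __getstate__(self):
        # Save to a temporary directory, then read the bytes back as base64.
        with tempfile.TemporaryDirectory() as dirname:
            path = os.path.join(dirname, "tmp.ann")
            self.index.save(path)
            with open(path, "rb") as f:
                b64_index = base64.b64encode(f.read())
        return {"b64_index": b64_index}

    def __setstate__(self, state):
        # Write the bytes to another temporary file so the index can load them.
        index = FileOnlyIndex()
        with tempfile.TemporaryDirectory() as dirname:
            path = os.path.join(dirname, "tmp.ann")
            with open(path, "wb") as f:
                f.write(base64.b64decode(state["b64_index"]))
            index.load(path)
        self.index = index


if __name__ == "__main__":
    original = PicklableIndex(FileOnlyIndex([1, 2, 3]))
    restored = pickle.loads(pickle.dumps(original))
    print(restored.index.items)
```

In both methods the temporary directory is removed when the with block exits. If the index library still holds the file open at that point, the cleanup is where a failure would surface, which matches the tracebacks in this thread.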

As to your problem: would it be possible that you don't have permission to write to the Windows temp directory? I don't have a Windows machine, so I can't really test this out. There doesn't seem to be anything wrong with the actual path to the temporary file.

I'll close this for now, but if this problem persists, please reopen this issue.

hageldave commented 5 months ago

Something is still causing problems with this temporary annoy file on windows. See #210. Why are you saving to disk anyways? Can it not be written to memory, like a byte buffer (io.BytesIO)?

pavlin-policar commented 2 months ago

This is still an ongoing problem and limited to Windows. I'm not sure if it could be written to a byte buffer, but my feeling is that it would be difficult.

Currently, we depend on Annoy for nearest neighbor search. Since Annoy is not available on conda-forge (or wasn't when I was incorporating it), we keep its code in the openTSNE/dependencies directory. The thing is that Annoy has its own saving functions, which we use, but these only support saving to a file on disk.

I'm sure we could go tinker around with their source code to support this, but this would make upgrading Annoy to newer versions a nightmare, since my current approach is to simply copy/paste their implementation. I'm also not familiar with their codebase, nor that proficient in C++ in general, so I really don't want to have to change the Annoy implementation at all. This would make maintenance very difficult.

I don't have access to a Windows machine, so debugging this is very difficult for me. I'm not really sure why this would be happening in any case, since the problem seems to be that we're just creating a temporary directory, which should be supported by the standard Python library.

If anyone has any idea on how to tackle this, I'd be very grateful.

pavlin-policar commented 2 months ago

The same problems were reported in #244 and #260.

I am moving the discussion here, so it's all in one spot.

pavlin-policar commented 2 months ago

This seems highly relevant, from #244

I'm experiencing a similar issue when working on Ubuntu with NFS as the openTSNE file system. What we see is that nearest_neighbors is trying to delete a file located on the NFS, a file that is still in use by the process itself. The error we see is (opentsne==1.0.0):

  File "/usr/local/lib/python3.10/dist-packages/openTSNE/nearest_neighbors.py", line 358, in __setstate__
    with tempfile.TemporaryDirectory() as dirname:
  File "/usr/lib/python3.10/tempfile.py", line 1008, in __exit__
    self.cleanup()
  File "/usr/lib/python3.10/tempfile.py", line 1012, in cleanup
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "/usr/lib/python3.10/tempfile.py", line 994, in _rmtree
    _rmtree(name, onerror=onerror)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/tmp9zxbbwwq'

My assumption is that the code "works" without issues on standard Unix/Linux systems because of the delete-on-last-close practice: an application has a file open but issues a delete (unlink) on that file anyway. On a native Linux file system (as opposed to NFS and, of course, Windows), this results in the file becoming invisible to other processes, even though it still exists and is still open.

Perhaps I'm wrong but this makes sense to me, especially after reading this comment in issue #210

Originally posted by @shmulikah in https://github.com/pavlin-policar/openTSNE/issues/244#issuecomment-2059878936
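The delete-on-last-close behaviour described in the quoted comment can be checked with a small script like the one below. It is only a sketch: the plain open() handle is an assumed stand-in for whatever keeps tmp.ann open inside the index library, and unlink_while_open is a hypothetical helper name.

```python
import os
import shutil
import tempfile


def unlink_while_open():
    """Try to delete a file that this process still has open.

    Returns a tuple (deleted, data): whether os.unlink succeeded, and the
    bytes read through the still-open handle afterwards.
    """
    dirname = tempfile.mkdtemp()
    path = os.path.join(dirname, "tmp.ann")
    with open(path, "wb") as f:
        f.write(b"index bytes")

    # Assumed stand-in for a library that keeps the index file open.
    handle = open(path, "rb")
    try:
        try:
            os.unlink(path)
            deleted = True
        except OSError:
            # On Windows this is expected to be a PermissionError (WinError 32).
            deleted = False
        # The open handle remains readable whether or not the unlink worked.
        data = handle.read()
    finally:
        handle.close()
        shutil.rmtree(dirname, ignore_errors=True)
    return deleted, data


if __name__ == "__main__":
    print(unlink_while_open())
```

On a local POSIX file system the unlink should succeed while the handle stays readable. On Windows, the unlink is expected to fail for as long as the handle is open, which is consistent with the WinError 32 in the tracebacks above.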

hageldave commented 1 month ago

Alright, I dug a little into the tempfile documentation, and I think it describes the exact issue that is happening here. From the documentation:

On Windows, if delete_on_close is false, and the file is created in a directory for which the user lacks delete access, then the os.unlink() call on exit of the context manager will fail with a PermissionError. This cannot happen when delete_on_close is true because delete access is requested by the open, which fails immediately if the requested access is not granted.

While this is not exactly what is happening in openTSNE code (creating a temporary directory instead and letting annoy create a file inside) it is probably related. https://github.com/pavlin-policar/openTSNE/blob/78c84b6b33de97977006f0b41d85b295215758ff/openTSNE/nearest_neighbors.py#L323-L330

If somebody has time and a Windows machine, they could try fiddling around here. For example, create the "tmp.ann" beforehand as a temporary file using tempfile.NamedTemporaryFile(... , delete_on_close=True).
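A possible starting point for such an experiment is sketched below. Note that it swaps in delete_on_close=False rather than True: according to the tempfile documentation, on Windows a file opened with delete_on_close=True can only be reopened by name if the second open shares delete access, which a path-only writer is unlikely to do. The delete_on_close parameter needs Python 3.12 or newer, so the sketch falls back to delete=False plus a manual unlink on older versions. The names write_via_named_tempfile and fake_index_save are hypothetical, and none of this has been tested with Annoy or on Windows.

```python
import os
import sys
import tempfile


def fake_index_save(path):
    """Hypothetical stand-in for a library call that can only write to a path."""
    with open(path, "wb") as f:
        f.write(b"index bytes")


def write_via_named_tempfile(save_to_path):
    """Let a path-only writer fill a named temporary file and return its bytes."""
    if sys.version_info >= (3, 12):
        # The file survives tmp.close() and is removed when the with block exits.
        with tempfile.NamedTemporaryFile(suffix=".ann", delete_on_close=False) as tmp:
            tmp.close()  # release our handle so the writer can open the path
            save_to_path(tmp.name)
            with open(tmp.name, "rb") as f:
                return f.read()

    # Older Python versions: keep the file on close and remove it ourselves.
    tmp = tempfile.NamedTemporaryFile(suffix=".ann", delete=False)
    try:
        tmp.close()
        save_to_path(tmp.name)
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        os.unlink(tmp.name)


if __name__ == "__main__":
    print(write_via_named_tempfile(fake_index_save))
```

If the index library keeps the file open after saving, the final removal may still fail on Windows, so this would need to be verified on a real Windows machine.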

pavlin-policar commented 1 month ago

Thanks for digging into this! Unfortunately, I don't have access to a Windows machine, so I can't test this out. So, if I understand correctly, a potential fix is to replace the temporary directory with a tempfile.NamedTemporaryFile created with delete_on_close=True?