scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Opening file with "w" while it's open in backed mode elsewhere still deletes file contents #522

Open Hrovatin opened 3 years ago

Hrovatin commented 3 years ago

When I open an adata in backed='r' mode in one script and then try to modify the same adata file from another script, the file gets corrupted (its size becomes 0). I would expect the write to raise an error before corrupting the file.

Versions


-----
anndata     0.7.4
scanpy      1.7.1
sinfo       0.3.1
-----
PIL                 7.2.0
anndata             0.7.4
backcall            0.2.0
cairo               1.19.1
cffi                1.14.0
cloudpickle         1.3.0
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dask                2.21.0
dateutil            2.8.0
decorator           4.4.2
future_fstrings     NA
get_version         2.1
google              NA
h5py                2.10.0
igraph              0.8.2
ipykernel           5.3.3
ipython_genutils    0.2.0
jedi                0.17.2
joblib              0.16.0
kiwisolver          1.2.0
legacy_api_wrap     0.0.0
leidenalg           0.8.1
llvmlite            0.35.0
louvain             0.6.1
matplotlib          3.3.0
mpl_toolkits        NA
natsort             7.0.1
numba               0.52.0
numexpr             2.7.1
numpy               1.19.4
packaging           20.9
pandas              1.0.5
parso               0.7.0
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.5
psutil              5.7.2
ptyprocess          0.6.0
pygments            2.6.1
pyparsing           2.4.7
pytz                2020.1
scanpy              1.7.1
scipy               1.4.1
setuptools_scm      NA
simplejson          3.17.2
sinfo               0.3.1
six                 1.15.0
sklearn             0.23.1
storemagic          NA
tables              3.6.1
texttable           1.6.2
tlz                 0.10.0
toolz               0.10.0
tornado             6.0.4
traitlets           4.3.3
typing_extensions   NA
wcwidth             0.2.5
yaml                5.3.1
zmq                 19.0.1
zope                NA
-----
IPython             7.16.1
jupyter_client      6.1.6
jupyter_core        4.6.3
notebook            6.0.3
-----
Python 3.8.5 | packaged by conda-forge | (default, Jul 22 2020, 17:31:50) [GCC 7.5.0]
Linux-3.10.0-1160.11.1.el7.x86_64-x86_64-with-glibc2.10
64 logical CPU cores, x86_64
-----
Session information updated at 2021-03-21 21:14

ivirshup commented 3 years ago

Could you share an example of what code you ran?

My impression was that HDF5 generally prevents modifying a file that other processes are reading.

Hrovatin commented 3 years ago

Do this in one terminal/notebook:

>>> import anndata
>>> import scanpy as sc
>>> import os
>>> import pandas as pd
# Make the file
>>> a=anndata.AnnData(pd.DataFrame([[1,2],[3,4]]))
>>> a.write('temp.h5ad')
# Open file in backed mode
>>> a=sc.read('temp.h5ad',backed='r')
# Check file size - file is not empty based on st_size
>>> os.stat('temp.h5ad')
os.stat_result(st_mode=33188, st_ino=5418427297, st_dev=47, st_nlink=1, st_uid=141653, st_gid=20000, st_size=11048, st_atime=1616392873, st_mtime=1616392873, st_ctime=1616392873)

Keep the above running, do the below in another terminal/notebook:


>>> import scanpy as sc
>>> import os
# Open the adata file again and do some work on it; during this the file is still OK (see st_size in the os.stat output)
>>> a=sc.read('temp.h5ad')
>>> a.obs_names=['a','b']
>>> os.stat('temp.h5ad')
os.stat_result(st_mode=33188, st_ino=5418427297, st_dev=47, st_nlink=1, st_uid=141653, st_gid=20000, st_size=11048, st_atime=1616392873, st_mtime=1616392873, st_ctime=1616392873)
# Now try to save the adata to file - the same file that is open in backed mode in the other terminal
>>> a.write('temp.h5ad')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/anndata/_core/anndata.py", line 1846, in write_h5ad
    _write_h5ad(
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/anndata/_io/h5ad.py", line 87, in write_h5ad
    with h5py.File(filepath, mode) as f:
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/h5py/_hl/files.py", line 406, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/h5py/_hl/files.py", line 179, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 108, in h5py.h5f.create
OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
# Now st_size drops to 0
>>> os.stat('temp.h5ad')
os.stat_result(st_mode=33188, st_ino=5418427297, st_dev=47, st_nlink=1, st_uid=141653, st_gid=20000, st_size=0, st_atime=1616393032, st_mtime=1616393032, st_ctime=1616393032)

ivirshup commented 3 years ago

So it looks like this can be reproduced without anndata, just h5py. It's weird to me that it throws an error but also deletes the file (or at least all the contents, but the filesystem still has the path). I can follow this up upstream.

ivirshup commented 3 years ago

@Hrovatin, I'd check out the issue at h5py for more info. As it pertains to AnnData though, what exactly were you trying to do?

Did you want the file to be overwritten by the second process, or were you looking for it to update it? E.g. would you expect one of those processes to error?

Hrovatin commented 3 years ago

I had the file open in backed mode in one process. I then decided to do some work in another notebook and forgot that the file was still open in the first one. When I tried to save the file from the second notebook, it was corrupted. I would expect the write to at least error out before corrupting the file, so that I could close it in the first notebook before saving from the second. I think not allowing a file to be modified while it is open in backed mode somewhere else is the safest option.

Hrovatin commented 2 years ago

Any update on this? Yet another of my adatas was deleted this way.

Hrovatin commented 2 years ago

@ivirshup Maybe a temporary solution would be info on how to close a backed file. For now I usually just copy the info I need from the adata and delete the object immediately, but if there is a "close" function, that would be nice to know.

Hrovatin commented 2 years ago

@ivirshup any updates on this? Alternatively, is there a good way to check whether an anndata file is open somewhere else before saving?

ivirshup commented 2 years ago

@Hrovatin, no updates yet. But reading this with fresh eyes, I might have something.

We could check whether there's a file ref open, if h5py allows that.
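
As a rough sketch of what such a check might look like outside of h5py: the "unable to lock file" error in the traceback above suggests HDF5 takes an advisory lock on the file, so one could probe for that lock before writing. The helper below assumes a POSIX system where the lock is taken with flock() on a local filesystem; `is_flocked` is a hypothetical name, not an anndata or h5py function. The check is also inherently racy, since another process can open the file between the probe and the write.

```python
import errno
import fcntl
import os


def is_flocked(path):
    """Return True if some other open file description holds a flock() on path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # Non-blocking attempt at an exclusive lock. It fails when any
            # other open file description holds a shared or exclusive lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK, errno.EACCES):
                return True
            raise
        # We got the lock, so nobody else held one; release it again.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

A caller could refuse to write when `is_flocked("temp.h5ad")` returns True. This would not detect readers that opened the file with locking disabled.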

Hrovatin commented 2 years ago

Before, I have an adata saved on disk with a size of N GB. Afterwards, the adata file has a size of 0 GB and you can no longer do anything with it, i.e. all the data is gone, as far as I understand.

ivirshup commented 2 years ago

Looking some more and found an old issue I opened: https://github.com/h5py/h5py/issues/1864

It looks like this is basically an upstream bug (link to the Jira tracker, which is a little difficult to use), and it's unclear if or when it will be fixed. There doesn't seem to be a clean solution from Python.


One thing that could work is if the backed file was opened with h5py.File(..., locking=False). Then the file should just be overwritten.

I'm not sure if this is a good default, since it changes behavior from HDF5's. Ideally we would error before any data gets truncated. However, we could allow passing that argument through when reading a file in backed mode.

Hrovatin commented 2 years ago

Whatever you can do to prevent files from being deleted would be very useful for me, especially now that we no longer have snapshots on the server. Sometimes I forget to shut down a notebook where I have something open in backed mode, and then the data is gone.

ivirshup commented 2 years ago

If you want to just prevent deletion in general, we could let you pass "w-" or "x" to write_h5ad. But this is specifically if you don't want to overwrite data regardless of whether any other process has it open.

Hrovatin commented 2 years ago

No, I would like to overwrite files, but prevent them from getting corrupted if another process has them open. I often update adatas, so I want to change (rewrite) the existing object on disk.
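
One possible workaround for this use case, sketched here with the standard library only, is to never open the existing file in "w" mode at all: write to a temporary file in the same directory and then move it over the target with os.replace. The truncation in the traceback above happens when the existing path is opened for writing, so this route should avoid it. `write_then_replace` is a hypothetical helper, not part of anndata. It assumes a POSIX filesystem, where a process that still has the old file open simply keeps reading the old contents.

```python
import os
import tempfile


def write_then_replace(path, write_fn):
    """Call write_fn(tmp_path), then atomically move tmp_path over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that os.replace
    # stays on one filesystem and is a single rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        # The target is untouched on failure; only clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With an AnnData object `a`, this might be used as `write_then_replace('temp.h5ad', a.write_h5ad)`. Note that the new file gets the permissions of the temporary file rather than those of the original, and that a backed reader will not see the update until it reopens the path.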

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

flying-sheep commented 1 year ago

Seems like this is a valid use case, and we’re just not clear how to enable it.

Or is something still unclear about this?