Open Hrovatin opened 3 years ago
Could you share an example of what code you ran?
My impression was hdf5 generally prevented modifying a file other processes were reading.
Do this in one terminal/notebook:
>>> import anndata
>>> import scanpy as sc
>>> import os
>>> import pandas as pd
# Make the file
>>> a=anndata.AnnData(pd.DataFrame([[1,2],[3,4]]))
>>> a.write('temp.h5ad')
# Open file in backed mode
>>> a=sc.read('temp.h5ad',backed='r')
# Check file size - file is not empty based on st_size
>>> os.stat('temp.h5ad')
os.stat_result(st_mode=33188, st_ino=5418427297, st_dev=47, st_nlink=1, st_uid=141653, st_gid=20000, st_size=11048, st_atime=1616392873, st_mtime=1616392873, st_ctime=1616392873)
Keep the above running, do the below in another terminal/notebook:
>>> import scanpy as sc
>>> import os
# Open the adata file again and do some work on it; during this the file is still ok (see st_size in the os.stat output)
>>> a=sc.read('temp.h5ad')
>>> a.obs_names=['a','b']
>>> os.stat('temp.h5ad')
os.stat_result(st_mode=33188, st_ino=5418427297, st_dev=47, st_nlink=1, st_uid=141653, st_gid=20000, st_size=11048, st_atime=1616392873, st_mtime=1616392873, st_ctime=1616392873)
# Now try to save the adata to the same file that is open in backed mode in the other terminal
>>> a.write('temp.h5ad')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/anndata/_core/anndata.py", line 1846, in write_h5ad
_write_h5ad(
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/anndata/_io/h5ad.py", line 87, in write_h5ad
with h5py.File(filepath, mode) as f:
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/h5py/_hl/files.py", line 406, in __init__
fid = make_fid(name, mode, userblock_size,
File "/home/icb/karin.hrovatin/miniconda3/envs/rpy2_3/lib/python3.8/site-packages/h5py/_hl/files.py", line 179, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 108, in h5py.h5f.create
OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
# Now st_size drops to 0
>>> os.stat('temp.h5ad')
os.stat_result(st_mode=33188, st_ino=5418427297, st_dev=47, st_nlink=1, st_uid=141653, st_gid=20000, st_size=0, st_atime=1616393032, st_mtime=1616393032, st_ctime=1616393032)
So it looks like this can be reproduced without anndata, just h5py. It's weird to me that it throws an error but also deletes the file (or at least all the contents, but the filesystem still has the path). I can follow this up upstream.
@Hrovatin, I'd check out the issue at h5py for more info. As it pertains to AnnData though, what exactly were you trying to do?
Did you want the file to be overwritten by the second process, or were you looking for it to update it? E.g. would you expect one of those processes to error?
I had the file open in one process in backed mode. I then decided that I needed to do some work in another notebook and forgot that the file was still open in the first notebook. When I tried to save the file from the second notebook, it was corrupted. I would expect it at least to error out before corrupting the file, so that I could close it in the first notebook before trying to save it in the second. I think not letting you modify a file while it is open in backed mode somewhere else is the safest option.
Any update on this? Yet another of my adatas was deleted this way.
@ivirshup Maybe a temporary solution would be info on how to close a backed file - now I usually just copy info from adata and delete the object immediately, but if there was a "close" function that would be nice to know.
@ivirshup any updates on this? Alternatively, is there a good way to check whether an anndata file is open somewhere before saving?
@Hrovatin, no updates yet. But reading this with fresh eyes, I might have something.
We could check and see if there's a file ref open, if h5py allows that
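As a rough illustration of that kind of check, here is a minimal sketch that probes for an existing lock from outside h5py. It assumes the HDF5 build takes its lock with `flock()`, which the `errno = 11` in the traceback suggests but which I have not confirmed for every platform. `looks_locked` is a hypothetical helper, not something anndata or h5py provides, and it is POSIX only.

```python
import fcntl
import os


def looks_locked(path):
    """Return True if some other open handle holds a flock() lock on path.

    Assumption: the process holding the file open locked it with flock().
    Hypothetical helper; POSIX only.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask for an exclusive lock without blocking; this fails if any
        # shared or exclusive lock is already held on the file.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        # Nobody held a lock; release ours again straight away.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Calling `looks_locked('temp.h5ad')` before writing would only be a best-effort guard: another process can still open the file between the check and the write, and locks on network filesystems may behave differently.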
Before, I have an adata saved on disk with a size of N GB. After this, the adata file has a size of 0 and you can no longer do anything with it, i.e. all data is gone, as far as I understand.
Looking some more and found an old issue I opened: https://github.com/h5py/h5py/issues/1864
It looks like this is basically an upstream bug (link to jira tracker, a little difficult to use), and it's unclear if or when it will be fixed. It looks like there isn't a clean solution from Python.
One thing that could work is if the backed file was opened with h5py.File(..., locking=False). Then the file should just be overwritten.
I'm not sure if this is a good default, since it changes behavior from hdf5. Ideally we would error before any data gets truncated. However, we could allow passing that argument through when reading a file in backed mode.
Whatever you can do to prevent deletion of files would be very useful for me, especially now that we no longer have snapshots on the server, because sometimes I forget to shut down a notebook where I have something open in backed mode and then the data is gone.
If you want to just prevent deletion in general, we could let you pass "w-" or "x" to write_h5ad. But this is specifically if you don't want to overwrite data regardless of whether any other process has it open.
No, I would like to overwrite files, but prevent them from getting corrupted if another process has them open. I often update adatas, so I want to change (rewrite) the existing object on disk.
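One user-side workaround for that use case, sketched under assumptions: write to a temporary file in the same directory and only swap it over the target once the write has succeeded, so a failed write never truncates the original. `write_then_replace` and its `write_func` argument are hypothetical names. I have not checked how a process that still has the old file open in backed mode behaves after the swap; on POSIX it would normally keep seeing the old contents.

```python
import os
import tempfile


def write_then_replace(path, write_func):
    """Write via write_func to a temp file, then atomically replace path.

    write_func is a hypothetical callable that takes a file path and writes
    the data there, for example adata.write_h5ad.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=directory)
    os.close(fd)
    try:
        write_func(tmp_path)
        # The original file is only touched here, after a successful write.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and leave the original as it was.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Usage would be something like `write_then_replace('temp.h5ad', a.write_h5ad)`. Note that the replaced file gets the permissions of the temporary file, which `mkstemp` creates as readable and writable by the owner only.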
Seems like this is a valid use case, and we’re just not clear how to enable it.
Or is something still unclear about this?
When I open adata in backed='r' mode in one script and then I try to modify the same adata/file from another script the adata file gets corrupted (its size becomes 0). I would expect writing to return an error before corrupting the file.
Versions