vjcitn / BiocPBG

Interfaces to PyTorch-BigGraph
https://vjcitn.github.io/BiocPBG/

Can't use on Terra: h5py read failure with h5py 3.10 or 3.9 #2

Open vjcitn opened 10 months ago

vjcitn commented 10 months ago
  storage = tensor.storage_type()._new_shared(size.numel())
2023-12-18 15:16:16,822   [Trainer-0] Creating workers...
2023-12-18 15:16:16,938   [Trainer-0] Initializing global model...
2023-12-18 15:16:16,943   [Trainer-0] Starting epoch 1 / 5, edge path 1 / 1, edge chunk 1 / 1
2023-12-18 15:16:16,943   [Trainer-0] Edge path: tr
2023-12-18 15:16:16,944   [Trainer-0] still in queue: 0
2023-12-18 15:16:16,944   [Trainer-0] Swapping partitioned embeddings None ( 0 , 0 )
2023-12-18 15:16:16,944   [Trainer-0] Loading partitioned embeddings from checkpoint
Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  OSError: [Errno 14] Can't read data (file read failed: time = Mon Dec 18 15:16:17 2023
, filename = '/home/rstudio/BiocPBG/inst/scripts/myfold5b/tr/edges_0_0.h5', file descriptor = 144, errno = 14, error message = 'Bad address', buf = 0x7f2ba5c0a000, total read size = 37703808, bytes this sub-read = 37703808, bytes actually read = 18446744073709551615, offset = 0)
Run `reticulate::py_last_error()` for details.
vjcitn commented 10 months ago

── Python Exception Message ───────────────────────────────────────────────────────────────────────────────────────────
Traceback (most recent call last):
  File "/home/rstudio/.local/lib/python3.10/site-packages/torchbiggraph/train.py", line 42, in train
    coordinator.train()
  File "/home/rstudio/.local/lib/python3.10/site-packages/torchbiggraph/train_cpu.py", line 629, in train
    edges = edge_storage.load_chunk_of_edges(
  File "/home/rstudio/.local/lib/python3.10/site-packages/torchbiggraph/graph_storages.py", line 464, in load_chunk_of_edges
    raise err
  File "/home/rstudio/.local/lib/python3.10/site-packages/torchbiggraph/graph_storages.py", line 440, in load_chunk_of_edges
    lhs_ds.read_direct(lhs.numpy(), source_sel=np.s_[begin:end])
  File "/home/rstudio/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py", line 1024, in read_direct
    self.id.read(mspace, fspace, dest, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 242, in h5py.h5d.DatasetID.read
  File "h5py/_proxy.pyx", line 113, in h5py._proxy.dset_rw
OSError: [Errno 14] Can't synchronously read data (file read failed: time = Mon Dec 18 15:02:36 2023
, filename = '/home/rstudio/BiocPBG/inst/scripts/myfold5b/tr/edges_0_0.h5', file descriptor = 144, errno = 14, error message = 'Bad address', buf = 0x7fa729c0a000, total read size = 37703808, bytes this sub-read = 37703808, bytes actually read = 18446744073709551615, offset = 0)

── R Traceback ────────────────────────────────────────────────────────────────────────────────────────────────────────
    ▆
 1. ├─base::source("newemb5.R", echo = TRUE)
 2. │ ├─base::withVisible(eval(ei, envir))
 3. │ └─base::eval(ei, envir)
 4. │   └─base::eval(ei, envir)
 5. └─BiocPBG::sce_to_embeddings(...) at newemb5.R:19:0
 6.   └─BiocPBG::train_eval(list(config = cc), pbg, evind = 1)
 7.     └─pbg$train$train(trc, subprocess_init = si)
 8.       └─reticulate:::py_call_impl(callable, call_args$unnamed, call_args$named)
> q()
rstudio@fb49ef814826:~/BiocPBG/inst/scripts$ /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
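
A note on reading the numbers in the OSError above. This is a minimal sketch for interpreting the message, not a diagnosis of the cause. It uses only the Python standard library, the values are copied from the traceback, and the variable names are illustrative. The "bytes actually read" figure is 2^64 - 1, which is -1 when read as a signed 64-bit integer, the conventional failure return of POSIX read(). Errno 14 is EFAULT, meaning the kernel rejected the destination buffer address. That would be consistent with a problem in the buffer h5py was asked to fill (the array passed to `read_direct`) rather than with the contents of `edges_0_0.h5`, but that last step is an assumption.

```python
import ctypes
import errno
import os

# Values copied from the error message in this issue.
reported_errno = 14
bytes_actually_read = 18446744073709551615
total_read_size = 37703808

# Look up the symbolic name and the platform's message for the errno.
errno_name = errno.errorcode[reported_errno]
errno_message = os.strerror(reported_errno)

# Reinterpret the unsigned 64-bit count as a signed 64-bit integer.
# ctypes does no overflow checking, so the bit pattern is simply reused.
signed_count = ctypes.c_int64(bytes_actually_read).value

print("errno:", errno_name, "-", errno_message)
print("bytes actually read, as signed 64-bit:", signed_count)
print("requested read size in MiB:", round(total_read_size / 2**20, 2))
```

If the signed count comes out as -1, the read call failed outright rather than returning a short read, so the whole 37703808-byte request was rejected at offset 0.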