tbmalt / tbmalt

Tight Binding Machine Learning Toolkit
GNU Lesser General Public License v3.0
35 stars 10 forks

Examples not working beyond example 01 on development branch #53

Open jarvist opened 3 weeks ago

jarvist commented 3 weeks ago

Once we have a working Conda install of TBMaLT, we can run examples/example_01/example_01.py successfully.

However, examples/example_02/example_02.py fails with an HDF5 KeyError while trying to look up a key, as do the rest of the examples.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
example_02.py in line 89
     86 # 2.1: Target system specific objects
     87 # -----------------------------------
     88 if fit_model:
---> 89     dataloder = load_target_data(target_path, sources, targets)
     90 else:
     91     raise NotImplementedError()

example_02.py in line 79, in load_target_data(path, sources, targets)
     77 for sou in sources:
     78     _sources.extend([sou + '/' + i for i in (f[sou].keys())])
---> 79 return DataSetIM.load_data(path, _sources, targets)

File \tbmalt\tbmalt\io\dataset.py:198, in DataSetIM.load_data(cls, path, sources, targets, pbc, device)
    192 geometry = reduce(
    193     operator.add,
    194     [_load_structure(database[source], pbc=pbc, device=device)
    195      for source in sources])
    197 # Load and pack the requested target datasets from each system.
--> 198 data = {
    199     target_name: pack([
    200         torch.tensor(database[join(source, target)],
    201                      device=device)
    202         for source in sources]
    203     ) for target_name, target in targets.items()}
    205 if 'label' in database[sources[0]].attrs:
    206     labels = [database[source].attrs['label']
    207               for source in sources]

File \tbmalt\tbmalt\io\dataset.py:199, in <dictcomp>(.0)
    192 geometry = reduce(
    193     operator.add,
    194     [_load_structure(database[source], pbc=pbc, device=device)
    195      for source in sources])
    197 # Load and pack the requested target datasets from each system.
    198 data = {
--> 199     target_name: pack([
    200         torch.tensor(database[join(source, target)],
    201                      device=device)
    202         for source in sources]
    203     ) for target_name, target in targets.items()}
    205 if 'label' in database[sources[0]].attrs:
    206     labels = [database[source].attrs['label']
    207               for source in sources]

File \tbmalt\tbmalt\io\dataset.py:200, in <listcomp>(.0)
    192 geometry = reduce(
    193     operator.add,
    194     [_load_structure(database[source], pbc=pbc, device=device)
    195      for source in sources])
    197 # Load and pack the requested target datasets from each system.
    198 data = {
    199     target_name: pack([
--> 200         torch.tensor(database[join(source, target)],
    201                      device=device)
    202         for source in sources]
    203     ) for target_name, target in targets.items()}
    205 if 'label' in database[sources[0]].attrs:
    206     labels = [database[source].attrs['label']
    207               for source in sources]

File h5py\_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py\_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File c:\ProgramData\Anaconda3\lib\site-packages\h5py\_hl\group.py:328, in Group.__getitem__(self, name)
    326         raise ValueError("Invalid HDF5 object reference")
    327 elif isinstance(name, (bytes, str)):
--> 328     oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
    329 else:
    330     raise TypeError("Accessing a group is done with bytes or str, "
    331                     " not {}".format(type(name)))

File h5py\_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py\_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py\h5o.pyx:190, in h5py.h5o.open()

KeyError: "Unable to open object (object 'CHHHH0\\dipole' doesn't exist)"

Are the example .h5 files consistent with the latest version of the code? The H5 files appear to contain the correct keys. We are initially running on Windows.

WbSun723 commented 2 weeks ago

@jarvist Hi there, thanks a lot for your feedback. I will have a look into this bug.

jarvist commented 2 weeks ago

I think we've managed to sort this out ourselves. On Windows, many of the examples fail because the h5py package lacks a suitable binary dependency, yet it does not fail with a useful error message. (Perhaps a check that the .h5 file is being opened properly would be a good addition to the examples?)
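A lightweight sanity check of that kind could verify the file's signature before handing it to h5py; the eight-byte magic number below is the one defined in the HDF5 file-format specification, and `looks_like_hdf5` is just an illustrative helper name, not part of TBMaLT:

```python
# First 8 bytes of every HDF5 superblock, per the HDF5 file-format spec.
HDF5_MAGIC = b'\x89HDF\r\n\x1a\n'

def looks_like_hdf5(path):
    """Return True if the file at `path` starts with the HDF5 signature."""
    with open(path, 'rb') as f:
        return f.read(len(HDF5_MAGIC)) == HDF5_MAGIC
```

Failing early with a clear "this is not a valid HDF5 file" message would be far friendlier than the KeyError deep inside h5py.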

On Linux we found that downgrading Python allowed a working h5py package to be built.

On our Debian machine, this required a Conda install of: python=3.8 pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1

So a little bit suboptimal in terms of error messages, but resolved!

WbSun723 commented 1 week ago

Hi @jarvist,

Thanks a lot for your valuable feedback!

Currently we only test the code on Linux machines; the testing environment can be found in .github/workflows/ci.yml, which uses python-version: ["3.8", "3.9", "3.10"] with the following package settings:

      python -m pip install --upgrade pip
      pip install pytest h5py ase typing 'pydantic>=1.10.0,<2.0.0' tomli dscribe 
      pip3 install torch==1.12.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

In the latest pull request, the Python and PyTorch versions will be updated to python-version: ["3.11", "3.12"], with

      python -m pip install --upgrade pip
      pip install pytest h5py ase typing 'pydantic>=1.10.0,<2.0.0' tomli dscribe 
      pip3 install torch==2.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Thanks again for your report. We may test on other machines in the future, and we will add documentation on the environment settings for the released version.

Best, Wenbo

mcsloy commented 1 week ago

Hello @jarvist. I have taken a look at this and it appears to be an operating-system-specific (Windows) bug. The h5py package requires data paths to be POSIX compliant, like so: "this/path/is/posix/compliant". However, Windows uses backslashes as path-component separators, like so: "this\path\is\not\posix\compliant". The os.path.join method used to format these paths defaults to the separator convention of the host operating system. So when the example is run on Windows, the paths produced by os.path.join are not POSIX compliant, and the h5py package gets rather upset; hence it stating that it cannot resolve the path "'CHHHH0\\dipole'". We can patch this by using the posixpath.join method rather than os.path.join for HDF5-related pathing.
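The difference can be reproduced on any platform with the standard-library ntpath and posixpath modules, which implement Windows-style and POSIX-style joining respectively; the group and dataset names below are just the ones from the traceback above:

```python
import ntpath     # Windows-style path joining, regardless of host OS
import posixpath  # POSIX-style path joining, regardless of host OS

source, target = 'CHHHH0', 'dipole'

# What os.path.join produces on Windows: an invalid HDF5 key.
windows_key = ntpath.join(source, target)    # 'CHHHH0\\dipole'

# What h5py expects as a group/dataset key on every platform.
posix_key = posixpath.join(source, target)   # 'CHHHH0/dipole'

print(windows_key, posix_key)
```

Since HDF5 keys are not filesystem paths at all, posixpath.join gives the correct separator on every platform.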