mir-group / nequip

NequIP is a code for building E(3)-equivariant interatomic potentials
https://www.nature.com/articles/s41467-022-29939-5
MIT License
New installation will not run - "Failed to build object with prefix `dataset` using builder `NpzDataset`" 🐛 [BUG] #440

Open tft225 opened 6 days ago

tft225 commented 6 days ago

Describe the bug: When trying to run the minimal.yaml test, I get this output:

Processing dataset...
Loaded data: Batch(atomic_numbers=[21000, 1], batch=[21000], cell=[1000, 3, 3], edge_cell_shift=[220186, 3], edge_index=[2, 220186], forces=[21000, 3], pbc=[1000, 3], pos=[21000, 3], ptr=[1001], total_energy=[1000, 1])
processed data size: ~9.77 MB
Traceback (most recent call last):
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\shutil.py", line 791, in move
    os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q' -> 'results\aspirin\processed_dataset_afe51556e8a832da62377bded6857e80a9523c1b\.tmp-data.pth~'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 39, in _process_moves
    shutil.move(from_name, tmp_path)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\shutil.py", line 812, in move
    os.unlink(src)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\auto_init.py", line 243, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_npz_dataset.py", line 81, in __init__
    super().__init__(
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_base_datasets.py", line 152, in __init__
    super().__init__(root=root, type_mapper=type_mapper)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_base_datasets.py", line 43, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\torch_geometric\dataset.py", line 91, in __init__
    self._process()
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\torch_geometric\dataset.py", line 176, in _process
    self.process()
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_base_datasets.py", line 280, in process
    torch.save((data, self.include_frames), f)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\contextlib.py", line 120, in __exit__
    next(self.gen)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 182, in atomic_write
    _submit_move(Path(tp.name), Path(fname), blocking=blocking)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 128, in _submit_move
    _process_moves([obj])
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 43, in _process_moves
    _delete_files_if_exist([m[1] for m in moves])
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 25, in _delete_files_if_exist
    f.unlink(missing_ok=True)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\pathlib.py", line 1325, in unlink
    self._accessor.unlink(self)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\Scripts\nequip-train.exe\__main__.py", line 7, in <module>
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\scripts\train.py", line 83, in main
    trainer = fresh_start(config)
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\scripts\train.py", line 196, in fresh_start
    dataset = dataset_from_config(config, prefix="dataset")
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_build.py", line 78, in dataset_from_config
    instance, _ = instantiate(
  File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\auto_init.py", line 245, in instantiate
    raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`

To Reproduce: Install NequIP in a new environment created through Anaconda Navigator and run the given test: $ nequip-train configs/example.yaml

Expected behavior: The minimal.yaml training runs.

Environment:

Additional context: I tried multiple new environments created through Anaconda, with different Python/torch/CUDA versions; all showed the same issue.

cw-tan commented 6 days ago

Hi @tft225

Thank you for your interest in NequIP. On my WSL, the following works.

git clone https://github.com/mir-group/nequip.git
cd nequip
conda create -n nequip python=3.11
conda activate nequip
pip install torch
pip install -e .
pip install wandb
nequip-train configs/example.yaml

Running nequip-train configs/minimal.yaml also works.

Could you share the exact steps you took? That would help us figure this out. It may also help to delete the training directory before starting a new training run, so that leftover files from earlier attempts can be ruled out while debugging. Looking at the stack trace

File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 25, in _delete_files_if_exist
f.unlink(missing_ok=True)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\pathlib.py", line 1325, in unlink
self._accessor.unlink(self)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q'

it's related to pathlib. What version of pathlib (and Python) were you using in one of the cases that failed with this error?
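For context, WinError 32 is a Windows sharing violation: it is raised when a file is renamed or deleted while a handle to it is still open, whereas POSIX systems allow both. The sketch below is not NequIP's atomic_write. It is a minimal, hypothetical helper (atomic_write_sketch is an invented name) showing a write-then-move pattern that closes the temporary file before moving it, which is the ordering Windows requires. Whether the failing code path keeps the temporary file open during the move is an assumption based only on the stack trace.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def atomic_write_sketch(final_path, mode="wb"):
    """Hypothetical helper: write to a temp file, then move it into place."""
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # delete=False so that closing the handle does not remove the file.
    # Creating it next to the target keeps the final move on one filesystem.
    tmp = tempfile.NamedTemporaryFile(mode=mode, dir=final_path.parent, delete=False)
    tmp_path = Path(tmp.name)
    try:
        yield tmp
        # Close BEFORE moving: Windows refuses to rename or unlink an open file.
        tmp.close()
        os.replace(tmp_path, final_path)
    except BaseException:
        # On any failure, release the handle first, then clean up the temp file.
        tmp.close()
        tmp_path.unlink(missing_ok=True)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "processed" / "data.bin"
        with atomic_write_sketch(target) as f:
            f.write(b"payload")
        print(target.read_bytes())
```

Closing the handle inside the context manager, rather than leaving it to the caller, keeps the close-then-move ordering in one place.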

tft225 commented 6 days ago

The last time I tried I was on Python 3.9.19. It would have been whatever automatically installed when I installed nequip; I've since removed the environment, sorry. I can try again and check the version if that's necessary.
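Since pathlib ships with Python itself, the interpreter version determines the pathlib version. A small script along these lines could collect the relevant versions for the Environment section of the report. It is only a sketch: the function names and the package list are chosen for illustration, and it reads package metadata only, so nothing needs to import successfully.

```python
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("nequip", "torch", "numpy", "wandb")):
    """Collect interpreter, OS, and package versions into a dict."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```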

My steps were as follows the last time I tried:

Create new environment in anaconda navigator (python 3.9.19)

Install the cmd.exe prompt in that environment from anaconda navigator

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113 (following instructions on pytorch website)

Installed CUDA Toolkit 11.3 and set this version in the system environment variables (for compatibility with this torch version)

git clone https://github.com/mir-group/nequip.git
cd nequip
pip install .

pip install wandb

conda install numpy=1.20.3 (got a different warning using the latest version of numpy)

Restarted my computer (I think this is needed for the CUDA driver change to take effect, but I'm not sure)

nequip-train configs/minimal.yaml

I also tried different Python and torch versions and kept running into the same issue; this method only fixed a separate problem where my GPU wasn't recognized.

After following the steps specified above I think I ran into this error:

c:\users\thoma\nequip\nequip\utils\_global_options.py:59: UserWarning: !! Upstream issues in PyTorch versions >1.11 have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. At present we strongly recommend the use of PyTorch 1.11 if using CUDA devices; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
  warnings.warn(
Traceback (most recent call last):
  File "\\?\C:\Users\Thoma\anaconda3\envs\nequip\Scripts\nequip-train-script.py", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\thoma\nequip\nequip\scripts\train.py", line 83, in main
    trainer = fresh_start(config)
              ^^^^^^^^^^^^^^^^^^^
  File "c:\users\thoma\nequip\nequip\scripts\train.py", line 182, in fresh_start
    config = init_n_update(config)
             ^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\thoma\nequip\nequip\utils\wandb.py", line 23, in init_n_update
    wandb.init(
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_init.py", line 1195, in init
    wandb._sentry.reraise(e)
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\analytics\sentry.py", line 155, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_init.py", line 1180, in init
    wi.setup(kwargs)
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_init.py", line 189, in setup
    self._wl = wandb_setup.setup(settings=setup_settings)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_setup.py", line 325, in setup
    ret = _setup(settings=settings)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_setup.py", line 318, in _setup
    wl = _WandbSetup(settings=settings)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_setup.py", line 303, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_setup.py", line 108, in __init__
    self._settings = self._settings_setup(settings, self._early_logger)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_setup.py", line 138, in _settings_setup
    s._infer_run_settings_from_environment(_logger=early_logger)
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_settings.py", line 1788, in _infer_run_settings_from_environment
    program_relpath = self.program_relpath or _get_program_relpath(
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Thoma\anaconda3\envs\nequip\Lib\site-packages\wandb\sdk\wandb_settings.py", line 188, in _get_program_relpath
    relative_path = os.path.relpath(full_path_to_program, start=root)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "", line 766, in relpath
ValueError: path is on mount '\\?\C:', start on mount 'C:'

when running example.yaml, and got the same error as before for minimal.yaml. It looks like the first one has something to do with how wandb was installed?
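The ValueError itself can be reproduced without wandb. The sketch below uses Python's ntpath module (Windows path rules, importable on any OS) to show that relpath refuses to relate a path carrying the \\?\ extended-length prefix to a start directory without it, because the two are treated as different mounts. The script path is modeled on the traceback; the root directory and both helper names are assumptions made for illustration, and the prefix-stripping step is only a demonstration, not a patch for wandb.

```python
import ntpath  # Windows path semantics, importable on any OS

# Paths modeled on the traceback above; the exact values are illustrative.
script = r"\\?\C:\Users\Thoma\anaconda3\envs\nequip\Scripts\nequip-train-script.py"
root = r"C:\Users\Thoma\nequip"  # assumed start directory on plain 'C:'

EXTENDED_PREFIX = "\\\\?\\"  # the four characters: backslash backslash ? backslash


def try_relpath(path, start):
    """Return the relative path, or the error text if the mounts differ."""
    try:
        return ntpath.relpath(path, start=start)
    except ValueError as exc:
        return f"ValueError: {exc}"


def strip_extended_prefix(path):
    """Drop a leading extended-length prefix so both paths use the same drive form."""
    if path.startswith(EXTENDED_PREFIX):
        return path[len(EXTENDED_PREFIX):]
    return path


if __name__ == "__main__":
    # Mismatched drive forms: relpath raises, as in the wandb traceback.
    print(try_relpath(script, root))
    # Same drive form on both sides: relpath succeeds.
    print(try_relpath(strip_extended_prefix(script), root))
```

This suggests the failure depends on how the entry-point script path was formed on Windows rather than on the wandb installation being broken, although that reading is an inference from the traceback alone.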

Thanks for the quick reply, I appreciate your help.

Linux-cpp-lisp commented 5 days ago

I'm not sure of this exact error, but in general we do not support Windows systems except inside of Windows Subsystem for Linux (WSL).

cw-tan commented 5 days ago

Yea, you could try setting up WSL (instructions here) and conda (instructions here).