Open tft225 opened 6 days ago
Hi @tft225
Thank you for your interest in NequIP. On my WSL, the following works.
git clone https://github.com/mir-group/nequip.git
cd nequip
conda create -n nequip python=3.11
conda activate nequip
pip install torch
pip install -e .
pip install wandb
nequip-train configs/example.yaml
Running nequip-train configs/minimal.yaml also works.
Could you share the exact steps you took? That would help us figure this out. It may also help to delete the training directory before starting a new training run, so that leftovers from earlier runs do not get mixed up with the other factors you are debugging. Looking at the stack trace
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 25, in _delete_files_if_exist
f.unlink(missing_ok=True)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\pathlib.py", line 1325, in unlink
self._accessor.unlink(self)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q'
the error is raised from pathlib. Which version of Python (pathlib ships with it as part of the standard library) were you using in one of the cases that failed with this error?
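To make the version question above easier to answer, here is a minimal sketch that prints the interpreter, OS, and package versions from inside the activated environment. The helper names `package_version` and `collect_env_report` are made up for this example and are not part of NequIP; the package list is just the set of packages mentioned in this thread.

```python
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_env_report(packages=("nequip", "torch", "numpy", "wandb")) -> dict:
    """Gather interpreter, OS, and package versions (hypothetical helper)."""
    report = {
        # pathlib is part of the standard library, so this is also its version.
        "python": platform.python_version(),
        "executable": sys.executable,
        "os": f"{platform.system()} {platform.release()}",
    }
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running it with the same interpreter that runs nequip-train (check the printed executable path) and pasting the output into the issue should be enough to pin down the environment.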
The last time I tried I was on Python 3.9.19. It would have been whatever automatically installed when I installed nequip; I've since removed the environment, sorry. I can try again and check the version if that's necessary.
My steps were as follows the last time I tried:
Create new environment in anaconda navigator (python 3.9.19)
Install the cmd.exe prompt in that environment from anaconda navigator
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113 (following instructions on pytorch website)
Installed Cuda Toolkit 11.3 and set this version in system environment variables (for compatibility with this torch version)
git clone https://github.com/mir-group/nequip.git
cd nequip
pip install .
pip install wandb
conda install numpy=1.20.3 (got a different warning using the latest version of numpy)
Restarted my computer (I think this is needed for the cuda driver change to take effect but I'm not sure)
nequip-train configs/minimal.yaml
I also tried different Python and torch versions and kept running into the same issue; this method only fixed a separate problem where my GPU wasn't recognized.
After following the steps specified above I think I ran into this error:
c:\users\thoma\nequip\nequip\utils\_global_options.py:59: UserWarning: !! Upstream issues in PyTorch versions >1.11 have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. At present we strongly recommend the use of PyTorch 1.11 if using CUDA devices; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
warnings.warn(
Traceback (most recent call last):
File "\\?\C:\Users\Thoma\anaconda3\envs\nequip\Scripts\nequip-train-script.py", line 33, in
when running example.yaml, and got the same error as before for minimal.yaml. It looks like the first one has something to do with how wandb was installed?
Thanks for the quick reply, I appreciate your help.
I'm not sure of this exact error, but in general we do not support Windows systems except inside of Windows Subsystem for Linux (WSL).
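For anyone unsure which of the two situations they are in, a small sketch like the following can tell native Windows apart from WSL. The function name `describe_platform` is made up for this example, and the WSL check is a heuristic that assumes WSL kernels report "microsoft" in their release string; it is not an official detection method.

```python
import platform


def describe_platform(system: str, release: str) -> str:
    """Classify the host as native Windows, WSL, or something else.

    Assumption: WSL kernels include "microsoft" in the release string
    (for example "...-microsoft-standard-WSL2"). This is a heuristic only.
    """
    if system == "Windows":
        return "native Windows"
    if system == "Linux" and "microsoft" in release.lower():
        return "WSL"
    return system or "unknown"


if __name__ == "__main__":
    # platform.system() is "Windows" in cmd.exe or an Anaconda Prompt,
    # and "Linux" inside a WSL shell.
    print(describe_platform(platform.system(), platform.release()))
```

The paths in the tracebacks in this thread (C:\Users\...\anaconda3\...) indicate a native Windows Python rather than one running inside WSL.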
Describe the bug
When trying to run the minimal.yaml test, I get this output:
Processing dataset...
Loaded data: Batch(atomic_numbers=[21000, 1], batch=[21000], cell=[1000, 3, 3], edge_cell_shift=[220186, 3], edge_index=[2, 220186], forces=[21000, 3], pbc=[1000, 3], pos=[21000, 3], ptr=[1001], total_energy=[1000, 1])
processed data size: ~9.77 MB
Traceback (most recent call last):
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\shutil.py", line 791, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q' -> 'results\aspirin\processed_dataset_afe51556e8a832da62377bded6857e80a9523c1b\.tmp-data.pth~'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 39, in _process_moves
shutil.move(from_name, tmp_path)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\shutil.py", line 812, in move
os.unlink(src)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\auto_init.py", line 243, in instantiate
instance = builder(**positional_args, **final_optional_args)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_npz_dataset.py", line 81, in __init__
super().__init__(
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_base_datasets.py", line 152, in __init__
super().__init__(root=root, type_mapper=type_mapper)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_base_datasets.py", line 43, in __init__
super().__init__(root=root, transform=type_mapper)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\torch_geometric\dataset.py", line 91, in __init__
self._process()
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\torch_geometric\dataset.py", line 176, in _process
self.process()
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_dataset\_base_datasets.py", line 280, in process
torch.save((data, self.include_frames), f)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\contextlib.py", line 120, in __exit__
next(self.gen)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 182, in atomic_write
_submit_move(Path(tp.name), Path(fname), blocking=blocking)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 128, in _submit_move
_process_moves([obj])
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 43, in _process_moves
_delete_files_if_exist([m[1] for m in moves])
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\savenload.py", line 25, in _delete_files_if_exist
f.unlink(missing_ok=True)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\pathlib.py", line 1325, in unlink
self._accessor.unlink(self)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Thoma\AppData\Local\Temp\tmpq1_2yq_q'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Thoma\anaconda3\envs\NequIP\Scripts\nequip-train.exe\__main__.py", line 7, in
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\scripts\train.py", line 83, in main
trainer = fresh_start(config)
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\scripts\train.py", line 196, in fresh_start
dataset = dataset_from_config(config, prefix="dataset")
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\data\_build.py", line 78, in dataset_from_config
instance, _ = instantiate(
File "C:\Users\Thoma\anaconda3\envs\NequIP\lib\site-packages\nequip\utils\auto_init.py", line 245, in instantiate
raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`
To Reproduce
Try to install NequIP in a new environment created through Anaconda Navigator and run the given test:
$ nequip-train configs/example.yaml
Expected behavior
The minimal.yaml training runs.
Environment:
Additional context
I tried multiple new environments with different python/torch/cuda versions created through Anaconda; all showed the same issue.
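As background on the PermissionError in the traceback: on Windows, a file generally cannot be renamed or deleted while another handle to it is still open, whereas Linux (including WSL) allows it. One plausible reading of the traceback, which is an assumption and not a confirmed diagnosis, is that the temporary file is still open when it is moved. The sketch below is not NequIP's implementation; `atomic_write_bytes` is a hypothetical helper that shows the general write-then-replace pattern with the handle closed before the move.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target, data: bytes) -> Path:
    """Write data to a temp file next to target, then replace target with it.

    Hypothetical helper for illustration only; not part of NequIP.
    """
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    # delete=False so the file survives being closed; we move it ourselves.
    tmp = tempfile.NamedTemporaryFile(dir=target.parent, suffix=".tmp", delete=False)
    tmp_path = Path(tmp.name)
    try:
        with tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The handle is closed at this point, so Windows permits the move.
        os.replace(tmp_path, target)
    except BaseException:
        # Clean up the temporary file if writing or moving failed.
        tmp_path.unlink(missing_ok=True)
        raise
    return target


if __name__ == "__main__":
    out = atomic_write_bytes(Path("results") / "demo" / "data.bin", b"hello")
    print(out.read_bytes())
```

Creating the temporary file in the target's own directory keeps the final os.replace on a single filesystem, which avoids the copy-then-delete fallback that shutil.move uses across drives.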