usnistgov / alignn

Atomistic Line Graph Neural Network https://scholar.google.com/citations?user=9Q-tNnwAAAAJ&hl=en
https://jarvis.nist.gov/jalignn/

Does ALIGNNTL work with ALIGNN 2024.2.4? #155

Open antonf-ekb opened 3 months ago

antonf-ekb commented 3 months ago

Dear developers! Actually, my request belongs to this repository for the ALIGNN transfer learning project (ALIGNNTL), but since I have not received a reply on the issue there, I thought I could try to obtain one here. I have an installed and properly working ALIGNN v2024.2.4. I cloned the ALIGNNTL repository and am trying to reproduce the FineTuning example. First of all, the train_folder.py script (which is the suggested entry point) does not contain the all_models = {...} dictionary required for transfer learning. I then found that the train.py script does contain this code, so I tried to run

python alignn/train.py --root_dir "../examples" --config "../examples/config_example.json" --id_prop_file "id_prop.csv" --output_dir=model

but got the following error:

    from .named_optimizer import _NamedOptimizer
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py", line 11, in <module>
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/fsdp/__init__.py", line 1, in <module>
    from ._flat_param import FlatParameter as FlatParameter
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 30, in <module>
    from torch.distributed.fsdp._common_utils import (
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/fsdp/_common_utils.py", line 35, in <module>
    from torch.distributed.fsdp._fsdp_extensions import FSDPExtensions
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/fsdp/_fsdp_extensions.py", line 8, in <module>
    from torch.distributed._tensor import DeviceMesh, DTensor
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/_tensor/__init__.py", line 6, in <module>
    import torch.distributed._tensor.ops
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/_tensor/ops/__init__.py", line 2, in <module>
    from .embedding_ops import *  # noqa: F403
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/_tensor/ops/embedding_ops.py", line 8, in <module>
    import torch.distributed._functional_collectives as funcol
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 12, in <module>
    from . import _functional_collectives_impl as fun_col_impl
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/distributed/_functional_collectives_impl.py", line 36, in <module>
    from torch._dynamo import assume_constant_result
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
    from . import convert_frame, eval_frame, resume_execution
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 40, in <module>
    from . import config, exc, trace_rules
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/_dynamo/exc.py", line 11, in <module>
    from .utils import counters
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 5, in <module>
    import cProfile
  File "/home/anton/miniconda3/envs/alignn/lib/python3.10/cProfile.py", line 23, in <module>
    run.__doc__ = _pyprofile.run.__doc__
AttributeError: module 'profile' has no attribute 'run'
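
The final AttributeError typically means Python is importing something other than the standard library's profile module: the stdlib cProfile does import profile as _pyprofile and reads _pyprofile.run, so any shadowing profile module without a run() function breaks it. A minimal diagnostic sketch, using only standard library calls:

# Locate the module that "import profile" would actually load; if the origin
# is not the stdlib path (.../lib/python3.10/profile.py), something on
# sys.path is shadowing the standard library module.
import importlib.util

spec = importlib.util.find_spec("profile")
print(spec.origin)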

From setup.py I see that the expected version of ALIGNN is 2021.11.16. Is this version mandatory, and could it be the cause of the problems? If so, are you planning to update the ALIGNNTL code to support the current version of ALIGNN?

Best regards, Anton.

bdecost commented 3 months ago

hi - I think the ALIGNNTL project is more of a fork of ALIGNN than a project that depends on this repository. There is a copy of the alignn source tree under ALIGNNTL/FineTuning

so to run the demo that you are asking about, I think you should do something like this

git clone https://github.com/NU-CUCIS/ALIGNNTL
cd ALIGNNTL/FineTuning
python alignn/train.py --root_dir "../examples" --config "../examples/config_example.json" --id_prop_file "id_prop.csv" --output_dir=model

this would run the training script in the ALIGNNTL repository (https://github.com/NU-CUCIS/ALIGNNTL/blob/main/FineTuning/alignn/train.py), not the training entry point in the v2024.2.4 ALIGNN that you have installed
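
as a quick sanity check (a sketch; alignn.__file__ is just standard module metadata), you can confirm which copy of the code Python actually imports:

# Print where "import alignn" resolves from; when run from the ALIGNNTL
# source tree, this should point into FineTuning/alignn/, not site-packages.
import alignn
print(alignn.__file__)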

maybe @GuptaVishu2002 can confirm

antonf-ekb commented 3 months ago

Hi, thank you for your answer! However, when I try to run train.py as you suggest, it requires ALIGNN 2021.11.16, as can be seen here. And when trying to install that version of ALIGNN, it requires dgl-cu101:

ERROR: Could not find a version that satisfies the requirement dgl-cu101>=0.6.0 (from alignn) (from versions: none)
ERROR: No matching distribution found for dgl-cu101>=0.6.0

which I wasn't able to install with either pip or conda.

Best regards, Anton.

bdecost commented 3 months ago

if you mean this line, that is not a package dependency; it's the version of alignn that this project was forked from

from what I can tell, ALIGNNTL is not really meant to be an installable package, it's more research code that seems meant to be run directly from the source directory

I think if you want to reproduce the ALIGNNTL methods, the most straightforward thing would be to create a fresh environment and install the dependencies from the ALIGNNTL list. I recommend first installing the stable releases of PyTorch and DGL following the instructions for each of those packages (to ensure you get the right CUDA versions and such), then installing the other dependencies with

python -m pip install numpy scipy Jarvis-tools scikit-learn matplotlib tqdm pandas pytorch-ignite "pydantic<2.0" flake8 pycodestyle pydocstyle pyparsing

you have to pin pydantic to v1 since pydantic 2.x has breaking changes that affect the version of alignn this project is forked from

don't install the alignn package from this repository, since you will be directly running the ALIGNNTL version https://github.com/NU-CUCIS/ALIGNNTL/blob/main/FineTuning/alignn/train.py

then you can run the example like I suggested before. I had to make sure ALIGNNTL/FineTuning was on my PYTHONPATH to run this

git clone https://github.com/NU-CUCIS/ALIGNNTL
cd ALIGNNTL/FineTuning
PYTHONPATH=. python alignn/train.py --root_dir "../examples" --config "../examples/config_example.json" --id_prop_file "id_prop.csv" --output_dir=model

hopefully that works for you; the alternative is a bit more work: figuring out the cleanest way to incorporate the functionality of this project back into the upstream alignn repo

antonf-ekb commented 3 months ago

Thank you for the tip regarding PYTHONPATH; running the script this way indeed does not require ALIGNN to be installed. I created a completely fresh environment and installed the required packages; however, the script ends up with the following error:

Current run is terminating due to exception: /opt/dgl/src/runtime/c_runtime_api.cc:88: Check failed: allow_missing: Device API gpu is not enabled. Please install the cuda version of dgl.
[bt] (0) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f16d9c1713f]
[bt] (1) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::DeviceAPIManager::GetAPI(std::string, bool)+0x374) [0x7f16da2f45c4]
[bt] (2) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::DeviceAPI::Get(DLContext, bool)+0x1f4) [0x7f16da2ee2a4]
[bt] (3) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DLDataType, DLContext)+0x334) [0x7f16da30fe54]
[bt] (4) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::CopyTo(DLContext const&) const+0xc0) [0x7f16da346cc0]
[bt] (5) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::aten::COOMatrix::CopyTo(DLContext const&) const+0x7d) [0x7f16da43734d]
[bt] (6) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&)+0x292) [0x7f16da427a02]
[bt] (7) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&)+0xf5) [0x7f16da357ee5]
[bt] (8) /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/libdgl.so(+0x9653ab) [0x7f16da3653ab]

However, if I run pip list | grep dgl it gives

dgl        0.6.0
dgl-cu101  0.6.1

so it looks like the CUDA version of DGL is installed. What is also not clear to me is why the error message first refers to the path /opt/dgl/ and then to /home/anton/miniconda3/envs/alignn_tl/lib/python3.8/site-packages/dgl/.

In addition, I've checked the availability of CUDA by running torch.cuda.is_available(), which returned True.
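
A slightly fuller sanity check would exercise DGL's GPU path directly, since that is where the failure above occurs. A sketch using standard torch/dgl calls (the tiny graph is a throwaway example):

# Check which dgl build was imported and whether it can move a graph to the
# GPU; a CPU-only build raises the same "Device API gpu is not enabled" error.
import torch
import dgl

print(dgl.__version__, dgl.__file__)   # version and import location of dgl
g = dgl.graph(([0, 1], [1, 2]))        # throwaway 3-node graph
g = g.to(torch.device("cuda"))         # fails here if the CPU build is loaded
print(g.device)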

bdecost commented 3 months ago

I'm not sure what to make of the /opt path; does /opt/dgl/ exist on your system? Maybe it's something to do with the system that the package was compiled on?

I noticed that you seem to have both dgl and dgl-cu101 installed; maybe it is loading the first one? The point releases differ, so you could check which one is loaded by printing dgl.__version__.

If that's the problem, it might work if you remove the CPU version of the package.

If you want to use more current library versions, I would recommend installing the stable PyTorch with conda first, then installing the stable version of dgl with the matching CUDA version

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c dglteam/label/th23_cu118 dgl

then install the other dependencies

I've found that it can take a bit of experimenting to make sure the PyTorch, dgl, and CUDA versions are all compatible, and sometimes there are point releases of dgl that you have to avoid, so you might need to downgrade dgl by a few point releases. I have an environment that I know works with torch 2.1.2 and dgl 2.0.0+cu118, but I don't think I've tried the current dgl release yet.
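
For reference, a quick sketch for printing the version combination in the active environment when debugging such mismatches (torch.version.cuda is standard PyTorch API; pip-installed dgl wheels encode the CUDA build in the version string, e.g. 2.0.0+cu118):

# Print the torch/dgl/CUDA combination of the active environment.
import torch
import dgl

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("dgl:  ", dgl.__version__)            # pip wheels carry a +cuXXX suffix
print("cuda available:", torch.cuda.is_available())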