kashif / pytorch-transformer-ts

Repository of Transformer based PyTorch Time Series Models

A working software environment for lag-llama #29

Open hohe12ly opened 9 months ago

hohe12ly commented 9 months ago

I reported in another issue that the most recent pytorch-lightning does not work with lag-llama. I also tried a few version combinations of pytorch, pytorch-lightning, and gluonts. Eventually I got the code to run for 385 epochs with the following requirements.txt:

orjson
torch==2.0.0
gluonts==0.13.5
pytorch-lightning==1.9.5
datasets
xformers
git+https://github.com/kashif/hopfield-layers@pytorch-2
etsformer-pytorch
reformer_pytorch
einops
opt_einsum
pykeops
scipy
apex
git+https://github.com/microsoft/torchscale
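
After installing, a quick way to confirm the pins actually resolved is a minimal sketch like the one below (my addition; expected versions are just the ones pinned above, and note that the import name pytorch_lightning differs from the pip name pytorch-lightning):

# Sanity-check the installed versions against the pins above
import torch
import gluonts
import pytorch_lightning

print("torch:", torch.__version__)                          # expect 2.0.0
print("gluonts:", gluonts.__version__)                      # expect 0.13.5
print("pytorch-lightning:", pytorch_lightning.__version__)  # expect 1.9.5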

But the run still failed with a divide-by-zero error in gluonts. Before I try more combinations, I thought it would be more efficient to ask here: could you share a working requirements.txt with version numbers specified?

BTW, the error I got with my requirements.txt is:

Epoch 385: : 110it [00:23,  4.66it/s, loss=-0.64, v_num=0, val_loss=-.690, train_loss=-1.10]Epoch 385, global step 38600: 'val_loss' was not in top 1

Epoch 385: : 110it [00:23,  4.65it/s, loss=-0.64, v_num=0, val_loss=-.690, train_loss=-1.10]
Use checkpoint: /home/lagllama_test/test/pytorch-transformer-ts/lag-llama/model-size-scaling-logs/0/experiments/lightning_logs/version_0/checkpoints/epoch=335-step=33600.ckpt
Predict on m4_weekly
m4_weekly prediction length: 13

Running evaluation:   0%|          | 0/359 [00:00<?, ?it/s]
Running evaluation: 100%|██████████| 359/359 [00:00<00:00, 81024.28it/s]
logger.log_dir :  /home/lagllama_test/test/pytorch-transformer-ts/lag-llama/model-size-scaling-logs/0/experiments/lightning_logs/version_0
os.path.exists(logger.log_dir) :  True
Predict on traffic
traffic prediction length: 24

Running evaluation:   0%|          | 0/6034 [00:00<?, ?it/s]
Running evaluation: 100%|██████████| 6034/6034 [00:00<00:00, 1090128.80it/s]
/home/lagllama_test/conda/envs/lagllama/lib/python3.10/site-packages/gluonts/evaluation/_base.py:422: RuntimeWarning: divide by zero encountered in scalar divide
  metrics["ND"] = cast(float, metrics["abs_error"]) / cast(
(the two lines above repeat four times in the log)
/home/lagllama_test/conda/envs/lagllama/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:134: UserWarning: Warning: converting a masked element to nan.
  return arr.astype(dtype, copy=True)

Thanks a lot.

Yan

ashok-arjun commented 9 months ago

Hey @hohe12ly, I'll look into this soon and get back!

hohe12ly commented 9 months ago

BTW, my software environment is CentOS 8 with NVIDIA V100S-PCIE-32GB GPUs, in case that info is helpful.

ashok-arjun commented 8 months ago

Hi, sorry for the delay.

It seems you're not getting errors, only warnings. Those warnings are normal.
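
For context, my reading of the gluonts source (an assumption based on the traceback above, not something verified in this thread): the warning comes from the ND (normalized deviation) metric in gluonts/evaluation/_base.py, which divides abs_error by abs_target_sum, so any window whose targets sum to zero divides by zero and yields inf rather than crashing. A minimal numpy reproduction:

import numpy as np

# Hypothetical values: abs_target_sum is 0 for an all-zero target window
abs_error = np.float64(1.0)
abs_target_sum = np.float64(0.0)

# Emits "RuntimeWarning: divide by zero encountered in scalar divide"
# and evaluates to inf, mirroring the warnings in the log above.
nd = abs_error / abs_target_sum
print(nd)  # inf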

Anyway, here is what I use for my requirements.txt:

gluonts==0.13.3
numpy==1.23.5
pytorch_lightning==2.0.4
torch==2.0.0+cu118
wandb
scipy

hohe12ly commented 8 months ago

Thanks, Arjun. I tested your configuration and it works. I still see the divide-by-zero warnings, but as you mentioned, they're normal.

Since torch==2.0.0+cu118 is no longer available on pip's default index, I had to modify requirements.txt so that pip install pulls the CUDA 11.8 wheels from the PyTorch index:

--index-url https://download.pytorch.org/whl/cu118
--extra-index-url https://pypi.org/simple
torch==2.0.0
torchvision==0.15.1
torchaudio==2.0.1
numpy==1.23.5
gluonts==0.13.3
pytorch_lightning==2.0.4
datasets
xformers
git+https://github.com/kashif/hopfield-layers@pytorch-2
etsformer-pytorch
reformer_pytorch
einops
opt_einsum
pykeops
scipy
apex
git+https://github.com/microsoft/torchscale
wandb
orjson
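
To confirm the CUDA 11.8 build was actually picked up from the PyTorch index, a minimal check (my addition; the expected values assume the V100S GPU node described above):

import torch

print(torch.__version__)          # expect 2.0.0+cu118
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # expect True on a GPU node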