jkulhanek / wild-gaussians

[NeurIPS'24] WildGaussians: 3D Gaussian Splatting In the Wild
https://wild-gaussians.github.io

train error #30

Open luocha0107 opened 1 month ago

luocha0107 commented 1 month ago

I used a COLMAP dataset.

(flow_map) root@autodl-container-f34d45a126-5b9fd9ad:~/data_user/ysl/wild-gaussians# nerfbaselines train --method wild-gaussians --data datasets/0729_powertower_radial/
info: Using method: wild-gaussians, backend: python
info: Loading train dataset
info: Detecting dataset format from path: /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial
info: Colmap dataloader is using LLFF split with 207 training and 30 test images
info: Loaded unknown dataset from path /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial using loader colmap
info: Loading images from /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial/images
loading images: 100%|██████████| 207/207 [00:40<00:00, 5.14it/s]
info: Loading eval dataset
info: Detecting dataset format from path: /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial
info: Colmap dataloader is using LLFF split with 207 training and 30 test images
info: Loaded unknown dataset from path /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial using loader colmap
info: Loading images from /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial/images
loading images: 100%|██████████| 30/30 [00:05<00:00, 5.17it/s]
warning: Dataset ID not specified, dataset-specific config overrides may not be applied
info: Active presets:
info: Using config overrides: {}
info: Loading config file default.yml
info: using MLP layer as FFN
Traceback (most recent call last):
  File "/root/miniconda3/envs/flow_map/bin/nerfbaselines", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_common.py", line 499, in invoke
    return super().invoke(ctx)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_common.py", line 440, in wrapped
    raise e
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_common.py", line 433, in wrapped
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_train.py", line 128, in train_command
    method = method_cls(
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 1689, in __init__
    self._setup_train(train_dataset, load_state_dict)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 1700, in _setup_train
    th_cameras = train_dataset["cameras"].apply(lambda x, _: torch.from_numpy(x).cuda())
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/_types.py", line 247, in apply
    poses=fn(self.poses, "poses"),
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 1700, in <lambda>
    th_cameras = train_dataset["cameras"].apply(lambda x, _: torch.from_numpy(x).cuda())
TypeError: expected np.ndarray (got numpy.ndarray)

jkulhanek commented 1 month ago

Hi, can you please try running:

nerfbaselines shell --method wild-gaussians

And running:

nerfbaselines train --method wild-gaussians --data datasets/0729_powertower_radial/

Could you also post the output of pip list, run both from your outer environment and from inside the nerfbaselines shell?
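For comparing the two environments, a minimal sketch that prints only the versions of the packages most relevant to this thread. It uses the standard library's `importlib.metadata`; the helper names `installed_versions` and `major_version` and the choice of packages are assumptions made for illustration, not part of nerfbaselines.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def major_version(version):
    """Return the leading integer of a version string such as '1.26.3'."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    # Hypothetical selection of packages; extend the list as needed.
    found = installed_versions(["numpy", "torch", "nerfbaselines"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

Running the same script in both environments makes a version mismatch easy to spot without reading the full pip list.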

luocha0107 commented 1 month ago

OK, I will try later and post the package list. Thanks for the reply.


luocha0107 commented 1 month ago

When I run nerfbaselines shell --method wild-gaussians:

(flow_map) root@autodl-container-f34d45a126-5b9fd9ad:~/data_user/ysl/wild-gaussians# nerfbaselines shell --method wild-gaussians
info: Using method: wild-gaussians, backend: python
Traceback (most recent call last):
  File "/root/miniconda3/envs/flow_map/bin/nerfbaselines", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_common.py", line 499, in invoke
    return super().invoke(ctx)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_shell.py", line 23, in shell_command
    backend_impl.shell(command if command else None)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/backends/_common.py", line 198, in shell
    raise NotImplementedError("shell not implemented")
NotImplementedError: shell not implemented

Then, when I run nerfbaselines train --method wild-gaussians --data datasets/0729_powertower_radial/:

(flow_map) root@autodl-container-f34d45a126-5b9fd9ad:~/data_user/ysl/wild-gaussians# nerfbaselines train --method wild-gaussians --data datasets/0729_powertower_radial/
info: Using method: wild-gaussians, backend: python
info: Loading train dataset
info: Detecting dataset format from path: /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial
info: Colmap dataloader is using LLFF split with 207 training and 30 test images
info: Loaded unknown dataset from path /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial using loader colmap
info: Loading images from /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial/images
loading images: 100%|██████████| 207/207 [00:42<00:00, 4.92it/s]
info: Loading eval dataset
info: Detecting dataset format from path: /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial
info: Colmap dataloader is using LLFF split with 207 training and 30 test images
info: Loaded unknown dataset from path /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial using loader colmap
info: Loading images from /root/data_user/ysl/wild-gaussians/datasets/0729_powertower_radial/images
loading images: 100%|██████████| 30/30 [00:05<00:00, 5.16it/s]
warning: Dataset ID not specified, dataset-specific config overrides may not be applied
info: Active presets:
info: Using config overrides: {}
info: Loading config file default.yml
info: using MLP layer as FFN
Generating skybox: 100%|██████████| 207/207 [00:00<00:00, 1357.67it/s]
info: Adding skybox with 49818 points
Number of points at initialisation : 108904
info: Output directory: /root/data_user/ysl/wild-gaussians
info: Initialized loggers: tensorboard
training: 0%| | 0/70000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/miniconda3/envs/flow_map/bin/nerfbaselines", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_common.py", line 499, in invoke
    return super().invoke(ctx)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_common.py", line 440, in wrapped
    raise e
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_common.py", line 433, in wrapped
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/cli/_train.py", line 148, in train_command
    trainer.train()
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/training.py", line 764, in train
    metrics = self.train_iteration()
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/nerfbaselines/training.py", line 694, in train_iteration
    metrics = self.method.train_iteration(self.step)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 1930, in train_iteration
    uncertainty_loss, metrics, loss_mult = self.model.uncertainty_model.get_loss(gt_image, image_toned.detach(), _cache_entry=('train', camera_id))
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 438, in get_loss
    loss, metrics, loss_mult = self._compute_losses(gt_torch, image, prefix, _cache_entry=_cache_entry)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 364, in _compute_losses
    uncertainty = self(self._scale_input(gt, self.config.uncertainty_dino_max_size), _cache_entry=_cache_entry)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 330, in forward
    return self._forward_uncertainty_features(image, _cache_entry=_cache_entry)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 307, in _forward_uncertainty_features
    x = self._get_dino_cached(inputs, _cache_entry)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/method.py", line 260, in _get_dino_cached
    x = self.backbone.get_intermediate_layers(x, n=[self.backbone.num_heads-1], reshape=True)[-1]
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/dinov2.py", line 800, in get_intermediate_layers
    outputs = self._get_intermediate_layers_not_chunked(x, n)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/dinov2.py", line 769, in _get_intermediate_layers_not_chunked
    x = blk(x)
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/dinov2.py", line 508, in forward
    return super().forward(x_or_x_list)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/dinov2.py", line 351, in forward
    x = x + attn_residual_func(x)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/dinov2.py", line 330, in attn_residual_func
    return self.ls1(self.attn(self.norm1(x)))
  File "/root/miniconda3/envs/flow_map/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/dinov2.py", line 244, in forward
    return super().forward(x)
  File "/root/data_user/ysl/wild-gaussians/wildgaussians/dinov2.py", line 228, in forward
    attn = q @ k.transpose(-2, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 37.66 GiB (GPU 0; 23.68 GiB total capacity; 2.39 GiB already allocated; 20.82 GiB free; 2.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The pip list output is as follows:

(flow_map) root@autodl-container-f34d45a126-5b9fd9ad:~/data_user/ysl/wild-gaussians# pip list
Package Version Editable project location
absl-py 2.1.0
aiohappyeyeballs 2.4.0
aiohttp 3.10.5
aiosignal 1.3.1
antlr4-python3-runtime 4.9.3
anyio 4.6.2.post1
asttokens 2.4.1
attrs 24.2.0
beartype 0.18.5
beautifulsoup4 4.12.3
black 24.8.0
Brotli 1.0.9
certifi 2024.8.30
chardet 5.2.0
charset-normalizer 2.1.1
click 8.1.7
colorlog 6.8.2
contourpy 1.3.0
cycler 0.12.1
dacite 1.8.1
decorator 5.1.1
diff_gaussian_rasterization 0.0.0 /root/data_user/ysl/wild-gaussians/submodules/diff-gaussian-rasterization
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.8.0
embreex 2.17.7.post5
executing 2.1.0
filelock 3.13.1
flow_vis_torch 0.1
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.9.0
gdown 5.2.0
gitdb 4.0.11
GitPython 3.1.43
gmpy2 2.1.2
grpcio 1.67.0
h11 0.14.0
httpcore 1.0.6
httpx 0.27.2
huggingface-hub 0.25.0
hydra-core 1.3.2
idna 3.10
imageio 2.36.0
ipython 8.28.0
jaxtyping 0.2.34
jedi 0.19.1
Jinja2 3.1.4
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
kiwisolver 1.4.7
lazy_loader 0.4
lightning 2.4.0
lightning-utilities 0.11.7
lxml 5.3.0
manifold3d 2.5.1
mapbox_earcut 1.0.2
Markdown 3.7
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.9.0
matplotlib-inline 0.1.7
mdurl 0.1.2
mediapy 1.2.2
mkl_fft 1.3.10
mkl_random 1.2.7
mkl-service 2.4.0
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
mypy-extensions 1.0.0
nerfbaselines 1.2.5
networkx 3.2.1
nodeenv 1.9.1
numpy 1.26.3
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.68
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
opencv-python 4.10.0.84
packaging 24.1
parso 0.8.4
pathspec 0.12.1
pexpect 4.9.0
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
plyfile 1.0.3
prompt_toolkit 3.0.48
protobuf 4.25.5
psutil 6.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
pycollada 0.8
Pygments 2.18.0
pyliblzfse 0.4.1
pyparsing 3.1.4
PySocks 1.7.1
python-dateutil 2.9.0.post0
pytorch-lightning 2.4.0
PyYAML 6.0.2
referencing 0.35.1
requests 2.28.1
rich 13.9.3
rpds-py 0.20.0
Rtree 1.3.0
ruff 0.6.7
safetensors 0.4.5
scikit-image 0.24.0
scipy 1.14.1
sentry-sdk 2.14.0
setproctitle 1.3.3
setuptools 69.5.1
shapely 2.0.6
shtab 1.7.1
simple_knn 0.0.0
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
soupsieve 2.6
splines 0.3.2
stack-data 0.6.3
svg.path 6.3
sympy 1.13.3
tensorboard 2.17.0
tensorboard-data-server 0.7.2
tifffile 2024.9.20
timm 1.0.9
torch 2.0.1
torchaudio 2.0.2
torchmetrics 1.4.2
torchvision 0.15.2
tqdm 4.66.4
traitlets 5.14.3
trimesh 4.5.1
triton 3.0.0
typeguard 2.13.3
typing_extensions 4.11.0
tyro 0.8.14
urllib3 1.26.20
vhacdx 0.0.8.post1
viser 0.1.34
wandb 0.18.1
wcwidth 0.2.13
websockets 13.1
Werkzeug 3.0.4
wheel 0.44.0
wildgaussians 0.3.0 /root/data_user/ysl/wild-gaussians
xatlas 0.0.9
xxhash 3.5.0
yarl 1.11.1
yourdfpy 0.0.56

My GPU is an RTX 3090 (24 GB). Judging from the terminal output, is it running out of memory?

jkulhanek commented 1 month ago

OK, I guess you did a local install? The training works now. I don't know what the issue was before. It seemed like you had installed NumPy 2.0, which isn't compatible with PyTorch, but that isn't the case (your pip list shows numpy 1.26.3). The OOM issue is perhaps caused by the images being too large. Try either disabling the uncertainty loss or downscaling your images.
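A rough back-of-the-envelope sketch of why image size matters here: the traceback fails at `attn = q @ k.transpose(-2, -1)`, and the size of that attention matrix grows with the square of the number of patch tokens, so roughly with the fourth power of the image side length. The function name `attention_matrix_bytes` and all of its defaults (14-pixel patches, 6 heads, float32, one class token) are assumptions for illustration; the actual backbone configuration and the size of the input it receives are not stated in this thread.

```python
def attention_matrix_bytes(height, width, patch_size=14, num_heads=6,
                           bytes_per_value=4, extra_tokens=1):
    """Estimate the size in bytes of one q @ k^T attention matrix in a ViT.

    All defaults are assumptions (14-pixel patches, 6 heads, float32 values,
    a single class token). Check them against the backbone actually in use.
    """
    tokens = (height // patch_size) * (width // patch_size) + extra_tokens
    return num_heads * tokens * tokens * bytes_per_value


if __name__ == "__main__":
    # Hypothetical input sizes, not taken from the dataset in this thread.
    for h, w in [(1400, 2100), (700, 1050)]:
        gib = attention_matrix_bytes(h, w) / 2**30
        print(f"{h}x{w}: {gib:.2f} GiB")
```

Under these assumptions, halving both image dimensions shrinks the matrix by about a factor of 16, while the number of images in the dataset does not enter the estimate at all.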

luocha0107 commented 1 month ago

OK, thank you.

jkulhanek commented 3 weeks ago

Is the issue resolved?

luocha0107 commented 3 weeks ago

No. I tried cutting the number of images in half, but it still ran out of memory. Next I will try reducing the image resolution.

luocha0107 commented 3 weeks ago

I solved the problem by downscaling the images, and it is running now. But it shows that training will take 70000 iterations and about a dozen hours. Will it stop automatically once the results become good during training?
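For readers who downscale by hand, a minimal sketch of how the target size and simple pinhole intrinsics scale together. The function name `downscale_camera`, the parameter `max_side`, and the pinhole parameters `fx`, `fy`, `cx`, `cy` are assumptions for illustration; distortion parameters and any half-pixel offset convention are left out, and whether the nerfbaselines COLMAP loader rescales cameras on its own is not stated in this thread.

```python
def downscale_camera(width, height, fx, fy, cx, cy, max_side):
    """Scale image size and pinhole intrinsics so the longer side <= max_side.

    Images that are already small enough are returned unchanged.
    """
    scale = min(1.0, max_side / max(width, height))
    new_w, new_h = round(width * scale), round(height * scale)
    # Per-axis factors keep the intrinsics consistent with the rounded size.
    sx, sy = new_w / width, new_h / height
    return new_w, new_h, fx * sx, fy * sy, cx * sx, cy * sy


if __name__ == "__main__":
    # Hypothetical camera, not taken from the dataset in this thread.
    print(downscale_camera(4000, 3000, 3200.0, 3200.0, 2000.0, 1500.0, 1000))
```

The resized images and the scaled intrinsics have to be used together; resizing only the image files would leave the camera model inconsistent with the pixels.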

jkulhanek commented 3 weeks ago

Hi, yes, I meant downscaling the images. The time estimate changes during training, but a dozen hours sounds like a lot. What GPU are you using?

luocha0107 commented 3 weeks ago

My GPU is an NVIDIA GeForce RTX 3090. I have now run into another problem: after training for 2000 iterations and downloading the AlexNet model weights, I got the following error. (Three screenshots of the error were attached.)

It seems to be a network connection problem; I have tried it a few times.

jkulhanek commented 3 weeks ago

Hi, can you please verify that you have internet access on the compute node from which you run the command?

luocha0107 commented 3 weeks ago

OK, thank you. I know why now.

zwl995 commented 1 week ago

How did you solve this problem? Or, in which folder should the manually downloaded model be placed?

luocha0107 commented 1 week ago

As I remember, the error came after the model was downloaded. The VPN on my server is not working, so I am not trying anymore.

jkulhanek commented 1 week ago

Hi, sorry for the late reply. I'm a bit busy, but I will look at it next week. Currently you need internet access for the evaluation because the model is not loaded from a cache but streamed directly. Next week I will add code that allows the model to be loaded from the cache, so you can download it there manually. In the meantime, you can disable evaluation during training (and evaluate locally after the training is finished). Would this resolve your issue? In that case, just set --eval-all-steps and --eval-few-steps to something like 1000000 so that the evaluation never runs.
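A small sketch of the workaround as a full command line, assembled in Python so it can be printed or passed to `subprocess.run`. The two `--eval-*` flags and the value 1000000 are taken from the comment above, and the data path is the one used earlier in this thread; the helper name `train_command` and its `never` parameter are assumptions for illustration.

```python
import shlex


def train_command(data_path, never=1_000_000):
    """Build a train command whose evaluation intervals exceed the run length."""
    return [
        "nerfbaselines", "train",
        "--method", "wild-gaussians",
        "--data", data_path,
        # Intervals larger than the 70000 training iterations reported in
        # the log, so evaluation should never be triggered during training.
        "--eval-all-steps", str(never),
        "--eval-few-steps", str(never),
    ]


if __name__ == "__main__":
    print(shlex.join(train_command("datasets/0729_powertower_radial/")))
```

Building the arguments as a list avoids shell quoting problems if the dataset path contains spaces.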