mlcommons / training_results_v4.0

This repository contains the results and code for the MLPerf™ Training v4.0 benchmark.
https://mlcommons.org/benchmarks/training

Nvidia Llama2_70B incorrect model hash #9

Open wahabk opened 1 week ago

wahabk commented 1 week ago

Hello, I'm getting an incorrect hash after downloading the model:

Traceback (most recent call last):
  File "~/repos/training_results_v4.0/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/scripts/download_model.py", line 28, in <module>
    assert directory_hash == "742093293d1c0c227cfe458365d32ab4"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
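The assertion only reports that the aggregate directory hash is wrong, not which file is off. Here is a minimal debugging sketch that prints a per-file MD5, which can help spot which file changes between download attempts. The `MODEL_DIR` path is a placeholder, and this does not reproduce the exact hashing scheme used by download_model.py:

```python
# Debugging sketch: list the MD5 of every file in the downloaded model
# directory so an inconsistent file can be spotted across download attempts.
# MODEL_DIR is a placeholder for the actual download path.
import hashlib
from pathlib import Path

MODEL_DIR = Path("Llama2-70b-hf")  # placeholder: adjust to your download path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 so 4+ GB shards don't need to fit in RAM."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for p in sorted(MODEL_DIR.iterdir()):
    if p.is_file():
        print(f"{md5sum(p)}  {p.name}  {p.stat().st_size} bytes")
```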

I tried to go ahead and train the model anyway, but I get a TensorStore error:

[rank0]: ValueError: FAILED_PRECONDITION: Error reading local file "tmp/tmp6e6zpd3l/model_weights/model.decoder.layers.mlp.linear_fc2.weight/63.0.0": Not enough data: expected at least 469762048; at byte 349175808 [source locations='tensorstore/internal/riegeli/array_endian_codec.cc:218\ntensorstore/driver/zarr/metadata.cc:484\ntensorstore/internal/cache/kvs_backed_chunk_cache.cc:62\ntensorstore/internal/cache/kvs_backed_cache.h:208']
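The "Not enough data" error usually points at a truncated file. A hedged sketch for checking whether each downloaded safetensors shard at least parses — safetensors normally validates the declared tensor offsets against the real file size when a shard is opened, so a truncated shard typically fails here. `MODEL_DIR` is again a placeholder:

```python
# Sketch: try to open every shard with safetensors; a truncated shard
# typically raises because the header's tensor offsets no longer match
# the actual file size. MODEL_DIR is a placeholder path.
from pathlib import Path
from safetensors import safe_open

MODEL_DIR = Path("Llama2-70b-hf")  # placeholder: the downloaded checkpoint

for shard in sorted(MODEL_DIR.glob("model-*.safetensors")):
    try:
        with safe_open(shard, framework="pt") as f:
            num_tensors = len(list(f.keys()))
        print(f"OK    {shard.name}: {num_tensors} tensors")
    except Exception as exc:
        print(f"FAIL  {shard.name}: {exc}")
```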

You can see the sizes of the files here:

$ ls -lh
total 129G
-rw-r--r-- 1 usr.grp usr.grp  846 Oct 16 10:35 config.json
-rw-r--r-- 1 usr.grp usr.grp 1.2K Oct 16 10:35 convert.py
-rw-r--r-- 1 usr.grp usr.grp  240 Oct 16 10:35 generation_config.json
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:39 model-00001-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:39 model-00002-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 10:39 model-00003-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 10:53 model-00004-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:39 model-00005-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:54 model-00006-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:52 model-00007-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 10:55 model-00008-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 10:55 model-00009-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:52 model-00010-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:52 model-00011-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:52 model-00012-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 10:53 model-00013-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 10:39 model-00014-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:39 model-00015-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 10:39 model-00016-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 08:31 model-00017-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 08:32 model-00018-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 08:32 model-00019-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 08:31 model-00020-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 08:32 model-00021-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 08:32 model-00022-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 08:33 model-00023-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 08:32 model-00024-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 08:32 model-00025-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 08:32 model-00026-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.4G Oct 16 08:32 model-00027-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 4.7G Oct 16 08:33 model-00028-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp 3.6G Oct 16 08:32 model-00029-of-00029.safetensors
-rw-r--r-- 1 usr.grp usr.grp  70K Oct 16 08:31 modeling_llama.py
-rw-r--r-- 1 usr.grp usr.grp  46K Oct 16 08:31 model.safetensors.index.json
-rw-r--r-- 1 usr.grp usr.grp   24 Oct 16 10:35 README.md
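One way to confirm whether a shard came down short is to compare the local sizes against what the Hub reports for the repo. A sketch assuming huggingface_hub is installed; the repo id and local path are placeholders and should be replaced with whatever download_model.py actually pulls from:

```python
# Sketch: compare on-disk file sizes with the sizes the Hugging Face Hub
# reports. REPO_ID and MODEL_DIR are placeholders. Files that exist only on
# the Hub (e.g. .gitattributes) will simply show up as MISSING.
from pathlib import Path
from huggingface_hub import HfApi

REPO_ID = "meta-llama/Llama-2-70b-hf"  # placeholder: use the repo from download_model.py
MODEL_DIR = Path("Llama2-70b-hf")      # placeholder local path

info = HfApi().model_info(REPO_ID, files_metadata=True)
expected = {s.rfilename: s.size for s in info.siblings}

for name, size in sorted(expected.items()):
    local = MODEL_DIR / name
    if not local.exists():
        print(f"MISSING        {name}")
    elif local.stat().st_size != size:
        print(f"SIZE MISMATCH  {name}: local {local.stat().st_size}, hub {size}")
    else:
        print(f"OK             {name}")
```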

I'm using huggingface_hub==0.24.0. Do you know what could be causing the issue? I've redownloaded the model several times.

wahabk commented 6 days ago

Hello, I've tried with several versions of huggingface_hub. Some of them download with symlinks to ~/.cache and some just use the local directory. All of them fail at the hash assertion.
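On the symlink question: one option is to force a clean download into a plain directory so the hash runs over real files rather than cache symlinks. A sketch with placeholder repo id and path; note that `local_dir` behaviour has changed across huggingface_hub releases (newer versions write real files into `local_dir` by default and deprecate `local_dir_use_symlinks`):

```python
# Sketch: re-download into a plain directory with no cache symlinks.
# repo_id and local_dir are placeholders; recent huggingface_hub versions
# already materialise real files in local_dir, so no symlink flag is needed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-70b-hf",  # placeholder: use the repo from download_model.py
    local_dir="Llama2-70b-hf",            # placeholder download target
    force_download=True,                  # ignore any partially cached copies
)
```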

Any advice would be appreciated, as I'm currently stuck on the FAILED_PRECONDITION: Error reading local file error. I've tried downloading both inside and outside a container. This is on a GH200 aarch64 machine. Let me know if you need more details.