google-deepmind / recurrentgemma

Open weights language model from Google DeepMind, based on Griffin.
Apache License 2.0

[Bug] ValueError: Truncated Zstd-compressed stream error when running fine_tuning_tutorial_jax.ipynb on CPU #8

Closed Sunwood-ai-labs closed 2 months ago

Sunwood-ai-labs commented 3 months ago

🐛 Bug Description

When running the fine_tuning_tutorial_jax.ipynb notebook on a CPU in Google Colab, I encountered the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-7c1a19ca56fd> in <cell line: 2>()
      1 # Load parameters
----> 2 params =  recurrentgemma.load_parameters(ckpt_path, "single_device")
      3 config = recurrentgemma.GriffinConfig.from_flax_params_or_variables(
      4     params,
      5     preset=recurrentgemma.Preset.RECURRENT_GEMMA_2B_V1,

20 frames
/usr/lib/python3.10/asyncio/futures.py in result(self)
    199         self.__log_traceback = False
    200         if self._exception is not None:
--> 201             raise self._exception.with_traceback(self._exception_tb)
    202         return self._result
    203 

ValueError: FAILED_PRECONDITION: Error reading "blocks.0.recurrent_block.rg_lru.a_gate.b/0.0" in OCDBT database at local file "/root/.cache/kagglehub/models/google/recurrentgemma/flax/2b-it/1/2b-it/": Truncated Zstd-compressed stream; at byte 0; at uncompressed byte 0 [source locations='tensorstore/internal/riegeli/array_endian_codec.cc:212\ntensorstore/driver/zarr/metadata.cc:481\ntensorstore/internal/cache/kvs_backed_chunk_cache.cc:52\ntensorstore/internal/cache/kvs_backed_cache.h:208']

The error occurs when loading the parameters using the following code:

params = recurrentgemma.load_parameters(ckpt_path, "single_device")

🌍 Environment

📝 Steps to Reproduce

  1. Open the fine_tuning_tutorial_jax.ipynb notebook in Google Colab.
  2. Set the runtime type to CPU.
  3. Run the notebook cells up to the point where the parameters are loaded.

Please let me know if you need any further information or if there are any steps I can take to assist in resolving this issue. Thank you for your attention to this matter.

botev commented 3 months ago

Thanks very much for raising this issue. There seems to be a problem with the latest kagglehub package on PyPI (version 0.2.6, for example): for some reason it does not download the Orbax checkpoint data files.
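One way to check whether the cached download is incomplete is to scan the checkpoint directory for empty files. The following is a minimal sketch using only the Python standard library. The helper names `find_empty_files` and `summarize` are hypothetical, and a zero-byte or missing data file is only a plausible symptom of this problem, not a confirmed diagnosis.

```python
from pathlib import Path


def find_empty_files(ckpt_dir):
    """Return sorted relative paths of zero-byte files under ckpt_dir."""
    root = Path(ckpt_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )


def summarize(ckpt_dir):
    """Return (file_count, total_bytes) for all files under ckpt_dir."""
    files = [p for p in Path(ckpt_dir).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

In the notebook, calling `find_empty_files(ckpt_path)` and `summarize(ckpt_path)` with the same `ckpt_path` passed to `recurrentgemma.load_parameters` should show whether the directory holds far less data than expected for a 2B checkpoint.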

As an interim workaround, uninstall kagglehub with your package manager (pip in the notebook) and reinstall it pinned to version 0.2.5, e.g. pip install kagglehub==0.2.5.

In the meantime, we will investigate the issue with the Kaggle team.
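The pinning step can also be scripted from a notebook cell. This is a rough sketch rather than an official procedure: the helper names `installed_version`, `pin_command`, and `ensure_pinned` are hypothetical, and the only detail taken from this thread is the pinned version 0.2.5.

```python
import subprocess
import sys
from importlib import metadata

PINNED_VERSION = "0.2.5"  # version suggested in this thread


def installed_version(package="kagglehub"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_command(package="kagglehub", version=PINNED_VERSION):
    """Build the pip invocation that installs exactly the pinned version."""
    return [sys.executable, "-m", "pip", "install", f"{package}=={version}"]


def ensure_pinned(package="kagglehub", version=PINNED_VERSION):
    """Run pip only when the installed version differs from the pin.

    Returns True if pip was invoked, False if the pin was already satisfied.
    """
    if installed_version(package) == version:
        return False
    subprocess.check_call(pin_command(package, version))
    return True
```

Calling `ensure_pinned()` before downloading the weights applies the pin. If kagglehub was already imported in the session, the runtime likely needs a restart for the pinned version to take effect, and a previously cached incomplete download may have to be removed before retrying (both are assumptions, not something confirmed in this thread).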

botev commented 2 months ago

This should now be resolved. Closing due to inactivity.