keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0

cannot create URI for a fine-tuned Gemma model for Kaggle upload #1697

Closed wkambale closed 3 weeks ago

wkambale commented 1 month ago

When running the code below in Google Colab to fine-tune the Gemma 2B model, the runtime (A100 GPU, High-RAM) keeps disconnecting and the cell keeps running forever:

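from google.colab import userdata  # Colab secrets API

# Kaggle username stored as a Colab secret
kaggle_username = userdata.get('KAGGLE_USERNAME')
model_name = "gemma"
variation_name = "new_variation_name"  # new_variation_name is a placeholder

# Kaggle model URI: kaggle://<username>/<model>/keras/<variation>
uri = f"kaggle://{kaggle_username}/{model_name}/keras/{variation_name}"
uri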

This Colab Notebook reproduces this bug. At the moment, I'm running a copy with a new dataset to fine-tune a new model.

The code above should run successfully so that the new model can be uploaded to Kaggle with the code below:

keras_nlp.upload_preset(uri, preset)

Note: The preset is saved one cell before the code above:

preset = "./new_variation_name"
gemma_lm.save_to_preset(preset)
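
Because the upload call reads from the local preset directory, one quick check after a runtime disconnect is whether that directory still exists and contains files. The sketch below is only an illustration: `summarize_preset_dir` is a hypothetical helper, and the temporary directory and `example.bin` file are dummy stand-ins for `./new_variation_name`, not the real preset contents.

```python
import tempfile
from pathlib import Path


def summarize_preset_dir(preset_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for files under preset_dir."""
    root = Path(preset_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"preset directory not found: {root}")
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Demo on a throwaway directory with one dummy 4-byte file (assumption:
# in the notebook you would pass "./new_variation_name" instead).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.bin").write_bytes(b"\x00" * 4)
    summary = summarize_preset_dir(tmp)
    print(summary)
```

An empty list or a `FileNotFoundError` here would suggest the preset needs to be saved again before any upload is attempted.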

SamanehSaadat commented 1 month ago

Hi @wkambale!

Thanks for reporting the issue and thanks for sharing a repro colab.

When I ran your colab, the upload cell completed successfully and printed these logs: [image: upload logs]

The model is uploaded here: https://www.kaggle.com/models/smnsdt/gemma/keras/medical_gemma (I only trained on a small subset of data (100 samples) but that shouldn't have any impact on the upload.)

Does that upload cell print any logs at all? I'd like to understand where it gets stuck, e.g. before the upload starts or in the middle of the upload.
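
One way to narrow that down, as a rough sketch rather than anything built into KerasNLP, is to wrap each step in a small timing logger. `log_step` is a hypothetical helper, and the `time.sleep` calls are stand-ins for the real `gemma_lm.save_to_preset(preset)` and `keras_nlp.upload_preset(uri, preset)` calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def log_step(name):
    """Print when a step starts and how long it ran, even if it raises."""
    start = time.monotonic()
    print(f"[start] {name}", flush=True)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        print(f"[done]  {name} after {elapsed:.1f}s", flush=True)


# Stand-ins for the real notebook calls (assumption: replace each sleep
# with the corresponding save or upload call).
with log_step("save preset"):
    time.sleep(0.01)
with log_step("upload preset"):
    time.sleep(0.01)
```

If a `[start]` line for a step is printed but its `[done]` line never appears, the stall is somewhere inside that step.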

SamanehSaadat commented 1 month ago

Another thing worth trying is uploading to Hugging Face Hub to see whether the same issue happens. That can help us understand whether the issue is on the Keras side or the Kaggle side.

To upload to Hugging Face, you can change your URI to hf://<HF_USERNAME>/<MODEL>.
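
As a minimal sketch of that switch, the helper below builds either URI form from the same pieces. The function name `build_preset_uri` and the `some_user` username are hypothetical; only the `kaggle://{username}/{model}/keras/{variation}` and `hf://<HF_USERNAME>/<MODEL>` shapes come from this thread.

```python
def build_preset_uri(target, username, model_name, variation_name=None):
    """Build an upload URI string for Kaggle or Hugging Face (illustrative helper)."""
    if not username:
        # An empty or None username would otherwise yield a malformed URI
        # such as "kaggle://None/...".
        raise ValueError("username is empty; check the stored secret")
    if target == "kaggle":
        if not variation_name:
            raise ValueError("Kaggle URIs need a variation name")
        return f"kaggle://{username}/{model_name}/keras/{variation_name}"
    if target == "hf":
        return f"hf://{username}/{model_name}"
    raise ValueError(f"unknown target: {target!r}")


kaggle_uri = build_preset_uri("kaggle", "some_user", "gemma", "new_variation_name")
hf_uri = build_preset_uri("hf", "some_user", "gemma")
print(kaggle_uri)
print(hf_uri)
```

Either string would then be passed as the first argument of `keras_nlp.upload_preset(uri, preset)`, with the local preset directory unchanged.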

wkambale commented 1 month ago

hi @SamanehSaadat,

i think (or realize) that the issue could be this and this. it seems Colab Pro and Colab Pro+ users are experiencing runtime issues.

like i said, my hosted runtime (A100 GPU, High-RAM) keeps disconnecting and does not automatically reconnect, which keeps the cell running forever.

i'll try HF and see. thanks 👍

SamanehSaadat commented 1 month ago

I see! Thanks for sharing the links! If the issue is on the Colab side, maybe we should wait for their response and test again once it is fixed.

wkambale commented 1 month ago

yeah, sure. thanks. i will share an update as soon as there's one.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.