ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

Error in access permission to start training from last.pt in Google Colab #565

Closed maxwouters closed 8 months ago

maxwouters commented 9 months ago

HUB Component

Training

Bug

I wanted to train my model in Google Colab. I had to restart the session a couple of times, and each time training resumed from the last epoch. Now I get an error that says I no longer have access to last.pt.

I receive this error: WARNING ⚠️ Unable to automatically guess model task, assuming 'task=detect'. Explicitly define task for your model, i.e. 'task=detect', 'segment', 'classify','pose' or 'obb'.

UnpicklingError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/ultralytics/engine/trainer.py in check_resume(self, overrides)
    632 # Check that resume data YAML exists, otherwise strip to force re-download of dataset
--> 633 ckpt_args = attempt_load_weights(last).args
    634 if not Path(ckpt_args["data"]).exists():

7 frames
UnpicklingError: invalid load key, '<'.

The above exception was the direct cause of the following exception:

FileNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/ultralytics/engine/trainer.py in check_resume(self, overrides)
    643
    644 except Exception as e:
--> 645 raise FileNotFoundError(
    646 "Resume checkpoint not found. Please pass a valid checkpoint to resume from, "
    647 "i.e. 'yolo train resume model=path/to/last.pt'"

FileNotFoundError: Resume checkpoint not found. Please pass a valid checkpoint to resume from, i.e. 'yolo train resume model=path/to/last.pt'
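The traceback shows the FileNotFoundError being raised from inside an `except Exception as e:` handler (trainer.py lines 644-645), so the "not found" message may hide a different underlying failure, here an UnpicklingError. The sketch below is not the Ultralytics code: `load_checkpoint`, `check_resume`, and `root_cause` are hypothetical stand-ins that only imitate the pattern and show how the original exception can be recovered through `__cause__`.

```python
import pickle
import tempfile
from pathlib import Path


def load_checkpoint(path):
    """Hypothetical stand-in for a loader that unpickles a checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def check_resume(path):
    """Imitate the pattern in the traceback: any failure becomes FileNotFoundError."""
    try:
        return load_checkpoint(path)
    except Exception as e:
        raise FileNotFoundError("Resume checkpoint not found.") from e


def root_cause(exc):
    """Follow the __cause__ chain back to the exception that started it."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fake = Path(tmp) / "last.pt"
        # The file exists, but it holds markup instead of a pickled checkpoint.
        fake.write_bytes(b"<Error>AccessDenied</Error>")
        try:
            check_resume(fake)
        except FileNotFoundError as err:
            print("raised:", type(err).__name__)
            print("root cause:", repr(root_cause(err)))
```

Read this way, the file was present but could not be unpickled, which points at its contents and not at a missing path.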

Environment

macOS Ventura; Google Colab

Minimal Reproducible Example

No response

Additional

No response

github-actions[bot] commented 9 months ago

👋 Hello @maxwouters, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

UltralyticsAssistant commented 9 months ago

@maxwouters hello! It seems like you're encountering an issue with resuming training from a checkpoint in Google Colab. The error message indicates that the last.pt file is either not found or cannot be accessed.

Here are a few steps you can take to troubleshoot this issue:

  1. Check File Path: Ensure that the path to last.pt is correct and that the file is indeed present in the directory you're working from.

  2. File Permissions: If the file exists, verify that you have the correct permissions to access it. In Google Colab, you might need to remount your Google Drive or re-establish any symbolic links if you're using them.

  3. Corrupted File: The UnpicklingError suggests that the file might be corrupted. If you have a backup of last.pt, try using that instead.

  4. Explicit Task Definition: The warning message suggests defining the task explicitly. Make sure to specify the task (e.g., task=detect) when setting up your training configuration.

If you continue to face issues, please refer to the Ultralytics HUB Docs for more detailed guidance on setting up and resuming training. If the problem persists, feel free to provide additional details, and we'll be happy to help you further. Keep up the great work with your model training! 😊👍
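For the path and corrupted-file checks above, one quick test is to look at the first bytes of the file without unpickling it. The message `invalid load key, '<'` means the data handed to the unpickler began with `<`, which is what an HTML or XML error page looks like, whereas checkpoints written by `torch.save` since PyTorch 1.6 are zip archives by default and start with `PK`. This is a minimal sketch under those assumptions; `sniff_checkpoint` and its return labels are hypothetical names, not part of Ultralytics.

```python
import tempfile
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"  # local file header that opens a non-empty zip archive


def sniff_checkpoint(path):
    """Classify a file by its leading bytes; never unpickles anything."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as f:
        head = f.read(64)
    if not head:
        return "empty"
    if head.startswith(ZIP_MAGIC):
        return "zip"  # plausible torch.save checkpoint
    if head.lstrip().startswith(b"<"):
        return "markup"  # likely a saved HTML/XML error response
    return "unknown"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bad = Path(tmp) / "epoch-90.pt"
        bad.write_bytes(b"<?xml version='1.0'?><Error/>")
        print(sniff_checkpoint(bad))
        print(sniff_checkpoint(Path(tmp) / "last.pt"))
```

A "markup" result would suggest re-downloading the weights instead of retrying the resume with the same file.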

kalenmike commented 9 months ago

@maxwouters It looks like your weights may be corrupted. Are you able to share your model id so that I can look into this in more detail?

https://hub.ultralytics.com/models/<model-id>

maxwouters commented 9 months ago

This is the link to my model: https://hub.ultralytics.com/models/djv7U1Yyt75Xdq9uOf3e?tab=train

kalenmike commented 9 months ago

@maxwouters I can't find any issue there. Could you share the entire output you are getting from Colab taking care to remove your API Key?

maxwouters commented 9 months ago

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/djv7U1Yyt75Xdq9uOf3e 🚀
Downloading https://storage.googleapis.com/ultralytics-hub.appspot.com/users/IYbXO7FhomYGie2aWkawZotM55k1/models/djv7U1Yyt75Xdq9uOf3e/epoch-90.pt to 'epoch-90.pt'...
⚠️ Download failure, retrying 1/3 https://storage.googleapis.com/ultralytics-hub.appspot.com/users/IYbXO7FhomYGie2aWkawZotM55k1/models/djv7U1Yyt75Xdq9uOf3e/epoch-90.pt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=firebase-adminsdk-jsjt9%40ultralytics-hub.iam.gserviceaccount.com%2F20240208%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240208T112643Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=6ba4092aea97ba365bcbeae80b0949351b9ae507fe8530d33d55e3a1b09a6a4e9574e80c3c8a29b203c9b59b19e4e1d736a7c8d9aa70d2e8ec03a154dd949d3ae121036e217397487fe674cfdc1164115eb2efa5a414dbc84a131cfea2ceeac30f6a637b6382935c834bb42cdc6fb8857965f2250aa03fe1feaebaf2c8dccef758fa016a3042173ba30bf58fa7d9f71c2e36c58164ee65f73ad82881094488c5419a7a51a53b1dd99f1917f1cc062e16413c4e782a524929ee27d63f2486af55cc8cff225e29e634cd017df550c79b37e3d204b5fd499386c8916d57584a9b75bed18e24a051597e842b6f367677fdca8db66396cf52a8ca26eea582ef86c652...
WARNING ⚠️ Unable to automatically guess model task, assuming 'task=detect'. Explicitly define task for your model, i.e. 'task=detect', 'segment', 'classify','pose' or 'obb'.

UnpicklingError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/ultralytics/engine/trainer.py in check_resume(self, overrides)
    632 # Check that resume data YAML exists, otherwise strip to force re-download of dataset
--> 633 ckpt_args = attempt_load_weights(last).args
    634 if not Path(ckpt_args["data"]).exists():

7 frames
UnpicklingError: invalid load key, '<'.

The above exception was the direct cause of the following exception:

FileNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/ultralytics/engine/trainer.py in check_resume(self, overrides)
    643
    644 except Exception as e:
--> 645 raise FileNotFoundError(
    646 "Resume checkpoint not found. Please pass a valid checkpoint to resume from, "
    647 "i.e. 'yolo train resume model=path/to/last.pt'"

FileNotFoundError: Resume checkpoint not found. Please pass a valid checkpoint to resume from, i.e. 'yolo train resume model=path/to/last.pt'

The googleapi call returns the following message:

AccessDenied Access denied.
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).

UltralyticsAssistant commented 9 months ago

@maxwouters thank you for providing the detailed error message. It appears that there's an issue with downloading the checkpoint file due to access permissions. The error message from the Google API indicates that the request is being made by an anonymous caller, which suggests that the authentication may not be properly configured or that the session has expired.

Here's what you can do to resolve this:

  1. Re-authenticate: Make sure that you are authenticated in Google Colab and have the necessary permissions to access the file in Google Cloud Storage.

  2. Check Model Access: Verify that the model is shared with your Google account if it's in a shared workspace or that you have the correct access rights if it's in your private workspace.

  3. Session Timeout: If you had a long-running session, it's possible that the authentication token has expired. Try re-authenticating or restarting your Colab session to refresh the credentials.

  4. Explicit Task Definition: As the warning suggests, explicitly define the task in your training configuration to avoid any ambiguity.

If you've checked these points and the issue persists, you may need to look into the permissions set on the Google Cloud Storage bucket to ensure that your authenticated user has the storage.objects.get permission for the object in question.

Keep in mind that sharing your API key or any sensitive information publicly can compromise the security of your data, so always be cautious about that. If you need further assistance, please reach out with additional details while ensuring your API key remains confidential. Good luck with your training! 🚀
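On the session-timeout point, the retried download URL in the log carries `X-Goog-Date=20240208T112643Z` and `X-Goog-Expires=900`, the two query parameters of a V4-signed Cloud Storage URL that state when it was issued and for how many seconds it stays valid. The sketch below reads that window from a URL; `signed_url_expiry` is a hypothetical helper and the example host is a placeholder. The thread does not establish that expiry caused this particular failure, so treat it as one more thing to rule out.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the UTC expiry of a V4-signed URL, or None if the URL is unsigned."""
    query = parse_qs(urlsplit(url).query)
    if "X-Goog-Date" not in query or "X-Goog-Expires" not in query:
        return None
    issued = datetime.strptime(query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    issued = issued.replace(tzinfo=timezone.utc)
    return issued + timedelta(seconds=int(query["X-Goog-Expires"][0]))


if __name__ == "__main__":
    # Placeholder host and path; the two parameter values are copied from the log.
    url = (
        "https://storage.example/epoch-90.pt"
        "?X-Goog-Date=20240208T112643Z&X-Goog-Expires=900"
    )
    expiry = signed_url_expiry(url)
    print("expires at:", expiry.isoformat())
    print("expired now:", datetime.now(timezone.utc) > expiry)
```

With the values from the log, the link was valid for 15 minutes from 11:26:43 UTC on 2024-02-08. A URL opened without its query string is treated as an unsigned, anonymous request, and the helper returns None for it.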

github-actions[bot] commented 8 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are always welcome!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐