ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

Cannot connect to GPU backend #545

Closed wntun closed 8 months ago

wntun commented 10 months ago

HUB Component

Training

Bug

Hello, I am a HUB Pro user. I have been training a model on VisDrone for 200 epochs (https://hub.ultralytics.com/models/zoOY84tRcuDUCN5y2dbk). Training has stopped midway a few times, but I was able to resume it. Sometimes I couldn't resume right away, and it showed the error message "Cannot connect to GPU backend". I would like to know whether there is anything I can do to keep the training from stopping or, if it does stop, to resume it immediately. Thanks in advance! [image]

Environment

Ultralytics HUB Version: v0.1.33
Client User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Operating System: Win32
Browser Window Size: 1920 x 959
Server Timestamp: 1705914252

Minimal Reproducible Example

No response

Additional

No response

github-actions[bot] commented 10 months ago

👋 Hello @wntun, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

kalenmike commented 10 months ago

@wntun Thanks for asking the question. Colab does not allow you to run code unattended. If you are not interacting with the webpage, your instance will be stopped. This restriction is put in place by Colab. The GPU message you are seeing is also related to Colab. You can upgrade to a paid Colab account, which may allow a higher usage tier.

Your options are to ensure you are interacting with Colab during the entire training or train using your own agent.
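For reference, training with your own agent generally means running the HUB-provided snippet on a machine you control instead of in Colab. The following is a minimal sketch, assuming the `ultralytics` package is installed locally. `API_KEY` and `MODEL_ID` are placeholders for values copied from your HUB account, and the helper names `hub_model_url` and `train_with_own_agent` are hypothetical, introduced only for illustration.

```python
HUB_MODELS_URL = "https://hub.ultralytics.com/models"


def hub_model_url(model_id: str) -> str:
    """Build the HUB model URL from a model ID (the last segment of the model page URL)."""
    return f"{HUB_MODELS_URL}/{model_id.strip()}"


def train_with_own_agent(api_key: str, model_id: str):
    """Authenticate with HUB and start training on the local machine."""
    # Imported lazily so hub_model_url() works even without ultralytics installed.
    from ultralytics import YOLO, hub

    hub.login(api_key)  # authenticate this machine against your HUB account
    model = YOLO(hub_model_url(model_id))  # load the model configured in HUB
    return model.train()  # the HUB-provided snippet calls train() with no arguments


# Example call (placeholders, replace with your own values before running):
# train_with_own_agent(api_key="API_KEY", model_id="MODEL_ID")
```

Running this on your own hardware avoids Colab's idle-session limits, but it requires a working local PyTorch and GPU setup.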

wntun commented 10 months ago

I see. I tried to run training with my own agent and got a runtime error. I used the same code as in the Colab notebook provided by Ultralytics. [image] Could you please help me figure out how to solve this?

wntun commented 10 months ago

Another thing is that I tried running in Colab with a different Gmail account, which let me connect to the GPU right away. However, I cannot resume the training successfully; it results in the following error. [image] I'm a novice Google Colab user. :( Have a nice day, thanks.

wntun commented 10 months ago

I don't think this is caused by using a different Gmail account. I tried with the same account I had used from the beginning, hit the same error, and still cannot resume the training. [image]

UltralyticsAssistant commented 10 months ago

@wntun I'm sorry to hear you're experiencing issues with resuming training on your own agent and with Google Colab. For the runtime error you encountered when using your own agent, it's important to ensure that your local environment is set up correctly. This includes having the right dependencies installed and the hardware properly configured to support the training process.

Regarding the errors you're facing with Google Colab, it's possible that there might be an issue with the state of the training checkpoint or the way the session is being resumed. Make sure that the checkpoint files are correctly saved and accessible for the training to resume.

For both cases, I recommend checking the following:

If the problem persists, please provide detailed error messages and descriptions of the steps you've taken so far. This will help in diagnosing the issue more effectively.

Remember to refer to the Ultralytics HUB Docs for guidance on setting up your environment and troubleshooting common issues. Have a great day!
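As an illustration of the local environment check mentioned above, the sketch below reports whether `ultralytics` and `torch` are importable and whether PyTorch can see a CUDA GPU. It is only a starting point under stated assumptions: the function name `environment_report` and the choice of fields are hypothetical, and it does not diagnose the specific runtime error shown in the screenshots.

```python
import importlib.util
import platform


def environment_report() -> dict:
    """Collect basic facts about the local training environment."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "ultralytics_installed": importlib.util.find_spec("ultralytics") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,  # stays None when torch is not installed
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            report["gpu_name"] = torch.cuda.get_device_name(0)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Pasting this output alongside the full error message makes it easier for others to tell a missing dependency from a GPU configuration problem.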

wntun commented 10 months ago

Thanks for your suggestions. I will look into the environment setup in more detail. However, I'm not sure how to check whether the checkpoint files are corrupted. I'm using Ultralytics HUB Pro, and saving checkpoints is handled entirely inside the platform. Please give me a few more details on how to check and solve this.
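One rough way to screen a downloaded checkpoint for obvious corruption, without loading it, is to check its size and container format. The sketch below assumes the `.pt` file was written by `torch.save` in its default zip-based format (PyTorch 1.6 and later). The 100 KB threshold and the function name `checkpoint_problems` are assumptions chosen for illustration, not values taken from HUB.

```python
import zipfile
from pathlib import Path

MIN_PLAUSIBLE_BYTES = 100 * 1024  # assumed lower bound; real model checkpoints are far larger


def checkpoint_problems(path, min_bytes=MIN_PLAUSIBLE_BYTES):
    """Return a list of human-readable problems found with a checkpoint file."""
    p = Path(path)
    if not p.is_file():
        return [f"{p} does not exist or is not a file"]

    problems = []
    size = p.stat().st_size
    if size < min_bytes:
        problems.append(f"{p.name} is only {size} bytes (expected at least {min_bytes})")
    # torch.save writes a zip archive by default, so a non-zip file is suspicious.
    if not zipfile.is_zipfile(str(p)):
        problems.append(f"{p.name} is not a zip archive, so it was likely truncated or overwritten")
    return problems


# Example call (hypothetical local path):
# print(checkpoint_problems("epoch-164.pt"))
```

An empty list only means the file passed these two cheap checks; it does not prove the weights inside are usable.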

UltralyticsAssistant commented 10 months ago

@wntun, I'm glad you're looking into the environment setup. Regarding the checkpoint files within Ultralytics HUB Pro, these are managed by the platform, and typically you wouldn't need to worry about their integrity.

However, if you suspect that the checkpoint files might be causing issues with resuming training, here's what you can do:

If you continue to face issues, it might be helpful to reach out to Ultralytics support with details of the error messages you're receiving and the steps you've taken. They can assist in checking the status of your checkpoint files and provide more specific guidance.

Keep in mind that as a Pro user, you have access to additional support resources, so don't hesitate to use them. Good luck, and I hope you're able to resolve the issue soon! 🍀

wntun commented 10 months ago

Yes, I didn't find the log in the HUB, but I saw an epoch-164.pt file in Colab. It is only 1KB, so I think the file is corrupted inside the HUB. If that is the case, I want to roll back and resume training from the nearest stable checkpoint. How can I do that? This is my model link: https://hub.ultralytics.com/models/zoOY84tRcuDUCN5y2dbk By the way, I confirmed that I have enough storage space and that the permissions are fine. Please see the detailed error messages in the pictures below. [image] [image] [image]

UltralyticsAssistant commented 10 months ago

@wntun, it seems like you've identified a potential issue with the checkpoint file. If the file size is only 1KB, it's likely that the file is indeed corrupted. To roll back and resume training from the nearest stable checkpoint, you should be able to select a previous checkpoint from within the Ultralytics HUB interface.

Here's a general idea of what you can do:

If you're unable to find this option or if the issue persists, I recommend reaching out directly to Ultralytics support through the appropriate channels provided for Pro users. Provide them with the details of the issue, the error messages, and the model link. They will be able to assist you further, including potentially restoring checkpoints if necessary.

I hope this helps, and you're back on track with your training soon! 🚀
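If earlier checkpoints are available as local files, picking the nearest plausible one can be automated. The sketch below assumes the files sit in one directory and follow the `epoch-N.pt` naming seen in this thread. The size threshold and the function name `latest_plausible_checkpoint` are hypothetical, and whether HUB exposes earlier checkpoints for download is not confirmed here.

```python
import re
from pathlib import Path

EPOCH_PATTERN = re.compile(r"^epoch-(\d+)\.pt$")


def latest_plausible_checkpoint(directory, min_bytes=100 * 1024):
    """Return the highest-epoch checkpoint whose size is at least min_bytes, or None."""
    candidates = []
    for path in Path(directory).iterdir():
        match = EPOCH_PATTERN.match(path.name)
        # Skip files that do not match the naming scheme or look truncated.
        if match and path.is_file() and path.stat().st_size >= min_bytes:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Example call (hypothetical directory):
# print(latest_plausible_checkpoint("checkpoints"))
```

In the open-source `ultralytics` package, a local run can be resumed with `YOLO(path).train(resume=True)`; whether a HUB-managed run accepts a substituted checkpoint is a question for Ultralytics support.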

wntun commented 10 months ago

Hello, I can't find the direct support channel. How do I do that? Thanks.

UltralyticsAssistant commented 10 months ago

@wntun I apologize for any confusion. As a Pro user, you should have access to priority support. If you're having trouble finding the direct support channel, please follow these steps:

  1. Log in to your Ultralytics HUB account.
  2. Navigate to the support or help section, which is typically accessible from the dashboard or account menu.
  3. Look for an option to contact support or submit a help request. This might be in the form of a support ticket system, email address, or a chat service.

If you still can't find a way to contact support directly through the HUB, you may consider reaching out through the official Ultralytics email provided for support inquiries, which is usually available on the Ultralytics website or documentation.

Provide them with a detailed description of your issue, any relevant error messages, and your model link. They should be able to guide you through the process of rolling back to a stable checkpoint and resuming your training.

I hope this helps, and you receive the assistance you need promptly! 🌟

wntun commented 10 months ago

I already checked the support section, but there is no ticket system at all. I'm gonna stop using the HUB after this month, and I don't think I will recommend HUB Pro to anyone. It is much better to use my own PC or simply combine Google Colab and Google Drive for training.

UltralyticsAssistant commented 10 months ago

@wntun, I'm truly sorry to hear about your experience and the frustration it has caused. It's unfortunate when technical issues impact your work, and I understand how important reliable support is for resolving such matters.

While I can't directly address the support system availability, I want to assure you that your feedback is valuable. The Ultralytics team is continuously working to improve the HUB and the support provided to users. If there's anything specific you'd like to be addressed or any feedback you wish to share about your experience, please feel free to provide that information, and I will ensure it is passed on to the team.

Your insights can be instrumental in enhancing the platform and the support process for all users. Thank you for taking the time to share your thoughts, and I hope that your future endeavors in model training are more seamless and successful, regardless of the platform you choose. 🌟

github-actions[bot] commented 8 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐