pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Abstract torch.device for GPU/NE/TPU computations in the cloud-based agent #116354

Open evgenart opened 6 months ago

evgenart commented 6 months ago

πŸš€ The feature, motivation and pitch

It would be cool if PyTorch had something like an agent that we could spin up in the cloud, or even a multi-user service, so that instead of specifying cpu or gpu we could just specify a cloud URL for the computations. The cloud provider could then charge us on a per-minute basis, or in any other flexible way.

e.g., GPT figured it out πŸ˜‚:

torch.device('cloud-gpu', url='https://gpu_in_the_cloud.aws.com:12345/username', password='***', max_runtime='12m')

Alternatives

ChatGPT got it:

That's an interesting concept! While PyTorch doesn't currently offer a built-in feature like this, the idea of having an "agent" on a cloud VM that you could easily target for computations just like switching between CPU and GPU is quite innovative. This would essentially abstract away the complexities of remote computing, making it as simple as specifying a device in your PyTorch code.

Here's how such a feature might work conceptually:

  1. Cloud Agent Setup: You would install a specialized PyTorch agent on a cloud VM. This agent would be responsible for receiving computation tasks and running them on the VM's resources.

  2. Seamless Integration: In your local PyTorch script, you could specify the cloud resource similar to how you specify a device with torch.device. For instance, something like torch.device('cloud', url='https://gpu_in_the_cloud.aws.com:12345/username', password='***', max_runtime='12m').

  3. Data Transfer and Execution: The local PyTorch environment would handle data serialization and transfer to the cloud VM. The cloud agent would then execute the computations and return the results.

  4. Optimized Resource Management: The agent could also manage resources effectively, scaling up or down based on the workload.
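The four steps above could be sketched, purely illustratively, as a tiny in-process stand-in. Nothing here is a real PyTorch API: `CloudAgent` and `submit` are hypothetical names, and `pickle` round-trips stand in for the network serialization and transfer that a real agent would perform.

```python
import operator
import pickle


class CloudAgent:
    """Step 1 (hypothetical): an agent process on the cloud VM that
    receives serialized tasks and runs them on the VM's resources."""

    def execute(self, payload: bytes) -> bytes:
        fn, args = pickle.loads(payload)   # deserialize the incoming task
        result = fn(*args)                 # run it on the "cloud" side
        return pickle.dumps(result)        # serialize the result back


def submit(agent: CloudAgent, fn, *args):
    """Steps 2-3 (hypothetical): the client serializes the task, ships it
    to the agent, and deserializes the returned result."""
    payload = pickle.dumps((fn, args))             # outbound data transfer
    return pickle.loads(agent.execute(payload))    # remote execution + return trip


agent = CloudAgent()
# What a torch.device('cloud', ...) might ultimately dispatch through:
print(submit(agent, operator.add, 2, 3))  # -> 5
```

One realistic constraint the sketch surfaces: pickle can only serialize functions importable on both ends, which is roughly the contract an actual remote-execution protocol would also impose.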

Currently, to achieve something similar, you would need to use cloud-based IDEs or development environments that allow for remote execution of code, or set up a more manual process of data and script transfer to a cloud environment.

Your idea aligns well with the trends in cloud computing and machine learning, emphasizing ease of use and scalability. It's certainly something that could be explored further, either by the PyTorch team or by third-party developers building on top of PyTorch's existing capabilities.

Additional context

No response

jbschlosser commented 6 months ago

Added triage review for visibility / discussion.