hvgazula closed this issue 1 year ago.
gablab has access to a wide variety of GPUs. The V100s on dgx1 should work for most things, no?
I will give it a try one more time.
Thanks @satra. Indeed, the code worked fine on a tesla-v100 node on OpenMind, but @gaiborjosue confirmed that it failed on the EC2 instance (which has a T4). This definitely seems to be a problem area as we integrate more and more models with different CUDA/TF/Torch requirements.
Hello, indeed. It failed due to a CUDA version mismatch.
There is a mismatch between the GPU hardware supported by the CUDA versions pinned in the Dockerfiles and the cards we have available for testing. Requesting the matching hardware (which is outside the gablab-reserved resources) is taking a long time.
Building a zoo that hosts models from many different environments means we need a correspondingly broad range of GPU cards to support them. The limited range of cards we have access to restricts our ability to test every model. Am I missing a simpler solution to this problem?
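One low-cost mitigation (not a replacement for testing on real hardware, but it makes this failure mode explicit instead of a cryptic kernel error) would be a startup check that verifies the installed framework build actually targets the local GPU's compute capability. A minimal sketch, assuming a PyTorch-based image; `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are real PyTorch APIs, but the script itself is hypothetical and only a heuristic for this class of mismatch:

```python
import sys

import torch


def check_gpu_compatibility() -> None:
    """Fail fast if this torch build cannot target the local GPU.

    This would catch mismatches like an image built for sm_70 (V100)
    landing on an sm_75 card (T4) with an incompatible CUDA toolchain.
    """
    if not torch.cuda.is_available():
        sys.exit("No CUDA device visible; check drivers / --gpus flag.")

    major, minor = torch.cuda.get_device_capability(0)
    local_arch = f"sm_{major}{minor}"

    # Architectures this torch build was compiled for, e.g. ['sm_70', 'sm_75'].
    supported = torch.cuda.get_arch_list()

    # Note: builds that ship PTX can still JIT for newer cards, so a missing
    # arch is a strong warning sign rather than definitive proof of failure.
    if local_arch not in supported:
        sys.exit(
            f"GPU {torch.cuda.get_device_name(0)} ({local_arch}) is not in "
            f"this build's supported list {supported}; the model will likely "
            "fail at kernel launch. Rebuild the image with a matching "
            "CUDA/torch combination."
        )
    print(f"{local_arch} is supported by this torch build: OK")


if __name__ == "__main__":
    check_gpu_compatibility()
```

Running something like this as the container entrypoint's first step would have surfaced the T4-vs-V100 problem above before the model ever ran, though it does not solve the underlying need for a wider range of test hardware.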