hvgazula closed this issue 1 year ago.
gablab has access to a wide variety of GPUs. The V100s on dgx1 should work for most things, no?
I will give it a try one more time.
Thanks @satra. Indeed, the code worked fine on a tesla-v100 node on OpenMind, but @gaiborjosue confirmed that it failed on the EC2 instance (which has a T4). This definitely seems to be a problem area as we integrate more and more models with different CUDA/TF/Torch requirements.
Hello, indeed. It failed due to a CUDA version mismatch.
There is a mismatch between the GPU hardware supported by the CUDA versions pinned in the Dockerfiles and the cards we have available for testing. Requesting the matching hardware (which is outside the gablab-reserved resources) is taking a long time.
Building a zoo that hosts models from many different environments means we need a correspondingly broad range of GPU cards to support them. The limited range of cards we have access to restricts our ability to test every model. Am I missing a simpler solution to this problem?
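One low-cost mitigation (not a replacement for testing on real hardware, but it makes this failure mode explicit instead of a cryptic kernel error) would be a startup check that verifies the installed framework build actually targets the local GPU's compute capability. A minimal sketch, assuming a PyTorch-based image; `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are real PyTorch APIs, but the script itself is hypothetical and only a heuristic for this class of mismatch:

```python
import sys

import torch


def check_gpu_compatibility() -> None:
    """Fail fast if this torch build cannot target the local GPU.

    This would catch mismatches like an image built for sm_70 (V100)
    landing on an sm_75 card (T4) with an incompatible CUDA toolchain.
    """
    if not torch.cuda.is_available():
        sys.exit("No CUDA device visible; check drivers / --gpus flag.")

    major, minor = torch.cuda.get_device_capability(0)
    local_arch = f"sm_{major}{minor}"

    # Architectures this torch build was compiled for, e.g. ['sm_70', 'sm_75'].
    supported = torch.cuda.get_arch_list()

    # Note: builds that ship PTX can still JIT for newer cards, so a missing
    # arch is a strong warning sign rather than definitive proof of failure.
    if local_arch not in supported:
        sys.exit(
            f"GPU {torch.cuda.get_device_name(0)} ({local_arch}) is not in "
            f"this build's supported list {supported}; the model will likely "
            "fail at kernel launch. Rebuild the image with a matching "
            "CUDA/torch combination."
        )
    print(f"{local_arch} is supported by this torch build: OK")


if __name__ == "__main__":
    check_gpu_compatibility()
```

Running something like this as the container entrypoint's first step would have surfaced the T4-vs-V100 problem above before the model ever ran, though it does not solve the underlying need for a wider range of test hardware.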