neuronets / trained-models

Trained TensorFlow models for 3D image processing
https://neuronets.dev/trained-models

Mismatch between hardware referenced in Dockerfiles and hardware we have at our disposal #82

Closed hvgazula closed 1 year ago

hvgazula commented 1 year ago

There is a mismatch between the hardware supported by the CUDA versions referenced in the Dockerfiles and the GPUs we have at our disposal for testing. Requesting the needed resources (which are outside the gablab-reserved resources) is taking a lot of time.

Building a zoo that holds models from different environments means we need the corresponding range of hardware (GPU cards) to support those environments. However, the range of cards we have is limited, which restricts our ability to test all models. Is there a simpler solution I am missing?
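
As a partial mitigation, one could inventory which CUDA versions the model Dockerfiles actually pin, so they can be compared against the GPUs and drivers available for testing. This is only a minimal sketch: the glob pattern and the nvidia/cuda-style tag regex below are assumptions about the repository layout and base-image naming, not confirmed conventions of this repo.

```python
# Sketch: list the CUDA versions pinned by base images in the model Dockerfiles.
# Assumes Dockerfiles live somewhere under the repo root and use nvidia/cuda-style
# tags (e.g. "FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04").
import re
from pathlib import Path

CUDA_TAG = re.compile(r"^FROM\s+\S*cuda:(\d+\.\d+)", re.IGNORECASE | re.MULTILINE)

for dockerfile in sorted(Path(".").rglob("*Dockerfile*")):
    versions = CUDA_TAG.findall(dockerfile.read_text(errors="ignore"))
    if versions:
        print(f"{dockerfile}: CUDA {', '.join(sorted(set(versions)))}")
```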

satra commented 1 year ago

gablab has a variety of associated GPUs. The V100s on the dgx1 should work for most things, no?

hvgazula commented 1 year ago

I will give it one more try.

hvgazula commented 1 year ago

Thanks @satra. Indeed, the code worked fine on tesla-v100 on OpenMind. However, @gaiborjosue confirmed that it failed on the EC2 instance (which has a T4). This is likely to remain a problem area as we integrate more models with different CUDA/TF/Torch specs.

gaiborjosue commented 1 year ago

Hello, indeed. It failed due to a CUDA version mismatch.
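
For reference, a quick diagnostic can make this kind of mismatch visible by comparing the CUDA version TensorFlow was built against with the CUDA version the installed driver supports. This is a sketch, assuming TensorFlow 2.x inside the container and `nvidia-smi` on the host PATH; nothing in it is specific to this repository.

```python
# Sketch: surface the CUDA version TensorFlow was compiled against vs. the maximum
# CUDA version supported by the host driver (reported in the nvidia-smi header).
import subprocess
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TF built against CUDA", build.get("cuda_version"), "/ cuDNN", build.get("cudnn_version"))
print("Visible GPUs:", [gpu.name for gpu in tf.config.list_physical_devices("GPU")])

smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
driver_line = next((ln for ln in smi.splitlines() if "CUDA Version" in ln), "CUDA Version not reported")
print(driver_line.strip())
```

Broadly speaking, if the container's build CUDA version is newer than what the driver line reports, the container will need a newer host driver or an older CUDA base image.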