neptune-ai / open-solution-mapping-challenge

Open solution to the Mapping Challenge :earth_americas:
https://www.crowdai.org/challenges/mapping-challenge
MIT License

Environment pytorch-0.2.0-gpu-py3 is invalid. #181

Closed by carbonox-infernox 5 years ago

carbonox-infernox commented 5 years ago

I'm trying to run this through Neptune, but when I use the command

neptune send --worker gcp-large --environment pytorch-0.2.0-gpu-py3 main.py -- prepare_masks

I get

Environment pytorch-0.2.0-gpu-py3 is invalid.

It recommends going to https://docs.neptune.ml/advanced-topics/environments/#available-environments to find out which versions of PyTorch are supported, but PyTorch 0.2.0 isn't even listed there. How am I supposed to run PyTorch 0.2.0 on Neptune if Neptune doesn't support it?

carbonox-infernox commented 5 years ago

Replacing 0.2.0 with 0.3.1 in the command seems to have worked. I can't guarantee that, because while I've gotten past that particular error, I am still unable to run this for other reasons.

kamil-kaczmarek commented 5 years ago

Hi, @carbonox-infernox,

Please make sure that you have the code from the master branch. There, in the requirements.txt file, we have torch==0.3.1, which works fine in our solution.

I have already updated REPRODUCE_RESULTS.md with the correct PyTorch version. Thanks for pointing this out :)

Best, Kamil

carbonox-infernox commented 5 years ago

Thanks!

carbonox-infernox commented 5 years ago

@kamil-kaczmarek

If I pay for an instance of Neptune with one p100 GPU and use the command:

neptune send --worker gcp-large --environment pytorch-0.3.1-gpu-py3 main.py -- train --pipeline_name unet_weighted

Will the model make use of that GPU for training? I'm currently running the training on an EC2 CPU instance, but it's taking too long.

kamil-kaczmarek commented 5 years ago

Hi @carbonox-infernox,

If you want to use a single P100 GPU, use something like this:

neptune send --worker l-p100 --environment pytorch-0.3.1-gpu-py3 main.py -- train --pipeline_name unet_weighted

PyTorch will make use of this GPU to train your model. You can observe this live by clicking on the Monitoring tab (left side of the screen, when inside an experiment). Here is an example: SAL-1890
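
For reference, a quick way to confirm that the worker actually exposes a GPU to PyTorch is a check like the one below (a generic sketch, not part of this repo's code; these are standard torch.cuda utilities):

import torch

print(torch.cuda.is_available())      # True if CUDA is usable on the worker
print(torch.cuda.device_count())      # number of visible GPUs (1 on l-p100)
print(torch.cuda.get_device_name(0))  # e.g. the P100 device name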

Take a look at the other machines in the compute-resources section, because you can have up to four P100s or eight K80s to do the heavy lifting :rocket:

Best, Kamil Kaczmarek

carbonox-infernox commented 5 years ago

@kamil-kaczmarek cool, thanks! I will definitely consider using multiple GPUs. For me, it depends on how well the model scales across them.

I was intending to ask this at some point, but what would the optimal batch size be? On my EC2 instance, with the default batch size of 20, only about 1/3 of the CPU and 1/3 of the RAM are actually being utilized. Would increasing the batch size use more resources and train more quickly?

kamil-kaczmarek commented 5 years ago

@carbonox-infernox we regularly trained our models on four GPUs.

In general, I do not recommend training a large model on CPU -> it will take a very long time.

In principle, it is good to try several values of the batch size and observe memory and CPU utilization. Try a few values until you reach optimal usage.

The rule of thumb here is to use the maximal batch size that fits in GPU memory. Larger batches imply more stable training.
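
A rough way to find that maximum is to probe increasing batch sizes until CUDA runs out of memory. The sketch below is only illustrative (the tiny model, input shape, and batch sizes are placeholders, and it uses the old Variable API matching torch 0.3.1):

import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Conv2d(3, 16, 3).cuda()     # placeholder network standing in for the U-Net
for batch_size in (20, 40, 80, 160):
    try:
        images = Variable(torch.rand(batch_size, 3, 256, 256).cuda())
        model(images)                  # forward pass only, as a memory probe
        print(batch_size, 'fits in GPU memory')
    except RuntimeError as error:      # typically a CUDA out-of-memory error
        print(batch_size, 'does not fit:', error)
        break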

Best, Kamil

carbonox-infernox commented 5 years ago

@kamil-kaczmarek Nice, thanks. That's sort of what I expected. I'm really hoping Neptune finishes preparing masks soon (it's been 5:10 so far) so that I can start training and then go home haha.

As for what fits in memory: is this one of those cases where 2 GPUs that each have 16 GB, for example, will only effectively have 16 GB because they mirror each other (like in gaming)? Or will their memory effectively add up?

kamil-kaczmarek commented 5 years ago

Hi @carbonox-infernox,

When you train on multiple GPUs, each batch is divided into separate pieces, which are then loaded onto the GPUs. For example: your batch size is 40 and you train on 2 GPUs. Then each batch is split into two pieces of 20 examples each. One piece goes to GPU_1, the other piece goes to GPU_2.
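
A minimal illustration of that splitting, with a placeholder model instead of the actual U-Net (nn.DataParallel is the standard PyTorch mechanism for this; the shapes are made up):

import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.DataParallel(nn.Conv2d(3, 1, 3).cuda())    # placeholder network wrapped for multi-GPU
batch = Variable(torch.rand(40, 3, 256, 256).cuda())  # batch size 40
output = model(batch)                                  # with 2 GPUs: 20 examples go to each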

Best, Kamil Kaczmarek

carbonox-infernox commented 5 years ago

@kamil-kaczmarek Thanks!

Another question: will the GPUs be bottlenecked by the size of the GCP instance? For example, let's say that I want to use 4 P100 GPUs. Will it matter which GCP instance they are attached to?

kamil-kaczmarek commented 5 years ago

Hi @carbonox-infernox,

The more GPUs you have, the more GPU memory is available to you, so you can load larger batches (which is good for your training).

Usually (but not always) it is good to have no less system RAM than total GPU memory. However, a good hardware setup needs to be determined case by case; it strongly depends on the code that you run. In the Mapping Challenge we did not strictly track hardware requirements (but we trained on multiple GPUs).

Optimizing hardware is experimental, trial-and-error work -> it also depends on what you are trying to optimize: time or cost (or maybe something else).

Best, Kamil