juglab / n2v

This is the implementation of Noise2Void training.

Peculiar behavior of N2V2 prediction #147

Open somas193 opened 7 months ago

somas193 commented 7 months ago

I am benchmarking N2V2 prediction with files of different sizes (ranging from 330 MB to 5.13 GB) on a local server with an A6000 (48 GB GDDR6) GPU and on an HPC cluster with A100 (40 GB HBM2) GPUs. Performance is similar for the smaller files, but for the bigger files prediction is up to 50% faster on the local server, even though we expected the opposite. Does N2V2 use FP32 or FP16 in the backend, can it make use of Tensor Cores, and is there frequent data transfer between GPU memory, CPU cache, and RAM? Could someone provide details on this?

somas193 commented 7 months ago

UPDATE: After some analysis it turned out that there is a bug in one of the libraries we use to collect energy-usage values. The bug specifically affects HPC clusters and was a major contributor to slowing down prediction on the HPC cluster. I ran the benchmark again without that library and the numbers look a lot more promising, but the local server is still up to 16% faster than the HPC cluster for the bigger files, even though the HPC cluster has the superior GPU. Could the filesystem play a major role for the N2V model? We use a distributed filesystem on the HPC cluster and a SATA HDD on the local server to read and write images.
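For context, a quick way to compare the two storage backends independently of N2V is to time a raw sequential read of one of the benchmark files. Below is a minimal sketch (the path is a placeholder, not a path from the original post):

    import time

    path = "/path/to/benchmark_stack.tif"  # placeholder, replace with an actual benchmark file

    start = time.perf_counter()
    total_bytes = 0
    with open(path, "rb") as f:
        # Stream the file in 64 MB chunks to approximate a sequential read of a large stack
        while chunk := f.read(64 * 1024 * 1024):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    print(f"Read {total_bytes / 1e9:.2f} GB in {elapsed:.1f} s "
          f"({total_bytes / 1e9 / elapsed:.2f} GB/s)")

If the HPC filesystem shows comparable or better throughput here, the remaining difference is more likely to come from the prediction pipeline itself than from I/O.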

jdeschamps commented 7 months ago

Hi @somas193 !

Tough to answer; our own HPC has very slow read speeds from its centralized storage... That said, the current implementation of N2V simply loads all data into memory, so file access should only be a bottleneck at the very beginning of training. Once training has started, the limiting factor should be the transfer of data between RAM and the GPU.

How does the GPU utilization compare between HPC and local server?
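One way to get comparable utilization numbers on both machines is to poll the GPU while prediction is running. Here is a minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed; it is not part of n2v itself:

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # Sample GPU and memory-controller utilization once per second for one minute
    # while prediction runs in another process
    for _ in range(60):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"gpu {util.gpu:3d}%  mem-busy {util.memory:3d}%  "
              f"mem-used {mem.used / 1e9:.1f} GB")
        time.sleep(1)

    pynvml.nvmlShutdown()

Running this alongside prediction on both machines would show whether the A100 is actually being kept busy or is mostly waiting on data.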

tibuch commented 7 months ago

There is no FP16 magic going on.

somas193 commented 7 months ago

> Hi @somas193 !
>
> Tough to answer; our own HPC has very slow read speeds from its centralized storage... That said, the current implementation of N2V simply loads all data into memory, so file access should only be a bottleneck at the very beginning of training. Once training has started, the limiting factor should be the transfer of data between RAM and the GPU.
>
> How does the GPU utilization compare between HPC and local server?

Hi @jdeschamps, thanks for the reply. A small correction to my original post: we use N2V, not N2V2, since our data is a Z-stack. The storage on the HPC cluster isn't particularly slow, but it also isn't the fastest option available; it is based on the BeeGFS parallel file system. I don't have measured numbers for GPU utilization, but from what I observed it is quite high (hovering around 90% or more) on the local server. I have no utilization data for the HPC cluster. Also, I use the snippet below to control memory allocation on the GPU:

    # Set TensorFlow to dynamically grow the allocated GPU memory as needed
    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        try:
            # Currently, memory growth needs to be the same across GPUs
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs')
        except RuntimeError as e:
            # Memory growth must be set before the GPUs have been initialized
            print(e)

Do you think this could cause problems from a performance perspective? The TensorFlow version is 2.10.

somas193 commented 5 months ago

> There is no FP16 magic going on.

Would it be correct to say that N2V does not make use of features like Tensor Cores or mixed precision available on NVIDIA GPUs? Is it just using vanilla FP32 computations for training and inference?

jdeschamps commented 5 months ago

Hi @somas193,

Sorry, I missed your previous question.

> Do you think this could cause problems from a performance perspective? The TensorFlow version is 2.10.

I can't say; it is not something we really use (and we don't use TensorFlow much nowadays, especially with recent hardware and libraries). What happens if you turn it off?
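For a comparison run, one alternative to memory growth is simply not calling set_memory_growth at all (TensorFlow then pre-allocates most of the GPU memory), or capping the allocation explicitly. A minimal sketch of the latter using the standard TF 2.x config API; the 40000 MB limit is only an illustrative value, not a recommendation:

    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        # Pre-allocate a fixed amount of memory instead of growing on demand;
        # this must be configured before the GPU is initialized
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=40000)]  # in MB, illustrative
        )

Comparing prediction times with growth enabled, disabled, and capped would show whether the allocation strategy contributes to the gap at all.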

> Would it be correct to say that N2V does not make use of features like Tensor Cores or mixed precision available on NVIDIA GPUs? Is it just using vanilla FP32 computations for training and inference?

Yes, just vanilla FP32. The library was written a few years ago, and does not benefit from the "latest" features that people commonly use nowadays.
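For completeness, if someone wanted to experiment with mixed precision on top of the existing Keras model, the standard TensorFlow mechanism is the global mixed-precision policy. This is not something n2v supports or tests; it is just a sketch of how it would be enabled in plain Keras:

    import tensorflow as tf

    # Enable mixed precision globally: computations run in float16 (using Tensor Cores
    # where available) while variables stay in float32. Must be set before building the model.
    tf.keras.mixed_precision.set_global_policy('mixed_float16')

    print(tf.keras.mixed_precision.global_policy())  # -> "mixed_float16"

Whether the N2V training loop and loss behave correctly under this policy is untested, so any results would need to be validated against the FP32 baseline.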