mehta-lab / microDL

3D virtual staining with 2D and 2.5D U-Nets
BSD 3-Clause "New" or "Revised" License

Improve data loading performance #232

Closed ziw-liu closed 1 year ago

ziw-liu commented 1 year ago

Problem

Data loading during training is slow. Most of the time is spent on I/O and augmentation.

Performance tweaks

Behavior changes and fixes

Result

After enabling caching, 64 data-loading workers can saturate an A100 GPU with $B \times C \times D \times W \times H = 32 \times 2 \times 5 \times 512 \times 512$ batches. Training on 300 FOVs (80/20 split) now takes 5 min/epoch.
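To get a feel for the numbers above, here is a rough back-of-the-envelope sketch. The batch shape, FOV count, split, and epoch time are taken from this description; the 4 bytes per element (float32) is an assumption and is not stated in this PR.

```python
from math import prod

# Batch shape from the result above: B x C x D x W x H
batch_shape = (32, 2, 5, 512, 512)
BYTES_PER_ELEMENT = 4  # assumption: float32 samples; not stated in the PR

elements_per_batch = prod(batch_shape)
batch_mib = elements_per_batch * BYTES_PER_ELEMENT / 2**20

# 300 FOVs with an 80/20 train/validation split
total_fovs = 300
train_fovs = total_fovs * 80 // 100
val_fovs = total_fovs - train_fovs

# 5 min/epoch expressed as epochs per hour
minutes_per_epoch = 5
epochs_per_hour = 60 / minutes_per_epoch

print(f"elements per batch: {elements_per_batch:,}")
print(f"batch size (float32 assumed): {batch_mib:.0f} MiB")
print(f"train/val FOVs: {train_fovs}/{val_fovs}")
print(f"epochs per hour: {epochs_per_hour:.0f}")
```

Under the float32 assumption each batch is a few hundred MiB, which gives a sense of how much data the workers have to deliver per step to keep the GPU busy.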

Epoch/hour:

[image]

I have not investigated how the amount of system RAM affects file system caching performance. During the above test, a very large amount of RAM ($1536 \times 0.5 = 768$ GB; the decompressed dataset is 500 GB) was available for ZFS caching.
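The RAM arithmetic above can be written as a small check. This is only a sketch: `fits_in_cache` is a hypothetical helper, the 512 GB machine is a made-up comparison point, and whether a dataset fits in the cache says nothing by itself about measured loading speed.

```python
def fits_in_cache(total_ram_gb: float, dataset_gb: float,
                  cache_fraction: float = 0.5) -> bool:
    """Return True if the dataset is no larger than the RAM available for caching.

    Hypothetical helper; cache_fraction=0.5 mirrors the factor used in the test above.
    """
    return dataset_gb <= total_ram_gb * cache_fraction


# Numbers from the test above
total_ram_gb = 1536
dataset_gb = 500  # decompressed dataset size
cache_gb = total_ram_gb * 0.5
headroom_gb = cache_gb - dataset_gb

print(f"cache: {cache_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
print("fits on test machine:", fits_in_cache(total_ram_gb, dataset_gb))

# Hypothetical smaller machine, for comparison only
print("fits with 512 GB RAM:", fits_in_cache(512, dataset_gb))
```

On the test machine the whole decompressed dataset fits in the cache with room to spare, so the result above may not carry over to machines where it does not.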

ziw-liu commented 1 year ago

Debug why TensorBoard is not visible from the script that generates plots and movies.

The load_event_file function does not exist in the namespace. Importing valid attributes works.
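One way to narrow down this kind of import problem is to check which names a module actually exposes before importing them. The sketch below is an illustration only: `missing_names` is a hypothetical helper, and the standard-library `json` module stands in for the real module, which is not named in this thread.

```python
import importlib


def missing_names(module_name: str, names: list[str]) -> list[str]:
    """Return the names that are not attributes of the given module.

    Hypothetical debugging helper, not part of microDL or TensorBoard.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Stand-in example: `json` has `loads`, but nothing called `load_event_file`
absent = missing_names("json", ["loads", "load_event_file"])
print(absent)
```

Running the same check against the module that the plotting script imports from would show whether the name is truly absent or whether the import fails for another reason.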

We can implement this feature in another PR.