ziw-liu closed 1 year ago
Debug why tensorboard is not visible from the script that generates plots and movies.
The `load_event_file` function does not exist in the namespace. Importing valid attributes works.
We can implement this feature in another PR.
Problem
Data loading during training was slow. Most of the time was spent on I/O and augmentation.
Performance tweaks
- … `/tmp` (or Windows equivalent) if a dataset of the same name does not exist

Behavior changes and fixes
- … `architecture` parameter in config files
- … `data.caching` parameter (default to `false`)
- … `model.log_num_samples` parameter (default to `8`)
- … `model.log_num_samples` batches

Result
After enabling caching, 64 data-loading workers can saturate an A100 GPU with $B \times C \times D \times W \times H = 32 \times 2 \times 5 \times 512 \times 512$ batches. Training on 300 FOVs (80/20 split) now takes 5 min/epoch.
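For illustration, the caching step listed under Performance tweaks could be sketched roughly as below. This is a minimal sketch, not the code in this PR: the function name `cache_dataset` and its `caching` argument are hypothetical stand-ins for the `data.caching` option, and it assumes the dataset is a directory (for example a Zarr store) that can be copied as a tree.

```python
import shutil
import tempfile
from pathlib import Path


def cache_dataset(source: Path, caching: bool = False) -> Path:
    """Return the path to read from (hypothetical helper, not the PR's API).

    When caching is enabled, the dataset directory is copied into the
    system temporary directory unless a dataset of the same name is
    already there.
    """
    if not caching:
        return source
    # gettempdir() is typically /tmp on Linux and the platform
    # equivalent on Windows.
    cached = Path(tempfile.gettempdir()) / source.name
    if not cached.exists():
        # Assumption: the dataset is a directory tree.
        shutil.copytree(source, cached)
    return cached
```

Note that a check by name alone reuses whatever is already cached, so a stale copy would not be refreshed if the source dataset changes.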
Epoch/hour:
I have not investigated the impact of system RAM on file system caching performance. During the above test, $1536 \times 0.5 = 768$ GB of RAM was available for ZFS caching, which is more than the 500 GB decompressed dataset.
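The figures quoted in this section can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 4 bytes per element (float32) is an assumption not stated above, as is reading the 80/20 split as train/validation.

```python
import math

# Batch shape from the result above: B x C x D x W x H
batch_shape = (32, 2, 5, 512, 512)
bytes_per_element = 4  # assumption: float32 tensors

elements_per_batch = math.prod(batch_shape)
batch_mib = elements_per_batch * bytes_per_element / 2**20

# 300 FOVs with an 80/20 split (assumed to be train/validation)
n_fovs = 300
n_train = n_fovs * 80 // 100
n_val = n_fovs - n_train

# 5 min/epoch expressed as epochs per hour
epochs_per_hour = 60 / 5

# RAM available for ZFS caching versus decompressed dataset size, in GB
cache_ram_gb = 1536 * 0.5
dataset_gb = 500
headroom_gb = cache_ram_gb - dataset_gb

print(f"elements per batch: {elements_per_batch}")
print(f"batch size: {batch_mib} MiB")
print(f"train/val FOVs: {n_train}/{n_val}")
print(f"epochs per hour: {epochs_per_hour}")
print(f"cache headroom: {headroom_gb} GB")
```

Under these assumptions the whole decompressed dataset fits in the RAM available for caching, so the timings above may not carry over to machines where it does not.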