I'm running CUDA 10.1 with the latest versions of TensorFlow and PyTorch, on a Tesla K80 and a 1080 Ti.
I'm running the stable version (0.1.1 -- I was unable to get the ESPnet version running) with a patched train.py implementing data_parallel_workaround() from master.
The model seems to be training, but very inefficiently. Watching GPU usage with nvidia-smi, I see only intermittent GPU-Util spikes, with CPU utilization at about 25% (8 cores @ 4.8 GHz).
hparams that may be relevant:
# Data loader
pin_memory=True,
num_workers=12,
# Training:
batch_size=12,
Do I just need to dramatically increase num_workers to keep the GPUs fed? GPU temps look fine and the data is on a very fast SSD, so I'm not sure what I'm doing wrong.
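For context, here's a minimal sketch of how I understand those hparams feed a PyTorch DataLoader -- the dataset and tensor shapes below are placeholders, not the repo's actual classes, just to illustrate what bumping num_workers changes:

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Toy dataset standing in for the project's own; the point is only how
# num_workers / pin_memory / batch_size shape the input pipeline.
class ToyAudioDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # Stand-in for per-item CPU-side feature loading/extraction.
        return torch.randn(80, 200), torch.randn(8000)

loader = DataLoader(
    ToyAudioDataset(),
    batch_size=12,       # matches the hparam above
    shuffle=True,
    num_workers=12,      # worker processes preparing batches in parallel
    pin_memory=True,     # page-locked host memory speeds up CPU->GPU copies
    drop_last=True,
)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for mels, wavs in loader:
        # non_blocking=True lets the host-to-device copy overlap with compute
        mels = mels.to(device, non_blocking=True)
        wavs = wavs.to(device, non_blocking=True)
```

(On Windows -- which I'm on, given python.exe -- the loop has to live under the `if __name__ == "__main__":` guard or the worker processes won't spawn properly.)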
FWIW, here's what I see in the python.exe stack: ![image](https://user-images.githubusercontent.com/949444/84379293-2d8a0a00-abab-11ea-931f-79700b330cc3.png)