Hi there,
We appreciate your contribution to the UFold project. Upon review, we have found your code to be a valuable addition and have subsequently integrated it into our main branch. Thank you once again for your valuable input.
Thanks.
When running inference on a GPU, the runtime is dominated by the preprocessing step, specifically the `creatmat` function. The current implementation is not vectorized and runs on the CPU. Using worker threads in the dataloader helps, but it is still the main bottleneck. I have therefore implemented a vectorized `creatmat` that can be moved to the GPU, which makes inference much faster (roughly a 10x speedup on some tests I ran).
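As a rough illustration of the idea (a simplified sketch, not the exact code in this diff; the pairing weights AU=2, GC=3, GU=0.8, the AUCG channel order, the one-hot (L, 4) input layout, and the 30-position Gaussian window are assumptions here), the double loop over (i, j) can be replaced by shifted copies of the base pair-score matrix plus cumulative masks that emulate the "break on zero score" behaviour:

```python
import torch

# Pairing weights in A, U, C, G channel order (an assumption for this sketch).
PAIR_WEIGHTS = torch.zeros(4, 4)
PAIR_WEIGHTS[0, 1] = PAIR_WEIGHTS[1, 0] = 2.0   # A-U
PAIR_WEIGHTS[2, 3] = PAIR_WEIGHTS[3, 2] = 3.0   # C-G
PAIR_WEIGHTS[3, 1] = PAIR_WEIGHTS[1, 3] = 0.8   # G-U

def creatmat_vectorized(seq_onehot: torch.Tensor, window: int = 30) -> torch.Tensor:
    """seq_onehot: (L, 4) one-hot tensor; may live on the GPU."""
    L = seq_onehot.shape[0]
    w = PAIR_WEIGHTS.to(device=seq_onehot.device, dtype=seq_onehot.dtype)
    # base[i, j] = pairing score of nucleotides i and j, computed in one matmul.
    base = seq_onehot @ w @ seq_onehot.T
    gauss = torch.exp(-0.5 * torch.arange(window, device=base.device, dtype=base.dtype) ** 2)

    def shifted(add: int, outward: bool) -> torch.Tensor:
        """base[i-add, j+add] (outward) or base[i+add, j-add] (inward), zero-padded."""
        out = torch.zeros_like(base)
        if add >= L:
            return out
        if outward:
            out[add:, : L - add] = base[: L - add, add:]
        else:
            out[: L - add, add:] = base[add:, : L - add]
        return out

    mat = torch.zeros_like(base)
    # Outward pass: accumulate Gaussian-weighted scores until the first zero score.
    alive = torch.ones_like(base, dtype=torch.bool)
    for add in range(window):
        s = shifted(add, outward=True)
        alive = alive & (s > 0)          # emulates the "break" in the original loop
        mat = mat + gauss[add] * s * alive
    # Inward pass: only where the outward pass produced a positive coefficient.
    alive = mat > 0
    for add in range(1, window):
        s = shifted(add, outward=False)
        alive = alive & (s > 0)
        mat = mat + gauss[add] * s * alive
    return mat
```

The only remaining Python loops are over the 30 window offsets, so all O(L^2) work happens in batched tensor ops that can run on the GPU.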
**Speed**

I ran inference on Colab with a T4 GPU and a 2-core (Xeon) CPU on 20 sequences, each of length 600. The numbers below are roughly what I observed across multiple runs; they fluctuate by ~10%.

- Current implementation: 120s
- Vectorized GPU: 12s
One run through `creatmat` now takes about 0.2s, compared to about 4s before.
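Per-call GPU numbers like this need a `torch.cuda.synchronize()` around the call, since kernels are launched asynchronously. A minimal sketch of such a measurement, assuming the `creatmat_vectorized` sketch above is in scope and a CUDA device is available:

```python
import time
import torch

# Random 600-nt one-hot input standing in for a real sequence.
seq_onehot = torch.nn.functional.one_hot(
    torch.randint(0, 4, (600,)), num_classes=4
).float().cuda()

torch.cuda.synchronize()
start = time.perf_counter()
mat = creatmat_vectorized(seq_onehot)   # the sketch from the section above
torch.cuda.synchronize()
print(f"one creatmat call: {time.perf_counter() - start:.3f}s")
```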
**Details**

The implementation performs the exact same operations as before, but the result is not numerically identical due to floating-point error.

Inference should be run with `num_workers=0`: using CUDA inside dataloader worker processes could lead to out-of-memory errors, and `torch.multiprocessing.set_start_method('spawn')` would have to be set - https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
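A minimal sketch of both points above (the `TensorDataset` here is only a placeholder, not UFold's dataset class):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Only needed if num_workers > 0 and CUDA is touched during preprocessing:
    # forked workers cannot initialize CUDA, so the start method must be 'spawn'.
    torch.multiprocessing.set_start_method("spawn", force=True)

    dataset = TensorDataset(torch.randint(0, 4, (20, 600)))  # placeholder data
    # Safest default with GPU-side preprocessing: keep it in the main process.
    loader = DataLoader(dataset, batch_size=1, num_workers=0)
```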
**Comments**

The `creatmat` implementation in `ufold/utils.py` should be faster than the one in `data_generator.py` by 3x-4x (I ran it a few times and got roughly this number). I think this is because the processing is memory bound: when the input `data` is represented as one-hot 64-bit floats, each nucleotide takes up 4 x 64 bits = 32 bytes, whereas if we used strings instead it would be 1 byte per nt.
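A quick way to check the footprint arithmetic (the (L, 4) float64 one-hot layout is my reading of how `data` is stored; numbers are for a 600-nt sequence):

```python
import numpy as np

L = 600
onehot_f64 = np.zeros((L, 4), dtype=np.float64)   # one-hot, 64-bit floats
as_bytes = np.zeros(L, dtype=np.uint8)            # sequence kept as 1-byte codes

print(onehot_f64.nbytes)   # 600 * 4 * 8 = 19200 bytes, i.e. 32 bytes per nt
print(as_bytes.nbytes)     # 600 * 1     = 600 bytes,   i.e. 1 byte per nt
```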