Hi there,
We appreciate your contribution to the UFold project. Upon review, we have found your code to be a valuable addition and have subsequently integrated it into our main branch. Thank you once again for your valuable input.
Thanks.
When running inference on a GPU, the runtime is dominated by the preprocessing step, specifically the `creatmat` function. The current implementation is not vectorized and runs on the CPU. Using worker threads in the dataloader helps, but it is still the main bottleneck. I have therefore implemented a vectorized `creatmat` that can be moved to the GPU, which makes inference much faster (roughly a 10x speedup on some tests I ran).
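As a rough illustration of the idea (a simplified sketch, not the exact code in this diff; the pairing weights AU=2, GC=3, GU=0.8, the AUCG channel order, the one-hot (L, 4) input layout, and the 30-position Gaussian window are assumptions here), the double loop over (i, j) can be replaced by shifted copies of the base pair-score matrix plus cumulative masks that emulate the "break on zero score" behaviour:

```python
import torch

# Pairing weights in A, U, C, G channel order (an assumption for this sketch).
PAIR_WEIGHTS = torch.zeros(4, 4)
PAIR_WEIGHTS[0, 1] = PAIR_WEIGHTS[1, 0] = 2.0   # A-U
PAIR_WEIGHTS[2, 3] = PAIR_WEIGHTS[3, 2] = 3.0   # C-G
PAIR_WEIGHTS[3, 1] = PAIR_WEIGHTS[1, 3] = 0.8   # G-U

def creatmat_vectorized(seq_onehot: torch.Tensor, window: int = 30) -> torch.Tensor:
    """seq_onehot: (L, 4) one-hot tensor; may live on the GPU."""
    L = seq_onehot.shape[0]
    w = PAIR_WEIGHTS.to(device=seq_onehot.device, dtype=seq_onehot.dtype)
    # base[i, j] = pairing score of nucleotides i and j, computed in one matmul.
    base = seq_onehot @ w @ seq_onehot.T
    gauss = torch.exp(-0.5 * torch.arange(window, device=base.device, dtype=base.dtype) ** 2)

    def shifted(add: int, outward: bool) -> torch.Tensor:
        """base[i-add, j+add] (outward) or base[i+add, j-add] (inward), zero-padded."""
        out = torch.zeros_like(base)
        if add >= L:
            return out
        if outward:
            out[add:, : L - add] = base[: L - add, add:]
        else:
            out[: L - add, add:] = base[add:, : L - add]
        return out

    mat = torch.zeros_like(base)
    # Outward pass: accumulate Gaussian-weighted scores until the first zero score.
    alive = torch.ones_like(base, dtype=torch.bool)
    for add in range(window):
        s = shifted(add, outward=True)
        alive = alive & (s > 0)          # emulates the "break" in the original loop
        mat = mat + gauss[add] * s * alive
    # Inward pass: only where the outward pass produced a positive coefficient.
    alive = mat > 0
    for add in range(1, window):
        s = shifted(add, outward=False)
        alive = alive & (s > 0)
        mat = mat + gauss[add] * s * alive
    return mat
```

The only remaining Python loops are over the 30 window offsets, so all O(L^2) work happens in batched tensor ops that can run on the GPU.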
**Speed**

I ran inference on Colab with a T4 GPU and a 2-core (Xeon) CPU on 20 sequences, each of length 600. The numbers below are roughly what I observed across multiple runs; they fluctuate by ~10%.

- Current implementation: 120s
- Vectorized GPU: 12s
One run through `creatmat` now takes about 0.2s, compared to about 4s before.
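Per-call GPU numbers like this need a `torch.cuda.synchronize()` around the call, since kernels are launched asynchronously. A minimal sketch of such a measurement, assuming the `creatmat_vectorized` sketch above is in scope and a CUDA device is available:

```python
import time
import torch

# Random 600-nt one-hot input standing in for a real sequence.
seq_onehot = torch.nn.functional.one_hot(
    torch.randint(0, 4, (600,)), num_classes=4
).float().cuda()

torch.cuda.synchronize()
start = time.perf_counter()
mat = creatmat_vectorized(seq_onehot)   # the sketch from the section above
torch.cuda.synchronize()
print(f"one creatmat call: {time.perf_counter() - start:.3f}s")
```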
**Details**

The implementation performs the exact same operations as before, but the result is not numerically identical due to floating-point error.

Inference should be run with `num_workers=0`: using CUDA inside dataloader worker processes could lead to out-of-memory errors, and `torch.multiprocessing.set_start_method('spawn')` would have to be set - https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
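A minimal sketch of both points above (the `TensorDataset` here is only a placeholder, not UFold's dataset class):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Only needed if num_workers > 0 and CUDA is touched during preprocessing:
    # forked workers cannot initialize CUDA, so the start method must be 'spawn'.
    torch.multiprocessing.set_start_method("spawn", force=True)

    dataset = TensorDataset(torch.randint(0, 4, (20, 600)))  # placeholder data
    # Safest default with GPU-side preprocessing: keep it in the main process.
    loader = DataLoader(dataset, batch_size=1, num_workers=0)
```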
**Comments**

The `creatmat` implementation in `ufold/utils.py` should be faster than the one in `data_generator.py` by 3x-4x (I ran it a few times and got roughly this number). I think this is because the processing is memory bound: when the input `data` is represented as one-hot 64-bit floats, each nucleotide takes up 4 x 64 bits = 32 bytes, whereas if we used strings instead it would be 1 byte per nt.
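A quick way to check the footprint arithmetic (the (L, 4) float64 one-hot layout is my reading of how `data` is stored; numbers are for a 600-nt sequence):

```python
import numpy as np

L = 600
onehot_f64 = np.zeros((L, 4), dtype=np.float64)   # one-hot, 64-bit floats
as_bytes = np.zeros(L, dtype=np.uint8)            # sequence kept as 1-byte codes

print(onehot_f64.nbytes)   # 600 * 4 * 8 = 19200 bytes, i.e. 32 bytes per nt
print(as_bytes.nbytes)     # 600 * 1     = 600 bytes,   i.e. 1 byte per nt
```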