mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

Any plan to support bfloat16? #276

Closed · pycoco closed this issue 5 months ago

zhijian-liu commented 9 months ago

@ys-2020, could you please take a look at this issue when you have time? Thanks!

ys-2020 commented 9 months ago

Hi @pycoco, thanks for your interest. bfloat16 is typically used for training jobs. However, we have launched many training jobs and found that float16 does not affect accuracy. That is why we do not support bfloat16 at the moment.

If you find any job where bfloat16 gives better training results, please let us know and we will plan to implement it.

pycoco commented 9 months ago

@ys-2020 Thanks for your quick reply and great work. I found that models trained with float16 encounter NaN loss in certain scenarios, possibly caused by underflow/overflow. So supporting bfloat16 for training would be a good choice.
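
For reference, the dynamic-range gap behind the underflow/overflow concern can be checked directly with standard PyTorch (not torchsparse-specific): bfloat16 keeps the float32 exponent range at the cost of mantissa precision, while float16 tops out around 6.5e4.

```python
# Compare the numeric ranges of float16, bfloat16, and float32.
import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>16}: max={info.max:.3e}, smallest normal={info.tiny:.3e}")
```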

ys-2020 commented 9 months ago

@pycoco Hi! Thank you for the feedback. Can you provide more details about the "certain scenarios"? We have launched a lot of training jobs on segmentation/detection tasks across many different datasets and never hit NaN losses. (Also, you can switch to fp32 as a backup plan for now.)
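
For reference, a minimal sketch (plain PyTorch, not torchsparse-specific) of fp16 autocast with dynamic loss scaling, which is the standard way to keep fp16 training from producing NaN losses due to gradient underflow; the dense toy model and hyperparameters here are placeholders, not the actual sparse-conv setup.

```python
import torch
import torch.nn as nn

# A dense toy model stands in for a sparse-conv network; requires a CUDA GPU.
device = "cuda"
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    x = torch.randn(32, 64, device=device)
    y = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients; skips the update if inf/NaN appears
    scaler.update()                # adjusts the loss scale for the next iteration
```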

pycoco commented 9 months ago

@ys-2020 In my scenario, I use VoxelNeXt with voxel size [0.05, 0.05, 0.15], point cloud range [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5], and my own dataset. FP32 trains normally, but the training time is too long. I actually use spconv right now; maybe I should adapt the model to your library and give it a try (though I don't think the library is the cause of the problem).
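
For reference, a quick back-of-the-envelope check of the sparse grid implied by that voxel size and range (numbers taken from the comment above):

```python
# Grid size = (range_max - range_min) / voxel_size along each axis.
point_cloud_range = [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5]  # [x_min, y_min, z_min, x_max, y_max, z_max]
voxel_size = [0.05, 0.05, 0.15]

grid_size = [
    round((point_cloud_range[i + 3] - point_cloud_range[i]) / voxel_size[i])
    for i in range(3)
]
print(grid_size)  # [4000, 4000, 40]
```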