Open RoggeOhta opened 4 weeks ago
When computing the loss, I would expect the value not to be part of a larger tensor but rather to be computed as a sum over other tensors, so I wouldn't expect the issue you mention to apply there. Instead, I think this is likely caused by CUDA being an asynchronous API, so all the time appears to be spent when retrieving the data rather than when doing the ops themselves. Maybe you can try running with `CUDA_LAUNCH_BLOCKING=1` on your tinyllama example to see if it still appears that the time is spent retrieving the loss?
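For reference, that variable just needs to be set in the environment of the process. A minimal sketch (the `cargo run` line is a hypothetical invocation; adjust the example name to your setup):

```shell
# CUDA_LAUNCH_BLOCKING=1 makes each kernel launch synchronous, so time
# shows up on the op itself instead of at the next device-to-host copy.
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
# cargo run --release --example tinyllama   # run your workload here
```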
I think this will be an issue because it is counterintuitive. I want to convert a single-value tensor (from my point of view that tensor is a single value, and I may not know where it originally came from), but it actually copies all the underlying storage to the CPU and then applies the layout to give me the answer. This is inefficient and causes real performance trouble.
I agree but I think that might be a separate issue compared to what you describe on the loss, I'm just wondering if the issue on the loss has the same origin or a different cause.
Oh I see, the description may look a bit hazy to others. I mentioned the loss problem because I found this issue while investigating why my training loop was so slow. It came down to converting the loss to f32: converting a single-value tensor to f32 was unbelievably taking ~300ms. I then spent a few days reading the candle source code and produced a fix.
Gotcha thanks, I'll have a look at the PR.
any updates?
Bug description: when converting a CUDA tensor that is indexed out of another large tensor into a CPU scalar, the operation takes an unusually long time.
Minimal reproducible example:
In the most recent crate version, v0.5.1, the line `let scalar = time_it(|| cell.to_scalar::<f32>().unwrap());` takes essentially the same amount of time as the line `let tensor = time_it(|| tensor.to_device(&cuda_dev).unwrap());`.
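Here `time_it` is a user-side helper, not part of candle; a minimal sketch of what it could look like, assuming it just wraps a closure with wall-clock timing:

```rust
use std::time::Instant;

// Hypothetical stand-in for the `time_it` helper used in the repro:
// runs a closure, prints the elapsed time, and returns the result.
fn time_it<T>(f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("elapsed: {:?}", start.elapsed());
    result
}

fn main() {
    let value = time_it(|| (0..1000u64).sum::<u64>());
    assert_eq!(value, 499500);
}
```

Note that with an asynchronous CUDA API, a wrapper like this attributes queued kernel time to whichever call forces a synchronization, which is exactly why `CUDA_LAUNCH_BLOCKING=1` was suggested above.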
In my investigation, the problem is that candle's tensors use a storage + layout approach to represent slices. This approach is effective most of the time, but it's inefficient when copying a small slice from CUDA memory to CPU memory, because it still copies the full storage and only then uses the layout to offset into the slice you want. This creates massive latency (~300ms before my attempted fix, ~17us after).
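The copy-amplification described above can be illustrated with plain `Vec`s (an illustrative sketch of the storage + layout idea, not candle's actual code; `View`, `full_copy`, and `slice_copy` are hypothetical names):

```rust
// A tensor "view" modeled as a backing buffer plus an offset/length
// layout. Copying the view by cloning the whole storage moves every
// element; copying only the window the layout describes moves just
// the slice -- the difference the fix is about.
struct View {
    storage: Vec<f32>, // full backing buffer
    offset: usize,     // start of the viewed window
    len: usize,        // number of viewed elements
}

impl View {
    // Wasteful path: materialize the full storage, then index into it.
    fn full_copy(&self) -> Vec<f32> {
        let copy = self.storage.clone(); // copies all elements
        copy[self.offset..self.offset + self.len].to_vec()
    }

    // Narrow path: copy only the window the layout points at.
    fn slice_copy(&self) -> Vec<f32> {
        self.storage[self.offset..self.offset + self.len].to_vec()
    }
}

fn main() {
    let v = View {
        storage: (0..1_000).map(|i| i as f32).collect(),
        offset: 500,
        len: 1,
    };
    // Both paths yield the same value, but full_copy touched 1000
    // elements to extract 1 -- the device-to-host analogue is the
    // ~300ms vs ~17us gap observed above.
    assert_eq!(v.full_copy(), v.slice_copy());
    assert_eq!(v.slice_copy(), vec![500.0]);
}
```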