**Closed** · kallewoof closed this 1 month ago
I very much appreciate this.
As I recall there's a whole bunch of synchronization points there, as the converter is swapping stuff to system memory like crazy. I think this change likely just shifts those points around a little bit, which would show up in a CPU profiler but may not affect wall time at the end of the day.
The change looks fine, though, so if it does help I'm happy to merge it.
> As I recall there's a whole bunch of synchronization points there, as the converter is swapping stuff to system memory like crazy. I think this change likely just shifts those points around a little bit, which would show up in a CPU profiler but may not affect wall time at the end of the day.
Ultimately it should yield a speedup, since it simply does less work (calling `.item()` once instead of repeatedly).
> The change looks fine, though, so if it does help I'm happy to merge it.
I am looking closer at the `.cuda()` bump and I think I have a few more optimizations for you.

~~For example, the `test_error` function calls `.cuda()` twice, for `x` and `xref`, but I think we can drop the `xref` call completely and instead move the results in `xtest` back to CPU. That way we circumvent moving the `target_states` to cuda completely. Testing that change now. Should have results in a bit.~~ Edit: That was a downgrade. I misunderstood the sizes of these things. :)
Closing in favor of #456, but will reopen if that PR doesn't work out.
Edit: looking closer at the profiler output, the `.cuda()` call makes a rather suspicious jump in time there. The total time is lower, so I think this is still an improvement, but I will do some more checks and then un-draft.
The `.item()` call inside the for loops is quite expensive. It is better to place `rfn_sum` on cuda and do `.item()` once at the end, outside the loop.

Before the patch (one single layer iteration):

With this patch:

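The pattern being discussed can be sketched as follows. This is a minimal illustration, not the actual code from this PR: the function names and the `losses` list are made up, and it runs on CPU tensors so the per-call synchronization cost won't actually show up here, but the call pattern is the same. On a CUDA tensor, each `.item()` forces a device-to-host sync, which is why hoisting it out of the loop helps.

```python
import torch

def loss_sum_slow(losses):
    # Calling .item() inside the loop forces a device->host
    # synchronization on every iteration when the tensors live on GPU.
    total = 0.0
    for l in losses:
        total += l.item()
    return total

def loss_sum_fast(losses):
    # Accumulate on-device instead; a single .item() at the end
    # means at most one synchronization point for the whole loop.
    total = torch.zeros((), dtype=losses[0].dtype)
    for l in losses:
        total += l
    return total.item()

# Illustrative data: five scalar "loss" tensors.
losses = [torch.tensor(float(i)) for i in range(5)]
print(loss_sum_slow(losses), loss_sum_fast(losses))  # 10.0 10.0
```

Both variants compute the same sum; the difference is only in how often the host has to wait for the device.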