Hey @EvenOldridge, I stumbled upon your post about cudf dataloaders speeding up training a long time ago... and recently got around to actually trying my hand at it, so thanks for introducing me to the idea!
I'm just curious, but is there a specific reason you implemented the batch dataloaders the way you did in this repo instead of using a custom sampler + a regular DataLoader with 0 workers? I wrote out a quick sketch of the implementation I'm thinking of here: https://gist.github.com/NegatioN/1f63c3a79dfe13b183d413123d37d4fa
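To make the question concrete, here's a minimal sketch of the kind of setup I mean (the names `ContiguousBatchSampler` and `SliceDataset` are just placeholders I made up, and I'm assuming the data is already loaded into a single tensor up front):

```python
import torch
from torch.utils.data import Dataset, Sampler, DataLoader


class ContiguousBatchSampler(Sampler):
    """Yields (start, stop) index pairs so each batch is one contiguous slice."""

    def __init__(self, n_rows, batch_size, shuffle=True):
        self.n_rows = n_rows
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        starts = torch.arange(0, self.n_rows, self.batch_size)
        if self.shuffle:
            # Shuffle the order of the batches, not the rows inside them,
            # so every read stays contiguous.
            starts = starts[torch.randperm(len(starts))]
        for start in starts.tolist():
            yield (start, min(start + self.batch_size, self.n_rows))

    def __len__(self):
        return (self.n_rows + self.batch_size - 1) // self.batch_size


class SliceDataset(Dataset):
    """__getitem__ receives a (start, stop) pair and returns a full batch."""

    def __init__(self, features, labels):
        self.features = features  # e.g. one pre-loaded (possibly GPU) tensor
        self.labels = labels

    def __getitem__(self, bounds):
        start, stop = bounds
        # One contiguous read per batch instead of batch_size single-row reads.
        return self.features[start:stop], self.labels[start:stop]

    def __len__(self):
        return len(self.features)


features = torch.randn(10_000, 16)
labels = torch.randint(0, 2, (10_000,))
ds = SliceDataset(features, labels)

# batch_size=None disables PyTorch's automatic per-row batching, so each
# (start, stop) pair from the sampler is passed straight to __getitem__,
# and num_workers=0 keeps everything in the main process.
loader = DataLoader(
    ds,
    sampler=ContiguousBatchSampler(len(ds), batch_size=512),
    batch_size=None,
    num_workers=0,
)

for x, y in loader:
    pass  # x arrives as one (<=512, 16) slice per iteration
```

As far as I understand, the `num_workers=0` part is what you want anyway once the tensors live on the GPU, since there's no benefit to shipping them through worker processes.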
I understand that your implementation might already have changed significantly since you mentioned integrating it with fast.ai, but I'm curious whether you ruled this approach out for a specific reason that I can't see at the moment. I would think it has the same performance characteristics?
Edit: The biggest difference might be that we can grab each batch as a single read from contiguous memory? Did you test how large the impact of this was?

/Joakim