melloddy / SparseChem

Fast and accurate machine learning models for biochemical applications.

MIT License

Python-level performance improvements in Dataset #1

Closed muellren closed 4 years ago

muellren commented 4 years ago

This PR contains some trivial performance (speed) fixes for Dataset in SparseChem.

It avoids creating a temporary sparse matrix for every element of the minibatch in the __getitem__ method of the Dataset, as used by PyTorch's DataLoader.
The binary labels are transformed from {-1, 1} to {0, 1} once, when the Dataset is created, instead of being transformed anew for every minibatch.

These fixes provide a performance improvement of up to 4x for folded inputs. 99% of the gain is from the elision of sparse temporaries.
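
As an illustration of the two changes described above, here is a minimal sketch, not the actual SparseChem code. It assumes the inputs and labels are stored as SciPy CSR matrices; the class name `SparseRowDataset`, the helper `_row`, and the returned dictionary keys are hypothetical. In real use the class would subclass `torch.utils.data.Dataset`; it is left as a plain class here so the sketch runs without PyTorch.

```python
import numpy as np
import scipy.sparse as sp


class SparseRowDataset:
    """Hypothetical map-style dataset over CSR inputs x and sparse labels y."""

    def __init__(self, x, y):
        self.x = sp.csr_matrix(x)
        # Copy the labels so the caller's matrix is left untouched.
        y = sp.csr_matrix(y, dtype=np.float32, copy=True)
        # Map stored labels {-1, 1} -> {0, 1} once, at construction time.
        # Only explicitly stored entries change; unobserved entries stay absent.
        y.data = (y.data + 1.0) / 2.0
        self.y = y

    def __len__(self):
        return self.x.shape[0]

    @staticmethod
    def _row(m, i):
        # Slice the raw CSR arrays instead of calling m[i], which would
        # build a new temporary sparse matrix for every sample.
        start, end = m.indptr[i], m.indptr[i + 1]
        return m.indices[start:end], m.data[start:end]

    def __getitem__(self, i):
        x_ind, x_data = self._row(self.x, i)
        y_ind, y_data = self._row(self.y, i)
        return {"x_ind": x_ind, "x_data": x_data,
                "y_ind": y_ind, "y_data": y_data}


# Tiny example: 2 samples, 3 features, 3 tasks (0 in y means "no label").
x = np.array([[0, 2, 0], [1, 0, 3]])
y = np.array([[1, 0, -1], [0, -1, 0]])
ds = SparseRowDataset(x, y)
first = ds[0]
```

The slices returned by `_row` are NumPy views into the underlying CSR arrays, so no per-sample copy is made. A collate function passed to the DataLoader would then be responsible for assembling these index and value arrays into a batch.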

jaak-s commented 4 years ago

Nice PR, thanks!