mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for modelling "event streams": sequences of continuous time, multivariate events with complex internal dependencies.
https://eventstreamml.readthedocs.io/en/latest/
MIT License

Replace use of list for cached_data and subject_ids in pytorch dataset with polars objects #74

Closed: juancq closed this 7 months ago

juancq commented 8 months ago

This fixes the increased memory consumption when using multiple PyTorch dataloaders (issue #73). It also reduced the starting memory usage in my test case from 30 GB to 12 GB.

Removing this line makes all the difference: https://github.com/mmcdermott/EventStreamGPT/blob/b10e7415af1e9ea9517dfb52c343ae8155c40674/EventStream/data/pytorch_dataset.py#L309

Editing the following line didn't make much of a difference, but I edited it for consistency: https://github.com/mmcdermott/EventStreamGPT/blob/b10e7415af1e9ea9517dfb52c343ae8155c40674/EventStream/data/pytorch_dataset.py#L306
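
For readers wondering why the container type matters here, the following is a minimal, standard-library-only sketch, not the code from this PR. It assumes CPython and uses `array.array` as a stand-in for a columnar buffer such as a polars Series; the helper names `list_footprint` and `packed_footprint` are hypothetical. A Python list stores a pointer to a separate object for every element, and each of those objects carries its own reference count, whereas a packed buffer stores the raw values contiguously. A commonly cited explanation for memory growth with forked dataloader workers is that reading list elements updates those reference counts and so touches copy-on-write pages; whether that is the exact cause in issue #73 is an assumption.

```python
import sys
from array import array


def list_footprint(values):
    """Approximate bytes used by a list plus the element objects it points to."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def packed_footprint(values):
    """Bytes used when the same integers are packed into one contiguous buffer."""
    # "q" is a signed 64-bit integer; array.array stands in for a columnar
    # container here (an assumption made for illustration only).
    return sys.getsizeof(array("q", values))


# Hypothetical subject IDs, large enough that CPython allocates one int
# object per value instead of reusing its small-integer cache.
n = 100_000
subject_ids = list(range(1_000_000, 1_000_000 + n))

as_list = list_footprint(subject_ids)
as_packed = packed_footprint(subject_ids)

print(f"list of int objects: {as_list:,} bytes")
print(f"packed int64 buffer: {as_packed:,} bytes")
```

The exact byte counts depend on the Python version and platform, so none are quoted here. The sketch only shows that the per-element object overhead disappears once the values live in a single buffer.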

mmcdermott commented 8 months ago

Hey @juancq -- how does this impact the iteration speed through the dataloader, though? The motivation to convert things to lists was that with raw polars objects, the base iteration speed was much slower.

juancq commented 8 months ago

@mmcdermott I saw no noticeable difference in the iteration speed.

mmcdermott commented 7 months ago

@juancq I'm working on a different solution for this problem that also addresses some other issues. I'll tag you in that other PR. It's not 100% ready but it is close. It is a larger change, but I'll explain more there.