mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for modelling "event streams": sequences of continuous time, multivariate events with complex internal dependencies.
https://eventstreamml.readthedocs.io/en/latest/
MIT License
102 stars 16 forks source link

Add streaming flag for calls to polars collect #46

Closed juancq closed 1 year ago

juancq commented 1 year ago

I added streaming to calls to .collect in dataset_polars.py to improve performance. It would be helpful to have a boolean flag to set streaming in collect.

juancq commented 1 year ago

@mmcdermott I'm happy to submit an implementation of this. Just need to know your design preference for where to define this type of configuration setting.

mmcdermott commented 1 year ago

Thanks @juancq -- I'll definitely take you up on that, but I'm not 100% sure yet what the design should look like here. My tentative thinking is that this should (by default) always be true in the dataset_polars.py class, as (I don't think) that streaming=True will hurt performance for datasets that can fit in memory. If that is accurate, I think the simplest way to do this is to have, rather than a configuration variable, to have a class option (e.g., a class variable defined in Dataset that dictates this behavior, so that if someone wants to deviate from that they just need to overwrite that class variable prior to instantiating a DatasetObject, but so that we don't clutter the config with parameters that should almost always be set in a particular way.

If it is something that would be controlled more regularly, I would suggest having that class variable be settable via an environment variable as well. I'm hesitant to add stuff to the Dataset Configuration file yet without having a better understanding of how to handle different data processing engines (e.g., a PySpark pipeline variant instead of Polars), but am also open to that as well.

Does the approach I outlined here sound reasonable to you? I'd also think we can do something similar for the parquet format issue as well (https://github.com/mmcdermott/EventStreamGPT/issues/49)

juancq commented 1 year ago

I agree with your approach, it is reasonable at this stage, the simplest to implement, and the less disruptive to the existing codebase.

mmcdermott commented 1 year ago

Fantastic; feel free to make a PR in line with that approach, then. And thank you in advance; both for catching this issue and for contributing a PR! And, to be clear, make the PR off of the dev branch and pushing to that branch, if you please.

On Wed, Aug 2, 2023, 10:42 PM Juan Quiroz Aguilera @.***> wrote:

I agree with your approach, it is reasonable at this stage, the simplest to implement, and the less disruptive to the existing codebase.

— Reply to this email directly, view it on GitHub https://github.com/mmcdermott/EventStreamGPT/issues/46#issuecomment-1663212823, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADS5XZB6YIDWDZ3TE7WAOTXTMFZDANCNFSM6AAAAAA26HLSWM . You are receiving this because you were mentioned.Message ID: @.***>