mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for "event streams": sequences of continuous-time, multivariate events with complex internal dependencies.
https://eventstreamml.readthedocs.io/en/latest/
MIT License

Setting min_seq_len in PytorchDatasetConfig to < 2 with task dataframes leads to KeyError #109

Closed by juancq 1 month ago

juancq commented 4 months ago

Setting min_seq_len to < 2 in PytorchDatasetConfig with task dataframes leads to a KeyError here: https://github.com/mmcdermott/EventStreamGPT/blob/2f433a695112fdccb7b28a50cb44b6f39fce4349/EventStream/data/pytorch_dataset.py#L464

KeyError: 1717150

where 1717150 is a valid key in DL_shards.json.
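
For reference, a minimal sketch of a configuration that could trigger this. The import paths, the extra constructor arguments (save_dir, task_df_name, split), and the example values are assumptions for illustration, not taken from the report above:

```python
# Hypothetical reproduction sketch; import paths and constructor arguments
# are assumptions for illustration, not confirmed from this report.
from EventStream.data.config import PytorchDatasetConfig
from EventStream.data.pytorch_dataset import PytorchDataset

config = PytorchDatasetConfig(
    save_dir="/path/to/processed/dataset",  # hypothetical processed-dataset directory
    task_df_name="my_task",                 # hypothetical task dataframe name
    min_seq_len=1,                          # < 2, as described above
    max_seq_len=256,
)

# Constructing the dataset for a split is where the lookup against the shard
# index (DL_shards.json) would surface the error, e.g. KeyError: 1717150.
dataset = PytorchDataset(config, split="train")
```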

mmcdermott commented 3 months ago

@juancq -- why would you want to set max_seq_len to < 2? This should still not error in this way, of course, but it seems like a very niche use case. Also, is max_seq_len then less than min_seq_len, which might cause other issues (and should fail more gracefully, if so)? Apologies for the delay in responding.
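
On the point about failing more gracefully, one illustrative option (not the config's actual validation logic) is to check the two bounds against each other at construction time:

```python
from dataclasses import dataclass


# Illustrative stand-in for the sequence-length fields of PytorchDatasetConfig;
# the real config class has many more options and its own validation.
@dataclass
class SeqLenConfig:
    min_seq_len: int = 2
    max_seq_len: int = 256

    def __post_init__(self):
        # Fail fast with a clear message instead of a downstream KeyError.
        if self.min_seq_len < 1:
            raise ValueError(f"min_seq_len must be >= 1; got {self.min_seq_len}")
        if self.max_seq_len < self.min_seq_len:
            raise ValueError(
                f"max_seq_len ({self.max_seq_len}) must be >= min_seq_len ({self.min_seq_len})"
            )
```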

mmcdermott commented 3 months ago

@juancq just re-read the title of this one. Is there a typo in the detailed description (should it be min_seq_len)? I'll presume so given the title and look into this. Oh, and is this on the dev branch or the main branch?

juancq commented 3 months ago

@mmcdermott You are right, I made a spelling mistake in the detailed description; it should be min_seq_len. This is on the dev branch. The use case is wanting to generate embeddings even for people with a single event (for downstream tasks).

juancq commented 1 month ago

I have moved on to another dataset, and I have not encountered this bug again. I'm happy to close this for now and reopen if I encounter it again.

mmcdermott commented 1 month ago

Sounds good, thank you!