Closed juancq closed 1 month ago
@juancq -- why would you want to set `max_seq_len` to < 2? Clearly this should still not error in this way, but that seems like a very niche use case. Also, is it the case that `max_seq_len` is then less than `min_seq_len`, which might cause other issues (and should fail more gracefully, if so)? Additionally, apologies for the delay in responding.
@juancq just re-read the title of this one. Is there a typo in the detailed description (should it be `min_seq_len`)? I'll presume so given the title and look into this. Oh, and is this on the dev branch or the main branch?
@mmcdermott You are right, I made a spelling mistake in the detailed description. It should be `min_seq_len`. This is on the dev branch. The use case is wanting to generate embeddings even for people with a single event (for downstream tasks).
I have moved on to another dataset, and I have not encountered this bug again. I'm happy to close this for now and reopen if I encounter it again.
Sounds good, thank you!
Setting `min_seq_len` to < 2 in `PytorchDatasetConfig` with task dataframes leads to a `KeyError` here: https://github.com/mmcdermott/EventStreamGPT/blob/2f433a695112fdccb7b28a50cb44b6f39fce4349/EventStream/data/pytorch_dataset.py#L464
where 1717150 is a valid key in `DL_shards.json`.
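A minimal sketch of the failure mode being described (this is illustrative, not the actual EventStreamGPT code; the shard map, sequence lengths, and subject IDs below are invented for the example): if some filtering step still drops subjects with fewer than 2 events regardless of the configured `min_seq_len`, a subject referenced by the task dataframe can be missing from the filtered lookup even though its ID is a perfectly valid key in `DL_shards.json`, producing exactly this kind of `KeyError`.

```python
# Hypothetical reproduction of the reported KeyError (not the real library code).

# Shard map as if loaded from DL_shards.json: subject_id -> shard name.
dl_shards = {1717150: "train/0", 42: "train/1"}

# Per-subject event counts; subject 1717150 has only a single event.
seq_lens = {1717150: 1, 42: 10}

min_seq_len = 1  # the user sets min_seq_len < 2 to keep single-event subjects

# Suppose a filter elsewhere still enforces a hard-coded minimum of 2 events,
# so the two views of "valid subjects" disagree.
filtered = {sid: shard for sid, shard in dl_shards.items() if seq_lens[sid] >= 2}

# Subjects referenced by the task dataframe.
task_subjects = [1717150, 42]

for sid in task_subjects:
    try:
        shard = filtered[sid]  # raises KeyError for 1717150
    except KeyError:
        print(f"KeyError for subject {sid}, even though {sid} is in DL_shards.json")
```

The point of the sketch is only that the `KeyError` does not mean the ID is invalid; it means two components disagree about which subjects survive length filtering.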