Setting min_seq_len=1 in PytorchDatasetConfig with task dataframes leads to ragged tensors index error

mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for modelling "event streams": sequences of continuous time, multivariate events with complex internal dependencies.

https://eventstreamml.readthedocs.io/en/latest/

MIT License

102 stars 16 forks source link

Setting min_seq_len=1 in PytorchDatasetConfig with task dataframes leads to ragged tensors index error #118

Open juancq opened 3 months ago

juancq commented 3 months ago

Branch: dev

Setting min_seq_len to 1 in PytorchDatasetConfig leads to the following error:

File ragged_numpy.py, line 818, in vstack
return cls.concatenate([T.unsqueeze(0) for T in tensor_dicts])
File ragged_numpy.py, line 912, in concatenate
(out_tensors[bounds_key], T.tensors[bounds_key] + out_tensors[bounds_key][-1])

IndexError: index -1 is out of bounds for axis 0 with size 0

mmcdermott commented 2 months ago

@juancq Why do you want to have min_seq_len=1? I'm not suggesting that we shouldn't solve this issue -- we should, but in what circumstance would it be useful to try to model patients for whom you literally only have one observation?

I suppose you are using this in a strictly downstream task setting, where making a prediction after only a single observation is reasonable?

juancq commented 2 months ago

@mmcdermott yes, it is for downstream tasks where only a single observation is available and one wants to make a prediction. The nested_ragged_tensors could use with additional checks (but not the one I have suggested).

Digging into the code, I think a single observation ends with st=end=0 here: https://github.com/mmcdermott/EventStreamGPT/blob/ce9e2c3e80c00a79ff6ee53deaee9ec6ca6f2669/EventStream/data/pytorch_dataset.py#L497-L499

which leads to an empty list. Indexing with st:end+1 avoids the index error when min_seq_len=1, but I need to double check whether that's correct

mmcdermott commented 2 months ago

@Oufattole -- the code in MEDS-Torch is based on this code from ESGPT, and you've been looking at the sub-selection more recently than I. Is it possible this should be st:end+1 as @juancq postulates?

mmcdermott commented 2 months ago

Also related, in that when the st:end sub-slicing is integrated into NRTs directly we need to ensure it is correct: https://github.com/mmcdermott/nested_ragged_tensors/issues/9