Open juancq opened 3 months ago
@juancq Why do you want to have min_seq_len=1? I'm not suggesting that we shouldn't solve this issue -- we should, but in what circumstance would it be useful to try to model patients for whom you literally only have one observation?
I suppose you are using this in a strictly downstream task setting, where making a prediction after only a single observation is reasonable?
@mmcdermott yes, it is for downstream tasks where only a single observation is available and one wants to make a prediction. The nested_ragged_tensors package could do with additional checks (but not the one I have suggested).
Digging into the code, I think a single observation ends with st=end=0 here: https://github.com/mmcdermott/EventStreamGPT/blob/ce9e2c3e80c00a79ff6ee53deaee9ec6ca6f2669/EventStream/data/pytorch_dataset.py#L497-L499
which leads to an empty list. Indexing with st:end+1 avoids the index error when min_seq_len=1, but I need to double check whether that's correct.
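
A minimal sketch of the slicing behaviour described above, using a plain Python list rather than the actual ESGPT or nested_ragged_tensors structures. The `events` list and its contents are hypothetical, and whether `end` is meant to be inclusive in the real code is exactly the open question, so this only illustrates why st=end=0 yields an empty result:

```python
# Hypothetical stand-in for a subject with exactly one observation.
events = ["admission"]

# Assumed values for a single-observation subject, per the comment above.
st, end = 0, 0

# Python slices are half-open, so st:end excludes index `end`.
exclusive = events[st:end]

# Treating `end` as an inclusive index keeps the single observation.
inclusive = events[st:end + 1]

print(exclusive)
print(inclusive)
```

If `end` is already an exclusive bound elsewhere in the pipeline, then st:end+1 would pull in one extra event for longer sequences, which is why the fix still needs checking.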
@Oufattole -- the code in MEDS-Torch is based on this code from ESGPT, and you've been looking at the sub-selection more recently than I have. Is it possible this should be st:end+1, as @juancq postulates?
Also related: when the st:end sub-slicing is integrated into NRTs directly, we need to ensure it is correct: https://github.com/mmcdermott/nested_ragged_tensors/issues/9
Branch: dev
Setting min_seq_len to 1 in PytorchDatasetConfig leads to the following error: