Misalignment Between Static and Event Sequence DataFrames

Oufattole commented 2 months ago

I've noticed an issue during the tokenization stage on my dataset. Specifically, the static DataFrame for shard 0 (${cohort_dir}/tokenization/schemas/train/0.parquet) has a shape of 45,677, while the corresponding event sequence DataFrame (${cohort_dir}/tokenization/event_seqs/train/0.parquet) has a shape of 45,687.

These DataFrames are supposed to be aligned, meaning each index in the static DataFrame should correspond directly to an index in the event sequence DataFrame. However, the shape mismatch suggests that this may not be the case.
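As a minimal sketch of what alignment means here, the check below compares two lists of patient identifiers position by position. It is plain Python rather than the actual tokenization code, and it assumes each row of both DataFrames carries a patient identifier; the function name `find_misaligned` and the toy ID lists are illustrative, not taken from the real shards.

```python
def find_misaligned(static_ids, event_seq_ids):
    """Compare two ID lists that are expected to be row-aligned.

    Returns (only_in_static, only_in_event_seqs, first_mismatch_index).
    first_mismatch_index is None when the lists are identical.
    """
    static_set, event_set = set(static_ids), set(event_seq_ids)
    only_in_static = sorted(static_set - event_set)
    only_in_event_seqs = sorted(event_set - static_set)

    # First row position at which the two lists disagree.
    first_mismatch = next(
        (i for i, (a, b) in enumerate(zip(static_ids, event_seq_ids)) if a != b),
        None,
    )
    # If one list is a strict prefix of the other, the mismatch starts
    # where the shorter list ends.
    if first_mismatch is None and len(static_ids) != len(event_seq_ids):
        first_mismatch = min(len(static_ids), len(event_seq_ids))
    return only_in_static, only_in_event_seqs, first_mismatch


# Toy data (hypothetical): patient 3 is missing from the static rows.
static_ids = [1, 2, 4]
event_seq_ids = [1, 2, 3, 4]

only_static, only_events, first_mismatch = find_misaligned(static_ids, event_seq_ids)
print("only in static:", only_static)
print("only in event seqs:", only_events)
print("first misaligned row:", first_mismatch)
```

On real shards, the two ID lists would come from the identifier column of each parquet file. A single missing patient shifts every later row, so index-based lookups past that point would pair static data with the wrong event sequence.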

I'm planning to reproduce this issue on a dummy dataset to see if it persists. However, I'm curious whether this behavior is expected, as it impacts the assumptions made in the PyTorch dataset class in meds-torch. This class assumes alignment between the static DataFrames and the joint nested ragged tensors, which, as far as I understand, are derived from the event sequence DataFrames.

Could this be a bug, or is there an intended reason for the discrepancy in shapes?

mmcdermott commented 2 months ago

I suspect this is for patients who don't have any static data, but that's just a guess. I'm looking now.

mmcdermott commented 2 months ago

Yep -- it is patients who don't have static data, I'm almost certain. The join here is an "inner" join, when it should probably be a full outer join instead: https://github.com/mmcdermott/MEDS_transforms/blob/main/src/MEDS_transforms/transforms/tokenization.py#L161

You should be able to validate that this causes an issue by adding some patients with no static data to (1) the doctest for the function linked above and (2) the single-stage integration test here: https://github.com/mmcdermott/MEDS_transforms/blob/main/tests/MEDS_Transforms/test_tokenization.py
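The effect of the join type can be sketched in plain Python, independent of the actual tokenization code. This is only an illustration under assumed inputs: the dictionaries keyed by patient ID, the helper names `inner_join` and `full_outer_join`, and the code strings are all hypothetical.

```python
def inner_join(static_rows, event_rows):
    """Keep only patient IDs that appear in both inputs."""
    return {
        pid: (static_rows[pid], event_rows[pid])
        for pid in static_rows
        if pid in event_rows
    }


def full_outer_join(static_rows, event_rows):
    """Keep every patient ID from either input, filling gaps with None."""
    all_ids = sorted(set(static_rows) | set(event_rows))
    return {pid: (static_rows.get(pid), event_rows.get(pid)) for pid in all_ids}


# Toy data (hypothetical): patient 3 has events but no static measurements.
static_rows = {1: ["SEX//F"], 2: ["SEX//M"]}
event_rows = {1: [["HR"], ["BP"]], 2: [["HR"]], 3: [["LAB"]]}

inner = inner_join(static_rows, event_rows)
outer = full_outer_join(static_rows, event_rows)

print("event sequence rows:", len(event_rows))
print("inner join rows:", len(inner))
print("full outer join rows:", len(outer))
print("patient 3 after outer join:", outer[3])
```

With the inner join, the patient without static data disappears from the result, so it ends up one row shorter than the event sequences. The full outer join keeps that patient with an empty static entry, which preserves the row count.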

mmcdermott / MEDS_transforms

Misalignment Between Static and Event Sequence DataFrames #197