mmcdermott / MEDS_transforms

A simple set of MEDS polars-based ETL and transformation functions
MIT License

Normalization stage is checking for aggregate_code_metadata/codes.parquet columns and metadata/codes.parquet columns in data/codes.parquet #147

Closed Oufattole closed 1 month ago

Oufattole commented 1 month ago

The normalization stage is failing for me because there is no data/codes.parquet file.

When I try to copy over the metadata/codes.parquet file with cp "${MEDS_DIR}/data/metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet", I get an error that there is no "values/sum" column.

And when I try to copy over aggregate_code_metadata/codes.parquet with cp "${MEDS_DIR}/aggregate_code_metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet", I get an error that there is no "code/vocab_index" column.

What worked for me as a temporary solution was to spin up a simple hydra script to generate a code/vocab_index column:

from dataclasses import dataclass
from pathlib import Path

import hydra
import polars as pl
from hydra.core.config_store import ConfigStore
from loguru import logger
from omegaconf import MISSING

@dataclass
class Config:
    meds_dir: str = MISSING

cs = ConfigStore.instance()
# Register the Config dataclass as the structured config named `config`
cs.store(name="config", node=Config)

@hydra.main(version_base=None, config_name="config")
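def main(cfg: Config):
    meds_dir = Path(cfg.meds_dir)
    # Start from the aggregated code metadata, which only lacked code/vocab_index.
    df = pl.read_parquet(meds_dir / "aggregate_code_metadata/codes.parquet")
    # Add a 0-based row index as code/vocab_index and write it where normalization looks.
    df.with_row_index("code/vocab_index").write_parquet(meds_dir / "data/codes.parquet")
    logger.info("Done adding code/vocab_index column to codes.parquet!")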

if __name__ == "__main__":
    main()

This issue exists on the dev branch and in release 0.0.4.

mmcdermott commented 1 month ago

I think this line should be removed: https://github.com/mmcdermott/MEDS_transforms/blob/158_fix_typing_issue/src/MEDS_transforms/configs/stage_configs/fit_vocabulary_indices.yaml#L4

That may not be the entire problem, but I suspect it is part of it.

mmcdermott commented 1 month ago

I believe this line: https://github.com/mmcdermott/MEDS_transforms/blob/158_fix_typing_issue/src/MEDS_transforms/utils.py#L307 should point to "reducer_output_dir", not "output_dir".

mmcdermott commented 1 month ago

Clearly, an integration test that chains multiple stages, including more than one metadata stage, is also needed, not just single-stage tests.

mmcdermott commented 1 month ago

Subsidiary issues:

mmcdermott commented 1 month ago

Fixed by #167 and verified with a full, E2E preprocess pipeline integration test.