Normalization stage is checking for aggregate_code_metadata/codes.parquet columns and metadata/codes.parquet columns in data/codes.parquet

Oufattole commented 1 month ago

The normalization stage is failing for me because there is no data/codes.parquet file.

When I try to copy over the metadata/codes.parquet file with cp "${MEDS_DIR}/data/metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet", I get an error that there is no values/sum column.

And when I try to copy over the aggregate_code_metadata/codes.parquet file with cp "${MEDS_DIR}/aggregate_code_metadata/codes.parquet" "${MEDS_DIR}/data/codes.parquet", I get an error that there is no "code/vocab_index" column.
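
To see which of the two columns a given file lacks before running the stage, a small check like the one below may help. This is only a sketch: the helper names `missing_columns` and `parquet_columns` are hypothetical, and treating `values/sum` and `code/vocab_index` as the full set of required columns is an assumption based solely on the two error messages above.

```python
from typing import Iterable, List

# The two names come from the error messages quoted above; assuming they are
# the complete list of columns the normalization stage needs is a guess.
REQUIRED_COLUMNS = ("values/sum", "code/vocab_index")


def missing_columns(
    present: Iterable[str], required: Iterable[str] = REQUIRED_COLUMNS
) -> List[str]:
    """Return the required column names absent from `present`, in order."""
    present_set = set(present)
    return [name for name in required if name not in present_set]


def parquet_columns(path: str) -> List[str]:
    """Read only the schema of a parquet file and return its column names."""
    import polars as pl  # imported lazily so missing_columns has no dependencies

    return list(pl.read_parquet_schema(path))
```

For example, `missing_columns(parquet_columns(f"{meds_dir}/data/codes.parquet"))` would list whichever of the two columns the copied file does not have.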

What worked for me as a temporary solution was to spin up a simple hydra script to generate a code/vocab_index column:

from dataclasses import dataclass
from pathlib import Path

import hydra
from hydra.core.config_store import ConfigStore
import polars as pl
from loguru import logger
from omegaconf import DictConfig, MISSING

@dataclass
class Config:
    # Root MEDS directory; must be passed on the command line (meds_dir=...)
    meds_dir: str = MISSING

cs = ConfigStore.instance()
# Register the Config class as the structured config named `config`
cs.store(name="config", node=Config)

@hydra.main(version_base=None, config_name="config")
def main(cfg: Config):
    meds_dir = Path(cfg.meds_dir)
    df = pl.read_parquet(meds_dir / "aggregate_code_metadata/codes.parquet")
    df.with_row_index("code/vocab_index").write_parquet(meds_dir / "data/codes.parquet")
    logger.info("Done adding code/vocab_index column to codes.parquet!")

if __name__ == "__main__":
    main()

This issue exists on both the dev branch and release 0.0.4.