justin13601 / ACES

ACES: Automatic Cohort Extraction System for Event-Streams
https://eventstreamaces.readthedocs.io/en/latest/
MIT License

Preferred way to query directory of converted MEDS data #113

Closed rvandewater closed 2 months ago

rvandewater commented 2 months ago

Hi!

According to the documentation, it is possible to use a folder as input; however, when I pass a folder instead of a file path, it throws the following error:

(ACES) robin.vandewater@kottos:~/projects/MEDS_TAB_AUMC$ python in_icu_query.py
2024-08-23 11:14:49.982 | INFO     | aces.config:load:1232 - Parsing windows...
2024-08-23 11:14:49.982 | INFO     | aces.config:load:1241 - Parsing trigger event...
2024-08-23 11:14:49.982 | INFO     | aces.config:load:1256 - Parsing predicates...
2024-08-23 11:14:50.375 | INFO     | aces.predicates:generate_plain_predicates_from_meds:318 - Loading MEDS data...
Traceback (most recent call last):
  File "/dhc/home/robin.vandewater/projects/MEDS_TAB_AUMC/in_icu_query.py", line 15, in <module>
    predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=data_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/aces/predicates.py", line 687, in get_predicates_df
    data = generate_plain_predicates_from_meds(data_path, plain_predicates)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/aces/predicates.py", line 319, in generate_plain_predicates_from_meds
    data = pl.read_parquet(data_path).rename({"patient_id": "subject_id", "time": "timestamp"})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 173, in read_parquet
    lf = scan_parquet(
         ^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 384, in scan_parquet
    source = normalize_filepath(source)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dhc/home/robin.vandewater/conda3/envs/ACES/lib/python3.12/site-packages/polars/_utils/various.py", line 197, in normalize_filepath
    raise IsADirectoryError(msg)
IsADirectoryError: expected a file path; '/dhc/home/robin.vandewater/datasets/AUMC/MEDS/data/train' is a directory

Script is simple:

from aces import config, predicates, query
from omegaconf import DictConfig

# create task configuration object
cfg = config.TaskExtractorConfig.load(config_path="/dhc/home/robin.vandewater/projects/MEDS-DEV-AUMC/src/MEDS-DEV/tasks/criteria/mortality/in_icu/first_24h_aumc.yaml")

# get predicates dataframe
data_config = DictConfig(
    {
        "path": "/dhc/home/robin.vandewater/datasets/AUMC/MEDS/data/train/",
        "standard": "meds",
        # "ts_format": "%m/%d/%Y %H:%M",
    }
)
predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=data_config)

# execute query and get results
df_result = query.query(cfg=cfg, predicates_df=predicates_df)

Is this desired behaviour?

mmcdermott commented 2 months ago

@rvandewater, use the ACES CLI. See https://gist.github.com/mmcdermott/c48fda0d25be2465cc039d1986be6fd3 for example scripts covering local runs, parallel runs on a single machine, and runs across Slurm cluster nodes. See https://docs.google.com/document/d/1O9phROCjDaPWj6fkrelTkHeYigUXopXsth5hIR93ft4/edit#heading=h.7azw1viitnh for instructions on using that example.

Your actual script will look like:

PATH="$CONDA_PATH:$PATH" aces-cli --multirun \
  data=sharded \
  data.standard=meds \
  data.root="$DATA_DIR" \
  "data.shard=$(expand_shards $DATA_DIR)" \
  cohort_dir="$COHORT_DIR" \
  cohort_name="$TASK_NAME"

It is important that this happens through the CLI because Hydra handles mapping ACES jobs out across all the shards. By design, ACES itself has no code for iterating over the different files, which is why passing a directory as the data path in the Python API fails.
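To make the "one job per shard" idea concrete, here is a minimal Python sketch of what the fan-out amounts to. It is not the recommended route (the CLI above is) and it is not how ACES or Hydra implement it. The helper name shard_data_configs is hypothetical, and the sketch assumes the shards are flat *.parquet files sitting directly inside the directory.

```python
from pathlib import Path


def shard_data_configs(data_dir, standard="meds"):
    """Return one data config (a plain dict) per Parquet shard in data_dir.

    Hypothetical helper: assumes shards are *.parquet files directly inside
    data_dir, and that each config points at a single file, since the
    traceback above shows a directory path is rejected.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise NotADirectoryError(f"{data_dir} is not a directory")
    # Sort so the shard order is deterministic across runs.
    shards = sorted(data_dir.glob("*.parquet"))
    return [{"path": str(shard), "standard": standard} for shard in shards]


# Usage sketch, reusing the calls from the script in the question
# (left as comments because it needs ACES and the data on disk):
#
# for data_config in shard_data_configs("/dhc/home/robin.vandewater/datasets/AUMC/MEDS/data/train/"):
#     predicates_df = predicates.get_predicates_df(cfg=cfg, data_config=DictConfig(data_config))
#     df_result = query.query(cfg=cfg, predicates_df=predicates_df)
```

Each iteration handles exactly one file, which mirrors the single-file expectation in the Python API. The CLI route remains preferable because Hydra can run those per-shard jobs in parallel or across cluster nodes.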