Open mmcdermott opened 5 months ago
This capability is being integrated directly into MEDS_polars_functions in this PR: https://github.com/mmcdermott/MEDS_polars_functions/pull/18
Example output:
mbm47 in compute-a-17-72 in MEDS_polars_functions on describe_codes_post_extraction [$] is v0.0.1 via v3.12.3 via MEDS_pipelines took 9s
❯ ./scripts/extraction/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_PRE_MEDS_DIR" cohort_dir="$MIMICIV_MEDS_DIR/3workers_slurm" event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
[2024-06-13 15:16:58,788][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 3 jobs
[2024-06-13 15:16:58,789][HYDRA] Launching jobs, sweep output dir : /n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests/3workers_slurm/.logs/collect_code_metadata
[2024-06-13 15:16:58,789][HYDRA] #0 : worker=0 input_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2 cohort_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
[2024-06-13 15:16:58,789][HYDRA] #1 : worker=1 input_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2 cohort_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
[2024-06-13 15:16:58,789][HYDRA] #2 : worker=2 input_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2 cohort_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
2024-06-13 15:17:00.434 | INFO | __main__:main:25 - Running with config:
input_dir: /n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2
cohort_dir: /n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm
_default_description: 'This is a MEDS pipeline ETL. Please set a more detailed description
...
2024-06-13 15:17:00.551 | INFO | __main__:main:76 - All map shards complete! Starting code metadata reduction computation.
2024-06-13 15:17:00.879 | INFO | __main__:main:81 - Finished reduction in 0:00:00.327656
mbm47 in compute-a-17-72 in MEDS_polars_functions on describe_codes_post_extraction [$] is v0.0.1 via v3.12.3 via MEDS_pipelines took 5s
❯ python
Python 3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> from pathlib import Path
>>> fp = Path("/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests/3workers_slurm")
>>> pl.read_parquet(fp / "code_metadata.parquet", use_pyarrow=True)
shape: (77_177, 6)
┌─────────────────────────────────┬────────────────────┬─────────────────┬──────────────────────┬────────────┬────────────────┐
│ code                            ┆ code/n_occurrences ┆ code/n_patients ┆ values/n_occurrences ┆ values/sum ┆ values/sum_sqd │
│ ---                             ┆ ---                ┆ ---             ┆ ---                  ┆ ---        ┆ ---            │
│ cat                             ┆ u32                ┆ u32             ┆ u32                  ┆ f64        ┆ f64            │
╞═════════════════════════════════╪════════════════════╪═════════════════╪══════════════════════╪════════════╪════════════════╡
│ null                            ┆ 501558412          ┆ 299712          ┆ 223556058            ┆ 1.6846e10  ┆ 3.7424e15      │
│ LAB//228709//UNK                ┆ 6496               ┆ 1038            ┆ 0                    ┆ 0.0        ┆ 0.0            │
...
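The `values/n_occurrences`, `values/sum`, and `values/sum_sqd` columns in the output above are enough to recover a per-code mean and standard deviation without rereading the raw values. Below is a minimal sketch of that arithmetic in plain Python. The helper name `mean_and_std` is hypothetical, and the use of the population variance (dividing by n rather than n - 1) is an assumption, not something the output confirms.

```python
import math
from typing import Optional, Tuple


def mean_and_std(n: int, total: float, total_sqd: float) -> Optional[Tuple[float, float]]:
    """Return (mean, population std) from a count, a sum, and a sum of squares.

    Hypothetical helper: returns None when a code has no numeric values.
    """
    if n == 0:
        return None
    mean = total / n
    # Var[x] = E[x^2] - E[x]^2; clamp at zero to absorb floating-point round-off.
    variance = max(total_sqd / n - mean**2, 0.0)
    return mean, math.sqrt(variance)


# Rounded figures as displayed in the `null` row of the table above.
null_row = mean_and_std(223_556_058, 1.6846e10, 3.7424e15)
# The LAB//228709//UNK row has values/n_occurrences == 0, so there is nothing to summarize.
empty_row = mean_and_std(0, 0.0, 0.0)
print(null_row, empty_row)
```

Because the inputs are the rounded values polars prints, the results for the `null` row are only approximate; the full-precision columns in `code_metadata.parquet` would give exact figures.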
See, e.g., https://github.com/mmcdermott/MEDS_polars_functions/blob/main/scripts/preprocessing/collect_code_metadata.py
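The log above reports a map step per worker followed by a "code metadata reduction". One plausible reason to store counts, sums, and sums of squares is that they are additive, so per-shard tables can be merged by plain summation. The sketch below illustrates that idea only; it is not the linked script's implementation. The function `reduce_shards`, the dictionary layout, and the toy shard contents are all made up for illustration. `code/n_patients` is left out on purpose, since summing it is only valid if no patient appears in more than one shard.

```python
from collections import defaultdict
from typing import Dict, Iterable

Stats = Dict[str, float]

# Columns that can be merged across shards by simple addition.
ADDITIVE_COLUMNS = (
    "code/n_occurrences",
    "values/n_occurrences",
    "values/sum",
    "values/sum_sqd",
)


def reduce_shards(shards: Iterable[Dict[str, Stats]]) -> Dict[str, Stats]:
    """Merge per-shard {code: stats} mappings by summing each additive column."""
    merged: Dict[str, Stats] = defaultdict(lambda: dict.fromkeys(ADDITIVE_COLUMNS, 0))
    for shard in shards:
        for code, stats in shard.items():
            for column in ADDITIVE_COLUMNS:
                merged[code][column] += stats[column]
    return dict(merged)


# Toy shards with invented codes and numbers, purely for illustration.
shard_a = {
    "LAB//A": {"code/n_occurrences": 3, "values/n_occurrences": 2, "values/sum": 5.0, "values/sum_sqd": 13.0},
}
shard_b = {
    "LAB//A": {"code/n_occurrences": 1, "values/n_occurrences": 1, "values/sum": 4.0, "values/sum_sqd": 16.0},
    "LAB//B": {"code/n_occurrences": 2, "values/n_occurrences": 0, "values/sum": 0.0, "values/sum_sqd": 0.0},
}
merged = reduce_shards([shard_a, shard_b])
print(merged)
```

A mean or standard deviation computed per shard could not be combined this way, which is why the additive form is convenient when several workers each process a subset of the data.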