mmcdermott / MEDS_Tabular_AutoML

Limited automatic tabular ML pipelines for generic MEDS datasets.
MIT License
11 stars 3 forks

We should outsource the code metadata dataframe calculation to core MEDS utilities. #28

Open mmcdermott opened 5 months ago

mmcdermott commented 5 months ago

See, e.g., https://github.com/mmcdermott/MEDS_polars_functions/blob/main/scripts/preprocessing/collect_code_metadata.py

mmcdermott commented 4 months ago

This capability is being integrated into MEDS polars directly here: https://github.com/mmcdermott/MEDS_polars_functions/pull/18

Example output:

mbm47 in compute-a-17-72 in MEDS_polars_functions on describe_codes_post_extraction [$] is v0.0.1 via v3.12.3 via MEDS_pipelines took 9s
❯ ./scripts/extraction/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_PRE_MEDS_DIR" cohort_dir="$MIMICIV_MEDS_DIR/3workers_slurm" event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
[2024-06-13 15:16:58,788][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 3 jobs
[2024-06-13 15:16:58,789][HYDRA] Launching jobs, sweep output dir : /n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests/3workers_slurm/.logs/collect_code_metadata                           
[2024-06-13 15:16:58,789][HYDRA]        #0 : worker=0 input_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2 cohort_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
[2024-06-13 15:16:58,789][HYDRA]        #1 : worker=1 input_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2 cohort_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
[2024-06-13 15:16:58,789][HYDRA]        #2 : worker=2 input_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2 cohort_dir=/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
2024-06-13 15:17:00.434 | INFO     | __main__:main:25 - Running with config:                                                                                                                  
input_dir: /n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_raw_files/2.2                                                                                                                               
cohort_dir: /n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests//3workers_slurm                                                                                                              
_default_description: 'This is a MEDS pipeline ETL. Please set a more detailed description                                                                                                    
...
2024-06-13 15:17:00.551 | INFO     | __main__:main:76 - All map shards complete! Starting code metadata reduction computation.                                                                
2024-06-13 15:17:00.879 | INFO     | __main__:main:81 - Finished reduction in 0:00:00.327656   
mbm47 in compute-a-17-72 in MEDS_polars_functions on describe_codes_post_extraction [$] is v0.0.1 via v3.12.3 via MEDS_pipelines took 5s
❯ python                                                                                                                                                                                      
Python 3.12.3 | packaged by Anaconda, Inc. | (main, May  6 2024, 19:46:43) [GCC 11.2.0] on linux                                                                                              
Type "help", "copyright", "credits" or "license" for more information.                                                                                                                        
>>> import polars as pl                                                                                                                                                                       
>>> from pathlib import Path                                                                                                                                                                  
>>> fp = Path("/n/data1/hms/dbmi/zaklab/MIMIC-IV/MEDS_compute_tests/3workers_slurm")                                                                                                          
>>> pl.read_parquet(fp / "code_metadata.parquet", use_pyarrow=True)                                                                                                                           
shape: (77_177, 6)                                                                                                                                                                            
┌─────────────────────────────────┬────────────────────┬─────────────────┬──────────────────────┬────────────┬────────────────┐                                                               
│ code                            ┆ code/n_occurrences ┆ code/n_patients ┆ values/n_occurrences ┆ values/sum ┆ values/sum_sqd │                                                               
│ ---                             ┆ ---                ┆ ---             ┆ ---                  ┆ ---        ┆ ---            │                                                               
│ cat                             ┆ u32                ┆ u32             ┆ u32                  ┆ f64        ┆ f64            │                                                               
╞═════════════════════════════════╪════════════════════╪═════════════════╪══════════════════════╪════════════╪════════════════╡                                                               
│ null                            ┆ 501558412          ┆ 299712          ┆ 223556058            ┆ 1.6846e10  ┆ 3.7424e15      │                                                               
│ LAB//228709//UNK                ┆ 6496               ┆ 1038            ┆ 0                    ┆ 0.0        ┆ 0.0            │                                                               
...
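
For readers unfamiliar with the map/reduce pattern the log lines refer to ("All map shards complete! Starting code metadata reduction computation."), here is a rough pure-Python sketch of how per-shard aggregates with these column names could be combined, and why storing `values/sum` and `values/sum_sqd` is enough to recover a mean and standard deviation later. This is not the code in `collect_code_metadata.py`, which is not shown in this thread. The function names (`map_shard`, `reduce_shards`, `mean_and_std`), the `(patient_id, code, value)` row layout, and the assumption that each patient lives in exactly one shard are all hypothetical; only the column names are taken from the output above.

```python
import math
from collections import defaultdict

# Column names copied from the parquet output above.
# Everything else in this sketch is a hypothetical illustration.
SUM_COLS = (
    "code/n_occurrences",
    "code/n_patients",
    "values/n_occurrences",
    "values/sum",
    "values/sum_sqd",
)


def map_shard(rows):
    """Per-code partial aggregates for one shard.

    `rows` is an iterable of (patient_id, code, value) tuples (assumed
    layout); `value` is None when the event carries no numeric value.
    """
    patients = defaultdict(set)
    out = defaultdict(lambda: dict.fromkeys(SUM_COLS, 0))
    for patient_id, code, value in rows:
        agg = out[code]
        agg["code/n_occurrences"] += 1
        patients[code].add(patient_id)
        if value is not None:
            agg["values/n_occurrences"] += 1
            agg["values/sum"] += value
            agg["values/sum_sqd"] += value**2
    for code, seen in patients.items():
        out[code]["code/n_patients"] = len(seen)
    return dict(out)


def reduce_shards(shard_results):
    """Sum the partial aggregates across shards.

    Adding up code/n_patients is only correct if no patient appears in
    more than one shard (an assumption of this sketch).
    """
    total = {}
    for shard in shard_results:
        for code, agg in shard.items():
            merged = total.setdefault(code, dict.fromkeys(SUM_COLS, 0))
            for col in SUM_COLS:
                merged[col] += agg[col]
    return total


def mean_and_std(agg):
    """Recover the mean and population std from the summed columns."""
    n = agg["values/n_occurrences"]
    if n == 0:
        return None, None
    mean = agg["values/sum"] / n
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(agg["values/sum_sqd"] / n - mean**2, 0.0)
    return mean, math.sqrt(variance)
```

The point of the design is that every column is a plain sum, so the reduction step is a cheap addition regardless of how many workers produced shards, and derived statistics can be computed afterwards without rereading the raw events.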