hoon9405 / DescEmb

DescEmb - Unifying Heterogenous Electronic Health Records Systems via Text-Based Code Embedding
MIT License

Directory structure incorrect for pooled preprocessing #10

Closed constantinkappel closed 6 months ago

constantinkappel commented 6 months ago

Thanks for pushing so many updates recently!

I am running preprocessing from your latest commit 499d9d6. I am using ./run_example/preprocessing_run.sh with the following modifications:

INPUT_PATH=/myfolder
OUTPUT_PATH=/myfolder/output
DX_PATH=$INPUT_PATH/ccs_multi_dx_tool_2015.csv

python ../preprocess/preprocess_main.py \
    --src_data mimiciii \
    --dataset_path $INPUT_PATH/mimic \
    --ccs_dx_tool_path $DX_PATH \
    --dest_path $OUTPUT_PATH ;

python ../preprocess/preprocess_main.py \
    --src_data eicu \
    --dataset_path $INPUT_PATH/eicu \
    --ccs_dx_tool_path $DX_PATH \
    --dest_path $OUTPUT_PATH ;

python ../preprocess/preprocess_main.py \
    --src_data pooled \
    --ccs_dx_tool_path $DX_PATH \
    --dest_path $OUTPUT_PATH ;

python ../preprocess/preprocess_main.py \
    --src_data mimiciii \
    --dest_path $OUTPUT_PATH \
    --ccs_dx_tool_path $DX_PATH \
    --data_type pretrain ;

python ../preprocess/preprocess_main.py \
    --src_data eicu \
    --dest_path $OUTPUT_PATH \
    --ccs_dx_tool_path $DX_PATH \
    --data_type pretrain ;

python ../preprocess/preprocess_main.py \
    --src_data pooled \
    --dest_path $OUTPUT_PATH \
    --ccs_dx_tool_path $DX_PATH \
    --data_type pretrain ;

While doing pooled preprocessing I get:

preprocessing for eicu has been done.                                                                                                                                                                       
working directory .. :  /myfolder/DescEmb/run_example                                                                                                                                                      
Destination directory is set to /myfolder/output                                                                                                                                                           
pooled data generation                                                                                                                                                                                      
pooled fold generation                                                                                                                                                                                      
pooled label generation                                                                                                                                                                                     
pooled text and code input generation                                                                                                                                                                       
pooled data generation has been done.                                                                                                                                                                       
preprocessing for pooled has been done.                                                                                                                                                                     
working directory .. :  /myfolder/DescEmb/run_example                                                                                                                                                      
Destination directory is set to /myfolder/output                                                                                                                                                           
Traceback (most recent call last):                                                                                                                                                                          
  File "/myfolder/DescEmb/run_example/../preprocess/preprocess_main.py", line 222, in <module>                                                                                                             
    main()                                                                                                                                                                                                  
  File "/myfolder/DescEmb/run_example/../preprocess/preprocess_main.py", line 172, in main                                                                                                                 
    create_cohort(                                                                                                                                                                                          
  File "/myfolder/DescEmb/preprocess/create_cohort.py", line 20, in create_cohort                                                                                                                          
    function_mapping[src_data](dataset_path, dest_path, ccs_dx_tool_path, icd10to9_path, min_stay_hours)                                                                                                    
  File "/myfolder/DescEmb/preprocess/create_cohort.py", line 27, in create_MIMIC_cohort                                                                                                                    
    if not os.path.exists(dataset_path):                                                                                                                                                                    
  File "/root/miniconda3/envs/py3.10/lib/python3.10/genericpath.py", line 19, in exists                                                                                                                     
    os.stat(path)                                                                                                                                                                                           
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType  
working directory .. :  /myfolder/DescEmb/run_example
Destination directory is set to /myfolder/output                                                     
Traceback (most recent call last):                                                                    
  File "/myfolder/DescEmb/run_example/../preprocess/preprocess_main.py", line 222, in <module>       
    main()                                                                                            
  File "/myfolder/DescEmb/run_example/../preprocess/preprocess_main.py", line 172, in main           
    create_cohort(                                                                                    
  File "/myfolder/DescEmb/preprocess/create_cohort.py", line 20, in create_cohort                    
    function_mapping[src_data](dataset_path, dest_path, ccs_dx_tool_path, icd10to9_path, min_stay_hours)
  File "/myfolder/DescEmb/preprocess/create_cohort.py", line 127, in create_eICU_cohort              
    if not os.path.exists(dataset_path):                                                              
  File "/root/miniconda3/envs/py3.10/lib/python3.10/genericpath.py", line 19, in exists               
    os.stat(path)                                                                                     
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType                   
working directory .. :  /myfolder/DescEmb/run_example                                                
Destination directory is set to /myfolder/output                                                     
pooled data generation                                                                                
Traceback (most recent call last):                                                                    
  File "/myfolder/DescEmb/run_example/../preprocess/preprocess_main.py", line 222, in <module>       
    main()                                                                                            
  File "/myfolder/DescEmb/run_example/../preprocess/preprocess_main.py", line 211, in main           
    pooled_data_generation(                                                                           
  File "/myfolder/DescEmb/preprocess/numpy_convert.py", line 200, in pooled_data_generation          
    mimic_df = pd.read_pickle(os.path.join(dest_path, 'mimiciii_df.pkl'))                             
  File "/myfolder/DescEmb/.venv/lib/python3.10/site-packages/pandas/io/pickle.py", line 185, in read_pickle
    with get_handle(                                                                                  
  File "/myfolder/DescEmb/.venv/lib/python3.10/site-packages/pandas/io/common.py", line 882, in get_handle
    handle = open(handle, ioargs.mode)                                                                
FileNotFoundError: [Errno 2] No such file or directory: '/myfolder/output/mlm/mimiciii_df.pkl'       

I checked the file system: a folder /myfolder/output/mlm/pooled was created, but it does not contain mimiciii_df.pkl. There is, however, such a file in /myfolder/output/. Is it OK to set a symbolic link in /myfolder/output/mlm as a workaround, or are these supposed to be entirely different files?

hoon9405 commented 6 months ago

Thank you for opening an issue about this.

Whether it's for prediction or for pretraining, inputs for MIMIC-III and eICU must be generated sequentially before preparing the pooled input. This is because the pooled input merges the inputs from MIMIC-III and eICU.
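
This ordering requirement can be checked before launching the pooled step. The following is a minimal sketch, not part of the repository: `missing_pooled_inputs` is a hypothetical helper, the file name `mimiciii_df.pkl` and the `mlm` subdirectory are taken from the traceback in this issue, and `eicu_df.pkl` is an assumed name for the eICU counterpart.

```python
import os

# Hypothetical helper, not part of DescEmb.
# "mimiciii_df.pkl" appears in the traceback; "eicu_df.pkl" is an assumed name.
ASSUMED_INPUTS = ("mimiciii_df.pkl", "eicu_df.pkl")


def missing_pooled_inputs(dest_path, data_type=None, filenames=ASSUMED_INPUTS):
    """Return the per-dataset files that the pooled step would not find."""
    # With --data_type pretrain, the traceback shows the lookup under <dest_path>/mlm.
    base = os.path.join(dest_path, "mlm") if data_type == "pretrain" else dest_path
    candidates = [os.path.join(base, name) for name in filenames]
    return [path for path in candidates if not os.path.exists(path)]


if __name__ == "__main__":
    for path in missing_pooled_inputs("/myfolder/output", data_type="pretrain"):
        print("missing:", path)
```

If the returned list is non-empty, the MIMIC-III or eICU step for that data type has probably not completed yet.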

When running the preprocessing code for the pretrain dataset in preprocess_run.sh in the run_example directory, specify dataset_path as shown below.
(Use the same dataset_path as for the predict dataset preparation.)

Also, please use the same dest_path that was used for the predict dataset. This creates a subdirectory named "mlm" under dest_path.

I have added dataset_path to the example run script.

python ../preprocess/preprocess_main.py \
    --src_data mimiciii \
    --dataset_path /user/mimiciii \
    --dest_path /user/descemb/dataset \
    --data_type pretrain ;

python ../preprocess/preprocess_main.py \
    --src_data eicu \
    --dataset_path /user/eicu \
    --dest_path /user/descemb/dataset \
    --data_type pretrain ;

python ../preprocess/preprocess_main.py \
    --src_data pooled \
    --dest_path /user/descemb/dataset \
    --data_type pretrain ;

constantinkappel commented 6 months ago

Thank you so much, @hoon9405, for getting back so quickly!

I tried again with your changes. Essentially, this meant also passing dataset_path in the pretraining steps for mimiciii and eicu.
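
For anyone hitting the same TypeError: it appears to come from os.path.exists() being called with None when dataset_path is omitted. Below is a minimal sketch of an early check that gives a clearer message. `require_dataset_path` is a hypothetical helper of mine, not a function in the repository, and I am assuming from the commands in this thread that the pooled step needs no dataset_path.

```python
import os


def require_dataset_path(src_data, dataset_path):
    """Fail early with a readable message instead of a TypeError from os.stat."""
    if src_data == "pooled":
        # Assumption: the pooled step merges existing outputs and reads no raw dataset.
        return None
    if dataset_path is None:
        raise ValueError(
            f"--dataset_path is required when --src_data is {src_data!r}"
        )
    if not os.path.exists(dataset_path):
        raise FileNotFoundError(f"dataset_path does not exist: {dataset_path}")
    return dataset_path
```

Calling this before the cohort step, for example `require_dataset_path("mimiciii", None)`, raises a ValueError that names the missing flag.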

I am sharing my version of your script, in case it might help somebody:

INPUT_PATH=/home/user/data
OUTPUT_PATH=/home/user/data/output
DX_PATH=$INPUT_PATH/ccs_multi_dx_tool_2015.csv

python ../preprocess/preprocess_main.py \
    --src_data mimiciii \
    --dataset_path $INPUT_PATH/mimic \
    --ccs_dx_tool_path $DX_PATH \
    --dest_path $OUTPUT_PATH ;

python ../preprocess/preprocess_main.py \
    --src_data eicu \
    --dataset_path $INPUT_PATH/eicu \
    --ccs_dx_tool_path $DX_PATH \
    --dest_path $OUTPUT_PATH ;

python ../preprocess/preprocess_main.py \
    --src_data pooled \
    --ccs_dx_tool_path $DX_PATH \
    --dest_path $OUTPUT_PATH ;

python ../preprocess/preprocess_main.py \
    --src_data mimiciii \
    --dataset_path $INPUT_PATH/mimic \
    --dest_path $OUTPUT_PATH \
    --ccs_dx_tool_path $DX_PATH \
    --data_type pretrain ;

python ../preprocess/preprocess_main.py \
    --src_data eicu \
    --dataset_path $INPUT_PATH/eicu \
    --dest_path $OUTPUT_PATH \
    --ccs_dx_tool_path $DX_PATH \
    --data_type pretrain ;

python ../preprocess/preprocess_main.py \
    --src_data pooled \
    --dest_path $OUTPUT_PATH \
    --ccs_dx_tool_path $DX_PATH \
    --data_type pretrain ;
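
The same six-step sequence can also be generated programmatically, which keeps the order and the flags consistent. This is only a sketch under my own assumptions: `build_commands` is a hypothetical helper, the `mimic` and `eicu` subfolder names are the ones from my setup, and only flags that appear in the script above are used. It builds the commands without running them.

```python
import shlex


def build_commands(input_path, output_path, dx_path,
                   script="../preprocess/preprocess_main.py"):
    """Build the six preprocessing commands in the required order."""
    # Order matters: mimiciii and eicu must come before pooled.
    datasets = {
        "mimiciii": f"{input_path}/mimic",
        "eicu": f"{input_path}/eicu",
        "pooled": None,  # pooled is run without --dataset_path
    }
    commands = []
    # First pass: prediction inputs (no --data_type); second pass: pretrain inputs.
    for data_type in (None, "pretrain"):
        for src_data, dataset_path in datasets.items():
            cmd = ["python", script, "--src_data", src_data]
            if dataset_path is not None:
                cmd += ["--dataset_path", dataset_path]
            cmd += ["--ccs_dx_tool_path", dx_path, "--dest_path", output_path]
            if data_type is not None:
                cmd += ["--data_type", data_type]
            commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_commands("/home/user/data", "/home/user/data/output",
                              "/home/user/data/ccs_multi_dx_tool_2015.csv"):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the lists can be passed to subprocess.run(cmd, check=True) so that a failing step stops the sequence.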

constantinkappel commented 6 months ago

Closing issue.