sunlabuiuc / PyHealth

A Deep Learning Python Toolkit for Healthcare Applications.
https://pyhealth.readthedocs.io
MIT License
956 stars 207 forks

Entering deadlock when parsing prescriptions #150

Closed louis-she closed 1 year ago

louis-she commented 1 year ago

I tested some basic code from the tutorial with the MIMIC-IV dataset, but the process hung. I pressed Ctrl-C to exit the program and it printed the call stacks below. It seems like parallel_apply gets into a deadlock (or something similar) when parsing prescriptions.

Reproducing code

import logging
from pyhealth.datasets import MIMIC4Dataset

# Turn on debug logging for PyHealth.
logger = logging.getLogger("pyhealth")
logger.setLevel(logging.DEBUG)

# Load the MIMIC-IV hosp module and map NDC drug codes to level-3 ATC codes.
dataset = MIMIC4Dataset(
    "/home/featurize/data/mimic-iv-2.2/hosp",
    tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"],
    code_mapping={"NDC": ("ATC", {"target_kwargs": {"level": 3}})},
)

Call stacks after Ctrl-C

Loaded NDC->ATC mapping from /home/featurize/.cache/pyhealth/medcode/NDC_to_ATC.pkl                                                                                                                                
Loaded NDC code from /home/featurize/.cache/pyhealth/medcode/NDC.pkl                                     
Loaded ATC code from /home/featurize/.cache/pyhealth/medcode/ATC.pkl                                                                                                                                               
Processing MIMIC4Dataset base dataset...            
INFO: Pandarallel will run on 6 workers.                                                                 
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.                                                                                                               
finish basic patient information parsing : 80.05470561981201s                                            
finish parsing diagnoses_icd : 134.23406291007996s                                                       
finish parsing procedures_icd : 57.97325396537781s                                                       

^CTraceback (most recent call last):                                                                                                                                                                               
  File "main.py", line 7, in <module>
    dataset = MIMIC4Dataset(                                                                             
  File "/home/featurize/work/PyHealth/pyhealth/datasets/base_ehr_dataset.py", line 130, in __init__
    patients = self.parse_tables()                                                                                                                                                                                 
  File "/home/featurize/work/PyHealth/pyhealth/datasets/base_ehr_dataset.py", line 190, in parse_tables
    patients = getattr(self, f"parse_{table.lower()}")(patients)                                                                                                                                                   
  File "/home/featurize/work/PyHealth/pyhealth/datasets/mimic4.py", line 307, in parse_prescriptions
    group_df = group_df.parallel_apply(                                                                                                                                                                            
  File "/environment/miniconda3/envs/py38/lib/python3.8/site-packages/pandarallel/core.py", line 307, in closure
Process ForkPoolWorker-28:                                                                               
Process ForkPoolWorker-31:        
Process ForkPoolWorker-30:                                                                                                                                                                                         
Process ForkPoolWorker-32:          
Process ForkPoolWorker-33:                                                                                                                                                                                         
Process ForkPoolWorker-29:                   
    message: Tuple[int, WorkerStatus, Any] = master_workers_queue.get()                                                                                                                                            
  File "<string>", line 2, in get   
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod  
Traceback (most recent call last):
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):  
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()                    
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap                                                                                                       
    self.run()                    
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()                               
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run                                                                                                              
    self._target(*self._args, **self._kwargs)
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 114, in worker                                                                                                              
    task = get()                    
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 114, in worker    
    task = get()                                                                                                                                                                                                   
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/pool.py", line 114, in worker            
    task = get()
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/queues.py", line 355, in get
    with self._rlock:
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/queues.py", line 356, in get
    res = self._reader.recv_bytes()
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/queues.py", line 355, in get
    with self._rlock:
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/environment/miniconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
ycq091044 commented 1 year ago

Interesting, it tests fine on our end. We will look into it in the next few days.

louis-she commented 1 year ago

I changed the parallel_apply call to a normal apply and everything worked fine.
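
For reference, a minimal, self-contained sketch of that workaround. It is only an illustration: the toy DataFrame stands in for the MIMIC-IV prescriptions table, and the grouped lambda is a placeholder for whatever per-visit function parse_prescriptions actually passes to parallel_apply.

import pandas as pd
from pandarallel import pandarallel

# Toy stand-in for the prescriptions table.
df = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p2"],
    "drug": ["aspirin", "heparin", "insulin", "aspirin"],
})

pandarallel.initialize(progress_bar=False)

# Schematic version of the call that hangs (grouped parallel_apply):
# events = df.groupby("patient_id").parallel_apply(lambda g: list(g["drug"]))

# Single-process fallback that completed without issue:
events = df.groupby("patient_id").apply(lambda g: list(g["drug"]))
print(events)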

ycq091044 commented 1 year ago

Sounds good, it might be an issue with the new pandas version then... But the MIMIC-IV data is large, and parallel_apply might be necessary to accelerate processing. I will dig into it.
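
One possible direction (a hypothetical sketch, not existing PyHealth code): keep parallel_apply as the default for speed, but let callers fall back to the plain apply path when pandarallel misbehaves.

# Hypothetical helper, not part of PyHealth: wrap the grouped apply so a
# single flag switches between pandarallel and the plain pandas path.
def grouped_apply(group_df, func, use_parallel=True):
    if use_parallel:
        return group_df.parallel_apply(func)  # fast path, needs pandarallel workers
    return group_df.apply(func)  # slower but robust single-process path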

BPDanek commented 1 year ago

Which version of pandas do you use?

louis-she commented 1 year ago

pandas==1.5.3 pandarallel==1.6.5

louis-she commented 1 year ago

Actually, I think it's an issue with pandarallel; I think we can close this issue here.

jiangxinke commented 1 year ago

What version of pandarallel is needed?

BPDanek commented 1 year ago

The version in the requires.txt file on the develop branch.


jiangxinke commented 1 year ago

@BPDanek, thanks, I have changed to the specific versions of pandas and pandarallel, but the code still deadlocks:

[screenshot omitted]

No results. Do you know how to fix this? Thanks for your help!