hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

Core affinity issue #175

Open jpetucci opened 1 year ago

jpetucci commented 1 year ago

Description: Running inference with the --enable_workflow (Ray Workflow) option causes all processes to be pinned to a single core.

Steps to Reproduce: Follow the conda or container installation instructions and run inference with the --enable_workflow option.

Expected Behavior: The workload should be spread across the resources available in the Ray cluster, i.e. processes should run on different cores.

Actual Behavior: All processes and Ray workers have the same CPU/core affinity.

In top, notice that the P column (last used CPU) is 0 for all FastFold processes:

top - 11:12:45 up 11 days, 17:25,  2 users,  load average: 20.82, 17.44, 8.73
Tasks: 734 total,   5 running, 729 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.8 sy,  1.3 ni, 97.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 385438.1 total,   1440.9 free,  18176.0 used, 365821.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 359381.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P COMMAND                                                                                                                                                                                                                                                                    
2125171 jmp579    35  15 1782.8g  18.3g   8.8g R  59.6   4.9   2:42.05  0 hhblits                                                                                                                                                                                                                                                                    
2125248 jmp579    35  15 1062268 137828   2996 R  12.3   0.0   1:09.75  0 jackhmmer                                                                                                                                                                                                                                                                  
2125246 jmp579    35  15 1007988  89680   2872 R  11.9   0.0   1:08.86  0 jackhmmer                                                                                                                                                                                                                                                                  
2125247 jmp579    35  15 1025432 105864   2884 R  10.3   0.0   0:58.70  0 jackhmmer                                                                                                                                                                                                                                                                  
2123624 jmp579    20   0  225.7g  86096  13028 S   1.0   0.0   0:02.73  0 raylet                                                                                                                                                                                                                                                                     
2123631 jmp579    20   0  804288 105296  49400 S   1.0   0.0   0:04.43  0 python                                                                                                                                                                                                                                                                     
2123420 jmp579    20   0  129.4g 437152 210308 S   0.7   0.1   0:12.96  4 python                                                                                                                                                                                                                                                                     
2123574 jmp579    20   0 1073304  32960  11948 S   0.7   0.0   0:01.69  0 gcs_server                                                                                                                                                                                                                                                                 
2123706 jmp579    35  15  113.1g 106740  54952 S   0.3   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123718 jmp579    35  15  113.1g 106980  55192 S   0.3   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123722 jmp579    35  15  113.1g 107372  55492 S   0.3   0.0   0:00.75  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123730 jmp579    35  15  113.1g 107144  55368 S   0.3   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123797 jmp579    35  15  114.8g 116336  61196 S   0.3   0.0   0:00.81  0 ray::_workflow_                                                                                                                                                                                                                                                            
2124995 jmp579    35  15  114.9g 115916  60600 S   0.3   0.0   0:00.72  0 ray::WorkflowMa                                                                                                                                                                                                                                                            
2126067 jmp579    20   0  276204   6036   4428 R   0.3   0.0   0:00.15 35 top                                                                                                                                                                                                                                                                        
2123086 jmp579    20   0  170892   6664   4888 S   0.0   0.0   0:00.01 19 sshd                                                                                                                                                                                                                                                                       
2123087 jmp579    20   0  235076   5316   3620 S   0.0   0.0   0:00.07 27 bash                                                                                                                                                                                                                                                                       
2123310 jmp579    20   0  170892   6664   4884 S   0.0   0.0   0:00.07 28 sshd                                                                                                                                                                                                                                                                       
2123311 jmp579    20   0  234772   4924   3516 S   0.0   0.0   0:00.06 31 bash                                                                                                                                                                                                                                                                       
2123407 jmp579    20   0  220704   3420   3084 S   0.0   0.0   0:00.00 31 bash                                                                                                                                                                                                                                                                       
2123592 jmp579    20   0  952648 103220  49492 S   0.0   0.0   0:00.57  0 python                                                                                                                                                                                                                                                                     
2123599 jmp579    20   0  957180 107896  49792 S   0.0   0.0   0:00.60  0 python                                                                                                                                                                                                                                                                     
2123682 jmp579    20   0  883632 110344  49640 S   0.0   0.0   0:01.02  0 python                                                                                                                                                                                                                                                                     
2123705 jmp579    35  15  113.1g 106984  55200 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123707 jmp579    35  15  113.1g 107012  55220 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123708 jmp579    35  15  113.1g 106724  54956 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123709 jmp579    35  15  113.1g 108988  55236 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123710 jmp579    35  15  113.1g 109304  55468 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123711 jmp579    35  15  113.1g 107060  55268 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123712 jmp579    35  15  113.1g 107492  55716 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123713 jmp579    35  15  113.1g 106824  55048 S   0.0   0.0   0:00.72  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123714 jmp579    35  15  113.1g 107204  55416 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123716 jmp579    35  15  113.1g 107036  55248 S   0.0   0.0   0:00.73  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123717 jmp579    35  15  113.1g 108752  54932 S   0.0   0.0   0:00.72  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123719 jmp579    35  15  113.1g 107388  55592 S   0.0   0.0   0:00.74  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123720 jmp579    35  15  113.1g 106728  54956 S   0.0   0.0   0:00.75  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123721 jmp579    35  15  113.1g 107036  55248 S   0.0   0.0   0:00.75  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123723 jmp579    35  15  113.1g 107356  55560 S   0.0   0.0   0:00.75  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123725 jmp579    35  15  113.1g 106744  54964 S   0.0   0.0   0:00.75  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123726 jmp579    35  15  113.1g 106888  55112 S   0.0   0.0   0:00.74  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123727 jmp579    35  15  113.1g 107004  55212 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123728 jmp579    35  15  113.1g 107044  55252 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123729 jmp579    35  15  113.1g 107124  55348 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123731 jmp579    35  15  113.1g 106944  55164 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123732 jmp579    35  15  113.1g 106784  55008 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123733 jmp579    35  15  113.1g 106996  55204 S   0.0   0.0   0:00.77  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123739 jmp579    35  15  113.1g 106872  55100 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123740 jmp579    35  15  113.1g 107312  55536 S   0.0   0.0   0:00.77  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123742 jmp579    35  15  113.1g 107216  55432 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123782 jmp579    35  15  113.1g 106884  55104 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123783 jmp579    35  15  113.1g 107128  55348 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123784 jmp579    35  15  113.1g 107040  55240 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123785 jmp579    35  15  113.1g 107100  55304 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123786 jmp579    35  15  113.1g 107008  55216 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123787 jmp579    35  15  113.1g 108688  54956 S   0.0   0.0   0:00.77  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123788 jmp579    35  15  113.1g 107040  55244 S   0.0   0.0   0:00.76  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123789 jmp579    35  15  113.1g 107216  55436 S   0.0   0.0   0:00.77  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123790 jmp579    35  15  113.1g 106488  54724 S   0.0   0.0   0:00.77  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123791 jmp579    35  15  113.1g 107052  55272 S   0.0   0.0   0:00.77  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123792 jmp579    35  15  113.1g 106908  55132 S   0.0   0.0   0:00.79  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123793 jmp579    35  15  114.8g 116588  61416 S   0.0   0.0   0:00.82  0 ray::_workflow_                                                                                                                                                                                                                                                            
2123794 jmp579    35  15  113.1g 106996  55216 S   0.0   0.0   0:00.79  0 ray::IDLE                                                                                                                                                                                                                                                                  
2123795 jmp579    35  15  114.8g 116376  61224 S   0.0   0.0   0:00.80  0 ray::_workflow_                                                                                                                                                                                                                                                            
2123796 jmp579    35  15  114.8g 115044  61140 S   0.0   0.0   0:00.79  0 ray::_workflow_                                                                                                                                                                                                                                                            
2125095 jmp579    35  15  113.3g 106756  55196 S   0.0   0.0   0:00.65  0 ray::Manager 

This is also confirmed with taskset (output is truncated):

$ for pid in $(ps -ef | grep ray | awk '{print $2}'); do taskset -cp $pid; done
pid 2123574's current affinity list: 0
pid 2123592's current affinity list: 0
pid 2123599's current affinity list: 0
pid 2123624's current affinity list: 0
pid 2123631's current affinity list: 0
pid 2123682's current affinity list: 0
pid 2123705's current affinity list: 0
pid 2123706's current affinity list: 0
pid 2123707's current affinity list: 0
pid 2123708's current affinity list: 0
pid 2123709's current affinity list: 0
pid 2123710's current affinity list: 0
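
The same check can be sketched in Python, assuming Linux (os.sched_getaffinity is not available on every platform). The helper names affinity_of and pinned_pids are made up for this illustration and are not part of FastFold or Ray:

```python
import os


def affinity_of(pid=0):
    """Return the sorted list of cores that `pid` may run on (0 = this process)."""
    return sorted(os.sched_getaffinity(pid))


def pinned_pids(pids):
    """Return {pid: cores} for every process restricted to exactly one core."""
    pinned = {}
    for pid in pids:
        try:
            cores = affinity_of(pid)
        except (ProcessLookupError, PermissionError):
            # The process exited or cannot be inspected; skip it.
            continue
        if len(cores) == 1:
            pinned[pid] = cores
    return pinned


if __name__ == "__main__":
    print("this process may run on cores:", affinity_of())
    print("pinned to a single core:", pinned_pids([os.getpid()]))
```

Passing the PIDs of the Ray workers to pinned_pids should report the same single-core affinity that taskset shows above.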

Environment:

Operating System: Red Hat Enterprise Linux 8.7 (Ootpa)
Software Version: Latest, commit id 05681304651b1b29d7d887db169045ea3dd28fce

Steps Taken to Resolve: This seems to be a torch issue; see ray-project/ray/issues/34201 and pytorch/pytorch/issues/99625. One fix is to set KMP_AFFINITY to disabled before running inference:

export KMP_AFFINITY=disabled
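
If inference is launched from a Python wrapper rather than a shell, the variable can be passed to the child process instead. This is only a sketch: the helper names are hypothetical, and the child command is a stand-in that echoes the variable, not the real FastFold inference command.

```python
import os
import subprocess
import sys


def env_with_kmp_affinity_disabled(base=None):
    """Return a copy of the environment with KMP_AFFINITY=disabled."""
    env = dict(os.environ if base is None else base)
    env["KMP_AFFINITY"] = "disabled"
    return env


def run_with_affinity_disabled(cmd):
    """Run `cmd` (a list of arguments) with KMP_AFFINITY=disabled set for the child."""
    return subprocess.run(
        cmd,
        env=env_with_kmp_affinity_disabled(),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in child that only echoes the variable; substitute the actual
    # inference command and its arguments here.
    child = [sys.executable, "-c", "import os; print(os.environ['KMP_AFFINITY'])"]
    print(run_with_affinity_disabled(child).stdout.strip())  # prints: disabled
```

The variable has to be in the environment before torch is loaded, which is why it is set on the child process rather than changed after startup.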

Gy-Lu commented 1 year ago

We have noticed this problem also, thanks for the solution :)