exec-port stuck at 100% cpu

axelson commented 4 years ago

When load-testing I've triggered a case where the exec-port OS process is taking 100% CPU (as reported by the top command). This 100% cpu is persisting despite very few requests coming in to the server for over 15 minutes).

I'm not sure what's happening yet but maybe there is an infinite loop that was triggered somehow. It is pretty easy to trigger once I get about 15 requests per second.

Edit: I should note that I am running erlexec 1.17.3 with elixir 1.10.4-otp-22 and erlang 22.3.4.3. I can try again after upgrading erlexec but none of the commits looked relevant to my issue.

saleyn commented 4 years ago

I would need a reproducible test to troubleshoot this. The existing tests include this one, which spawns 300 processes, and sends more than 15 requests per second, but it doesn't seem to tax the CPU in the way you describe. Also what OS are you using? Possibly the problem only manifests itself on that OS.

axelson commented 4 years ago

@saleyn Okay, I've created a reproduction repository for this: https://github.com/axelson/erlexec_repro

Symptoms: erlexec is able to spawn new processes, but at a certain number (which is surprisingly consistent) it just hangs for as long as I let it run, the ExUnit timeout eventually kicks in and the tests finish, but the VM doesn't stop. Here's some sample output:

number: 734
number: 735
number: 736
number: 737
number: 738
number: 739
number: 740
number: 741

  1) test greets the world (ErlexecReproTest)
     test/erlexec_repro_test.exs:5
     ** (exit) exited in: Task.Supervised.stream(30000)
         ** (EXIT) time out
     code: |> Enum.to_list()
     stacktrace:
       (elixir) lib/task/supervised.ex:276: Task.Supervised.stream_reduce/7
       (elixir) lib/enum.ex:3015: Enum.reverse/1
       (elixir) lib/enum.ex:2649: Enum.to_list/1
       test/erlexec_repro_test.exs:18: (test)

Finished in 30.0 seconds
1 test, 1 failure

Randomized with seed 206620

It utilizes dfrotz (https://gitlab.com/DavidGriffith/frotz), although I'd expect that any process that generates I/O (possibly one that doesn't) would work.

The code is written in Elixir, but there isn't too much of it so it should be pretty easy to follow. I don't think the specific elixir and erlang versions make a big difference but I've included the versions I tested with in the .tool-versions asdf-vm file.

The production server I'm using is Heroku which is a version of ubuntu: https://devcenter.heroku.com/articles/heroku-18-stack

I've reproduced this on Ubuntu 19.10 Server, but am unable to reproduce on MacOS 10.15.6

axelson commented 4 years ago

Upon running strace on the spinning exec-port I see a constant stream of EBADF errors, all with the same data

$ uniq -c output.txt
      1 strace: Process 18534 attached
   8254 select(1031, [0 4 8 10 11 13 14 16 17 19 20 22 23 25 26 28 29 31 32 34 35 37 38 40 41 43 44 46 47 49 50 52 53 55 56 58 59 61 62 64 65 67 68 70 71 73 74 76 77 79 80 82 83 85 86 88 89 91 92 94 95 97 98 100 101 103 104 106 107 109 110 112 113 115 116 118 119 121 122 124 125 127 128 130 131 133 134 136 137 139 140 142 143 145 146 148 149 151 152 154 155 157 158 160 161 163 164 166 167 169 170 172 173 175 176 178 179 181 182 184 185 187 188 190 191 193 194 196 197 199 200 202 203 205 206 208 209 211 212 214 215 217 218 220 221 223 224 226 227 229 230 232 233 235 236 238 239 241 242 244 245 247 248 250 251 253 254 256 257 259 260 262 263 265 266 268 269 271 272 274 275 277 278 280 281 283 284 286 287 289 290 292 293 295 296 298 299 301 302 304 305 307 308 310 311 313 314 316 317 319 320 322 323 325 326 328 329 331 332 334 335 337 338 340 341 343 344 346 347 349 350 352 353 355 356 358 359 361 362 364 365 367 368 370 371 373 374 376 377 379 380 382 383 385 386 388 389 391 392 394 395 397 398 400 401 403 404 406 407 409 410 412 413 415 416 418 419 421 422 424 425 427 428 430 431 433 434 436 437 439 440 442 443 445 446 448 449 451 452 454 455 457 458 460 461 463 464 466 467 469 470 472 473 475 476 478 479 481 482 484 485 487 488 490 491 493 494 496 497 499 500 502 503 505 506 508 509 511 512 514 515 517 518 520 521 523 524 526 527 529 530 532 533 535 536 538 539 541 542 544 545 547 548 550 551 553 554 556 557 559 560 562 563 565 566 568 569 571 572 574 575 577 578 580 581 583 584 586 587 589 590 592 593 595 596 598 599 601 602 604 605 607 608 610 611 613 614 616 617 619 620 622 623 625 626 628 629 631 632 634 635 637 638 640 641 643 644 646 647 649 650 652 653 655 656 658 659 661 662 664 665 667 668 670 671 673 674 676 677 679 680 682 683 685 686 688 689 691 692 694 695 697 698 700 701 703 704 706 707 709 710 712 713 715 716 718 719 721 722 724 725 727 728 730 731 733 734 736 737 739 740 742 743 745 746 748 749 751 752 754 755 757 758 760 761 763 764 766 767 769 770 772 773 775 776 778 779 781 782 784 785 787 788 790 791 793 794 796 797 799 800 802 803 805 806 808 809 811 812 814 815 817 818 820 821 823 824 826 827 829 830 832 833 835 836 838 839 841 842 844 845 847 848 850 851 853 854 856 857 859 860 862 863 865 866 868 869 871 872 874 875 877 878 880 881 883 884 886 887 889 890 892 893 895 896 898 899 901 902 904 905 907 908 910 911 913 914 916 917 919 920 922 923 925 926 928 929 931 932 934 935 937 938 940 941 943 944 946 947 949 950 952 953 955 956 958 959 961 962 964 965 967 968 970 971 973 974 976 977 979 980 982 983 985 986 988 989 991 992 994 995 997 998 1000 1001 1003 1004 1006 1007 1009 1010 1012 1013 1015 1016 1018 1019 1021 1022 1024 1025 1027 1028 1030], [0 1 3 4 6], NULL, {tv_sec=5, tv_usec=0}) = -1 EBADF (Bad file descriptor)
      1 strace: Process 18534 detached

saleyn commented 4 years ago

It looks like you are hitting the limit of file descriptors (OS default is 1024). Try running with "ulimit -n 4096" or whatever the limit you want to set, given that spawning each OS process would consume up to 3 file descriptors (for stdin/out/err).

axelson commented 4 years ago

In my production environment I get back 1048576 as the already existing limit which I don't think that I am hitting. Is it possible that I'm hitting another (somewhat similar) error in production, possibly a different limit? And is it possible for exec-port to recover from the error?

I'll try to track down the specific error that I am hitting in production.

saleyn commented 4 years ago

EBADF normally indicates an attempt to select from a closed FD. But this issue maybe masking some other problem related to hitting some limit.

I won't have time to look at this issue and your repos for the next 2-3 months, but it was in my plans to rewrite the select() with a poll(), which in this case would allow to determine which FD has an issue and possibly recover gracefully.

axelson commented 4 years ago

Okay, thanks for the info.

chrta commented 3 years ago

FYI, i see the same issue in fedora 33, erlang 23.2.1 and elixir 1.11.2 with your repo:

$ ulimit -n 1048576
$ LANG=C mix test

...
number: 736
number: 737
number: 738
number: 739

  1) test greets the world (ErlexecReproTest)
     test/erlexec_repro_test.exs:5
     ** (exit) exited in: Task.Supervised.stream(30000)
         ** (EXIT) time out
     code: |> Enum.to_list()
     stacktrace:
       (elixir 1.11.2) lib/task/supervised.ex:304: Task.Supervised.stream_reduce/7
       (elixir 1.11.2) lib/enum.ex:3461: Enum.reverse/1
       (elixir 1.11.2) lib/enum.ex:3054: Enum.to_list/1
       test/erlexec_repro_test.exs:18: (test)

Finished in 30.0 seconds
1 test, 1 failure

and it hangs.

With ulimit -n 1024 it works as expected and it does not hang.

chrta commented 3 years ago

From the select man page:

POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the
range of file descriptors that can be specified in a file descriptor set.  The Linux kernel imposes no fixed
limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined as 1024,
and the FD_*() macros operating according to that limit.
To monitor file descriptors greater than 1023, use poll(2) or epoll(7) instead.

So a rewrite to poll seems to be required to handle a lot of open processes.

saleyn commented 3 years ago

Could you check this issue? I believe it's fixed in the last commit.

chrta commented 3 years ago

Great, thank you. Now it works (for me) in fedora 33, erlang 23.2.1 and elixir 1.11.2 with erlexec version 1.18.2.

axelson commented 3 years ago

For reference, this appears to be the referenced commit: https://github.com/saleyn/erlexec/commit/676b68cf86f8978989edf86042c3ec4b8a66304f

It definitely seems to have improved the behavior that I'm seeing. However, I'm still having an issue with my reproduction script. The tasks that I'm starting in the script have started failing with:

09:30:06.365 [error] Task #PID<0.1026.0> started from #PID<0.196.0> terminating
** (stop) exited in: :gen_server.call(:exec, {:port, {{:run, '/home/jason/erlexec_repro/_build/test/lib/erlexec_repro/priv/dfrotz -r lt -r lt -f ansi -w80 -h120 -R /tmp/frotz-596-r5lYK92mN2l4dT83TrWb /home/jason/erlexec_repro/_build/test/lib/erlexec_repro/priv/kitten.z5', [{:cd, '/tmp/frotz-596-r5lYK92mN2l4dT83TrWb'}, :stdout, :stderr, :stdin, :monitor, {:kill_timeout, 30000000}]}, :link, false}}, 30000)
    ** (EXIT) time out
    (stdlib) gen_server.erl:223: :gen_server.call/3
    (erlexec) /home/jason/erlexec_repro/deps/erlexec/src/exec.erl:859: :exec.do_run/2
    test/erlexec_repro_test.exs:11: anonymous fn/1 in ErlexecReproTest."test greets the world"/1
    (elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
    (elixir) lib/task/supervised.ex:35: Task.Supervised.reply/5
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
    Args: [#Function<0.51747327/1 in ErlexecReproTest."test greets the world"/1>, [596]]

The 30_000 timeout in the genserver.call appears to be hard-coded. I might be able to change my test to work-around this limitation, but it would be nice if there was a way to get it to work more or less as is.

saleyn commented 3 years ago

Added the configurable timeout.

axelson commented 3 years ago

Thanks for adding the timeout. Unfortunately I'm still able to reproduce this issue (or perhaps a similar issue). What I see is exec-port is still stuck at 100% CPU, even multiple hours after the beam instance has stopped. I have not yet checked with strace or done additional debugging.

saleyn commented 3 years ago

This likely still has something to do with hitting the limit of the number of open file descriptors. In your sample code when I modify the demo function to:

  def demo do
    :exec.run_link("sleep 1", [
      {:cd, "/"},
      :monitor
    ])
  end

I see no issues. Will need to investigate this further when I have some spare time.

saleyn commented 3 years ago

I slightly modified the test that comes with `erlexec', and now you can configure the concurrency and sleep time of the spawned OS process being tested. Additionally the test checks for errors related to hitting the file descriptor limit:

$ PID_SLEEP_SEC=15 RUN_COUNT=1500 make test
#@OPTIMIZE=0 /opt/bin/rebar3 eunit
===> Verifying dependencies...
===> Analyzing applications...
===> Compiling erlexec
===> Performing EUnit tests...
................
==> Test Concurrency: 1500
F
Failures:

  1) exec:exec_run_many_test_/0:1329
     Failure/Error: ?assertMatch({ ok , [ { io_ops , M } , { success , N } ] }, test_exec : run ( N , 60000 , Delay ))
       expected: = { ok , [ { io_ops , M } , { success , N } ] }
            got: {error,hit_limit_of_max_open_files,
                        [{io_ops,1982},{success,991},{fd_err,509}]}
     %% eunit_proc.erl:561:in `eunit_proc:run_group/2`
     Output: 
     Output: 

Finished in 28.968 seconds
17 tests, 1 failures
===> Error running tests

But irrespective, I cannot reproduce the `exec-port' hanging issue.

If you see the port program being stuck, build it with:

OPTIMIZE=0 make

then run your test case, and when you see the port program being stuck, attach to it with a debugger:

$ pgrep exec-port
1234
$ gdb /path/to/exec-port
attach 1234
^C
bt

Please rerun your code using the latest master.

saleyn commented 3 years ago

Is this still an issue?

axelson commented 3 years ago

I just tested on the latest master and confirmed that I'm not longer able to reproduce a hang in the exec-port process :tada: thanks!

saleyn / erlexec

exec-port stuck at 100% cpu #136