spring-projects / spring-integration

Spring Integration provides an extension of the Spring programming model to support the well-known Enterprise Integration Patterns (EIP)
http://projects.spring.io/spring-integration/
Apache License 2.0
1.54k stars 1.11k forks source link

spring-integration-sftp leads to sshd-SshClien threads leaks #9373

Closed roma2341 closed 2 months ago

roma2341 commented 2 months ago

In what version(s) of Spring Integration are you seeing this issue? 6.3.2 Describe the bug Recently I got exception: Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached First I checked result of: systemctl status myService It showed me: `Tasks: 4012 (limit: 9385) //usually this number shows me several thousands, I didn't check it at the moment when OOM happened, but I think it was near 9k

Then I made thread dump and saw that almost all threads are sshd e.g sshd-SshClient[96ccf2f]-nio2-thread-2" #13237 [581081] daemon prio=5 os_prio=0 cpu=6.82ms elapsed=19393.97s tid=0x00007ff599190fa0 nid=581081 waiting on condition [0x00007ff4512fe000]

To Reproduce

I can't provide relevant examples now

Expected behavior

spring-integration-sftp thread leaks must not happen.

artembilan commented 2 months ago

Any chances to have more info about those threads? What kind of Spring Integration configuration do you use?

roma2341 commented 2 months ago

I'm not sure if it's safe to provide a thread dump

artembilan commented 2 months ago

So, according to debugging I see that those threads come from the Nio2ServiceFactory:

    public Nio2ServiceFactory(FactoryManager factoryManager, CloseableExecutorService service,
                              CloseableExecutorService resumeTasks) {
        super(factoryManager,
              ThreadUtils.newFixedThreadPoolIf(service, factoryManager.toString() + "-nio2", getNioWorkers(factoryManager)));

And this is that thread pool size:

    public static final Property<Integer> NIO_WORKERS
            = Property.validating(Property.integer("nio-workers", Runtime.getRuntime().availableProcessors() + 1),
                    w -> ValidateUtils.checkTrue(w > 0, "Number of NIO workers must be positive: %d", w));

Now it would be great to see where they are used so we see that dead-lock. I mean according to sshd-core source code that getExecutorService() is used everywhere in the logic.

Eventually it leads me that this NIO2 stuff is used by the SshClient we initialize only once in the DefaultSftpSessionFactory:

    private void doInitClient() throws IOException {
        if (this.port <= 0) {
            this.port = SshConstants.DEFAULT_PORT;
        }

        doInitInnerClient();

        this.sshClient.start();
    }

So, I'm not sure how we can have so many of those threads. Therefore it might be great to have some simple project from you to reproduce.

I'm not sure if it's safe to provide a thread dump

No need the whole one, but at least some stack trace to follow back to Spring code where we indeed can prove that we do something wrong or so.

roma2341 commented 2 months ago

Here are some excerpts: "sshd-SshClient[30df8076]-nio2-thread-1" #3409 [690881] daemon prio=5 os_prio=0 cpu=3.01ms elapsed=3401.81s tid=0x00007fc6ec4ed420 nid=690881 waiting on condition [0x00007fc63d8fe000] "sshd-SshClient[30df8076]-nio2-thread-2" #3410 [690882] daemon prio=5 os_prio=0 cpu=5.03ms elapsed=3401.81s tid=0x00007fc6ec558050 nid=690882 waiting on condition [0x00007fc63d7fe000] "sshd-SshClient[30df8076]-nio2-thread-3" #3411 [690883] daemon prio=5 os_prio=0 cpu=1.23ms elapsed=3401.81s tid=0x00007fc6a0465060 nid=690883 waiting on condition [0x00007fc63d6fe000] "sshd-SshClient[3e363e7f]-timer-thread-1" #3458 [691358] daemon prio=5 os_prio=0 cpu=120.40ms elapsed=2799.49s tid=0x00007fc6a0466a50 nid=691358 waiting on condition [0x00007fc63d4fe000] "sshd-SshClient[3e363e7f]-nio2-resume-thread-1" #3460 [691360] daemon prio=5 os_prio=0 cpu=1.27ms elapsed=2799.48s tid=0x00007fc6a0467280 nid=691360 waiting on condition [0x00007fc63d2fe000] "sshd-SshClient[3e363e7f]-nio2-thread-1" #3461 [691361] daemon prio=5 os_prio=0 cpu=2.07ms elapsed=2799.47s tid=0x00007fc6f8460b60 nid=691361 waiting on condition [0x00007fc63d1fe000] "sshd-SshClient[3e363e7f]-nio2-thread-2" #3462 [691362] daemon prio=5 os_prio=0 cpu=6.42ms elapsed=2799.47s tid=0x00007fc6f8076ee0 nid=691362 waiting on condition [0x00007fc63d0fe000] "sshd-SshClient[3e363e7f]-nio2-thread-3" #3463 [691363] daemon prio=5 os_prio=0 cpu=9.71ms elapsed=2799.47s tid=0x00007fc6f02c1670 nid=691363 waiting on condition [0x00007fc63cffe000] "sshd-SshClient[582a45e1]-timer-thread-1" #3464 [691364] daemon prio=5 os_prio=0 cpu=116.43ms elapsed=2798.74s tid=0x00007fc6a04014b0 nid=691364 waiting on condition [0x00007fc63cefe000] "sshd-SshClient[582a45e1]-nio2-resume-thread-1" #3466 [691366] daemon prio=5 os_prio=0 cpu=1.31ms elapsed=2798.74s tid=0x00007fc6a04024a0 nid=691366 waiting on condition [0x00007fc63ccfe000] "sshd-SshClient[582a45e1]-nio2-thread-1" #3467 [691367] daemon prio=5 os_prio=0 cpu=1.96ms elapsed=2798.73s tid=0x00007fc684fb5f70 nid=691367 waiting on condition [0x00007fc63cbfe000] "sshd-SshClient[582a45e1]-nio2-thread-2" #3468 [691368] daemon prio=5 os_prio=0 cpu=3.65ms elapsed=2798.73s tid=0x00007fc685000720 nid=691368 waiting on condition [0x00007fc63cafe000] "sshd-SshClient[582a45e1]-nio2-thread-3" #3469 [691369] daemon prio=5 os_prio=0 cpu=3.93ms elapsed=2798.73s tid=0x00007fc6885f6a30 nid=691369 waiting on condition [0x00007fc63c9fe000] "sshd-SshClient[3b9abb6e]-timer-thread-1" #3499 [692135] daemon prio=5 os_prio=0 cpu=89.89ms elapsed=2202.45s tid=0x00007fc6ec5cadd0 nid=692135 waiting on condition [0x00007fc63e1fe000] "sshd-SshClient[3b9abb6e]-nio2-resume-thread-1" #3501 [692137] daemon prio=5 os_prio=0 cpu=1.62ms elapsed=2202.43s tid=0x00007fc6ec5159c0 nid=692137 waiting on condition [0x00007fc63c7fe000] "sshd-SshClient[3b9abb6e]-nio2-thread-1" #3502 [692138] daemon prio=5 os_prio=0 cpu=1.87ms elapsed=2202.42s tid=0x00007fc698419710 nid=692138 waiting on condition [0x00007fc63c8fe000] "sshd-SshClient[3b9abb6e]-nio2-thread-2" #3503 [692139] daemon prio=5 os_prio=0 cpu=6.49ms elapsed=2202.42s tid=0x00007fc69841a470 nid=692139 waiting on condition [0x00007fc63c3fe000] "sshd-SshClient[3b9abb6e]-nio2-thread-3" #3504 [692140] daemon prio=5 os_prio=0 cpu=9.78ms elapsed=2202.41s tid=0x00007fc68cd8ee50 nid=692140 waiting on condition [0x00007fc63c2fe000] "sshd-SshClient[3a081d39]-timer-thread-1" #3505 [692141] daemon prio=5 os_prio=0 cpu=87.86ms elapsed=2201.65s tid=0x00007fc6ec513ef0 nid=692141 waiting on condition [0x00007fc63c1fe000] "sshd-SshClient[3a081d39]-nio2-resume-thread-1" #3507 [692143] daemon prio=5 os_prio=0 cpu=1.15ms elapsed=2201.64s tid=0x00007fc6ec5d0080 nid=692143 waiting on condition [0x00007fc63bffe000] "sshd-SshClient[3a081d39]-nio2-thread-1" #3508 [692144] daemon prio=5 os_prio=0 cpu=2.63ms elapsed=2201.64s tid=0x00007fc6f02bfd20 nid=692144 waiting on condition [0x00007fc63befe000] "sshd-SshClient[3a081d39]-nio2-thread-2" #3509 [692145] daemon prio=5 os_prio=0 cpu=4.76ms elapsed=2201.64s tid=0x00007fc6f039b9a0 nid=692145 waiting on condition [0x00007fc63bdfe000] "sshd-SshClient[3a081d39]-nio2-thread-3" #3510 [692146] daemon prio=5 os_prio=0 cpu=1.04ms elapsed=2201.63s tid=0x00007fc6f45bce90 nid=692146 waiting on condition [0x00007fc63bcfe000] "sshd-SshClient[4c141977]-timer-thread-1" #3565 [692650] daemon prio=5 os_prio=0 cpu=60.94ms elapsed=1599.99s tid=0x00007fc6807bf3b0 nid=692650 waiting on condition [0x00007fc63bbfe000] "sshd-SshClient[4c141977]-nio2-resume-thread-1" #3567 [692652] daemon prio=5 os_prio=0 cpu=1.23ms elapsed=1599.93s tid=0x00007fc680e9ee80 nid=692652 waiting on condition [0x00007fc63c4fe000] "sshd-SshClient[4c141977]-nio2-thread-1" #3568 [692653] daemon prio=5 os_prio=0 cpu=1.80ms elapsed=1599.92s tid=0x00007fc6a04d94d0 nid=692653 waiting on condition [0x00007fc63b9fe000] "sshd-SshClient[4c141977]-nio2-thread-2" #3569 [692654] daemon prio=5 os_prio=0 cpu=6.33ms elapsed=1599.92s tid=0x00007fc6a04de630 nid=692654 waiting on condition [0x00007fc63b6fe000] "sshd-SshClient[4c141977]-nio2-thread-3" #3570 [692655] daemon prio=5 os_prio=0 cpu=8.82ms elapsed=1599.91s tid=0x00007fc6945c3a20 nid=692655 waiting on condition [0x00007fc63b7fe000] "sshd-SshClient[4ecd70c2]-timer-thread-1" #3571 [692656] daemon prio=5 os_prio=0 cpu=65.01ms elapsed=1599.14s tid=0x00007fc6801a1850 nid=692656 runnable [0x00007fc63b5fe000] "sshd-SshClient[4ecd70c2]-nio2-resume-thread-1" #3573 [692658] daemon prio=5 os_prio=0 cpu=1.19ms elapsed=1599.13s tid=0x00007fc680eb9350 nid=692658 waiting on condition [0x00007fc63b3fe000] "sshd-SshClient[4ecd70c2]-nio2-thread-1" #3574 [692659] daemon prio=5 os_prio=0 cpu=2.89ms elapsed=1599.13s tid=0x00007fc670b33650 nid=692659 waiting on condition [0x00007fc63b2fe000] "sshd-SshClient[4ecd70c2]-nio2-thread-2" #3575 [692660] daemon prio=5 os_prio=0 cpu=4.90ms elapsed=1599.13s tid=0x00007fc671bf9120 nid=692660 waiting on condition [0x00007fc63b1fe000] "sshd-SshClient[4ecd70c2]-nio2-thread-3" #3576 [692661] daemon prio=5 os_prio=0 cpu=1.10ms elapsed=1599.12s tid=0x00007fc6f82abbe0 nid=692661 waiting on condition [0x00007fc63b0fe000] "sshd-SshClient[3cf2f50]-timer-thread-1" #3619 [693130] daemon prio=5 os_prio=0 cpu=41.07ms elapsed=1002.58s tid=0x00007fc6a00d7030 nid=693130 waiting on condition [0x00007fc63c6fe000] "sshd-SshClient[3cf2f50]-nio2-resume-thread-1" #3621 [693132] daemon prio=5 os_prio=0 cpu=1.75ms elapsed=1002.56s tid=0x00007fc6a037fb00 nid=693132 waiting on condition [0x00007fc63adfe000] "sshd-SshClient[3cf2f50]-nio2-thread-1" #3622 [693133] daemon prio=5 os_prio=0 cpu=2.39ms elapsed=1002.55s tid=0x00007fc671dde000 nid=693133 waiting on condition [0x00007fc63affe000] "sshd-SshClient[3cf2f50]-nio2-thread-2" #3623 [693134] daemon prio=5 os_prio=0 cpu=6.97ms elapsed=1002.55s tid=0x00007fc671d2ae90 nid=693134 waiting on condition [0x00007fc63bafe000] "sshd-SshClient[3cf2f50]-nio2-thread-3" #3624 [693135] daemon prio=5 os_prio=0 cpu=10.36ms elapsed=1002.54s tid=0x00007fc6f82acd90 nid=693135 waiting on condition [0x00007fc63aefe000] "sshd-SshClient[100a02a3]-timer-thread-1" #3625 [693138] daemon prio=5 os_prio=0 cpu=39.63ms elapsed=1001.82s tid=0x00007fc6a0380ca0 nid=693138 waiting on condition [0x00007fc63acfe000] "sshd-SshClient[100a02a3]-nio2-resume-thread-1" #3627 [693140] daemon prio=5 os_prio=0 cpu=1.26ms elapsed=1001.82s tid=0x00007fc6a03a8fb0 nid=693140 waiting on condition [0x00007fc63aafe000] "sshd-SshClient[100a02a3]-nio2-thread-1" #3628 [693141] daemon prio=5 os_prio=0 cpu=3.19ms elapsed=1001.81s tid=0x00007fc6904c27b0 nid=693141 waiting on condition [0x00007fc63a9fe000] "sshd-SshClient[100a02a3]-nio2-thread-2" #3629 [693142] daemon prio=5 os_prio=0 cpu=5.29ms elapsed=1001.81s tid=0x00007fc6904fd9e0 nid=693142 waiting on condition [0x00007fc63a8fe000] "sshd-SshClient[100a02a3]-nio2-thread-3" #3630 [693143] daemon prio=5 os_prio=0 cpu=1.39ms elapsed=1001.81s tid=0x00007fc68514d300 nid=693143 waiting on condition [0x00007fc63a7fe000] "sshd-SshClient[3940b428]-timer-thread-1" #3674 [693936] daemon prio=5 os_prio=0 cpu=17.17ms elapsed=401.09s tid=0x00007fc6983fd9e0 nid=693936 runnable [0x00007fc63a5fe000] "sshd-SshClient[3940b428]-nio2-resume-thread-1" #3676 [693938] daemon prio=5 os_prio=0 cpu=1.52ms elapsed=401.03s tid=0x00007fc698368340 nid=693938 waiting on condition [0x00007fc63a6fe000] "sshd-SshClient[3940b428]-nio2-thread-1" #3677 [693939] daemon prio=5 os_prio=0 cpu=1.63ms elapsed=401.02s tid=0x00007fc68cdd6810 nid=693939 waiting on condition [0x00007fc63a2fe000] "sshd-SshClient[3940b428]-nio2-thread-2" #3678 [693940] daemon prio=5 os_prio=0 cpu=6.40ms elapsed=401.02s tid=0x00007fc68cd90820 nid=693940 waiting on condition [0x00007fc63a1fe000] "sshd-SshClient[3940b428]-nio2-thread-3" #3679 [693941] daemon prio=5 os_prio=0 cpu=9.22ms elapsed=401.01s tid=0x00007fc6904ee540 nid=693941 waiting on condition [0x00007fc63a4fe000] "sshd-SshClient[30922ac9]-timer-thread-1" #3680 [693944] daemon prio=5 os_prio=0 cpu=16.17ms elapsed=400.27s tid=0x00007fc6983fe690 nid=693944 waiting on condition [0x00007fc63a0fe000] "sshd-SshClient[30922ac9]-nio2-resume-thread-1" #3682 [693946] daemon prio=5 os_prio=0 cpu=1.26ms elapsed=400.27s tid=0x00007fc69836a0a0 nid=693946 waiting on condition [0x00007fc639efe000] "sshd-SshClient[30922ac9]-nio2-thread-1" #3683 [693947] daemon prio=5 os_prio=0 cpu=5.23ms elapsed=400.26s tid=0x00007fc6f417d0f0 nid=693947 waiting on condition [0x00007fc639dfe000] "sshd-SshClient[30922ac9]-nio2-thread-2" #3684 [693948] daemon prio=5 os_prio=0 cpu=1.20ms elapsed=400.26s tid=0x00007fc6f416cdd0 nid=693948 waiting on condition [0x00007fc639cfe000] "sshd-SshClient[30922ac9]-nio2-thread-3" #3685 [693949] daemon prio=5 os_prio=0 cpu=2.67ms elapsed=400.25s tid=0x00007fc6e830af20 nid=693949 waiting on condition [0x00007fc639bfe000]

roma2341 commented 2 months ago

`"sshd-SshClient[100a02a3]-timer-thread-1" #3625 [693138] daemon prio=5 os_prio=0 cpu=39.63ms elapsed=1001.82s tid=0x00007fc6a0380ca0 nid=693138 waiting on condition [0x00007fc63acfe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"Thread-147" #3626 [693139] daemon prio=5 os_prio=0 cpu=1.57ms elapsed=1001.82s tid=0x00007fc6a037edd0 nid=693139 runnable [0x00007fc63abfe000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPoll.wait(java.base@21.0.4/Native Method) at sun.nio.ch.EPollPort$EventHandlerTask.poll(java.base@21.0.4/EPollPort.java:199) at sun.nio.ch.EPollPort$EventHandlerTask.run(java.base@21.0.4/EPollPort.java:280) at java.lang.Thread.runWith(java.base@21.0.4/Thread.java:1596) at java.lang.Thread.run(java.base@21.0.4/Thread.java:1583)

"sshd-SshClient[100a02a3]-nio2-resume-thread-1" #3627 [693140] daemon prio=5 os_prio=0 cpu=1.26ms elapsed=1001.82s tid=0x00007fc6a03a8fb0 nid=693140 waiting on condition [0x00007fc63aafe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[100a02a3]-nio2-thread-1" #3628 [693141] daemon prio=5 os_prio=0 cpu=3.19ms elapsed=1001.81s tid=0x00007fc6904c27b0 nid=693141 waiting on condition [0x00007fc63a9fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[100a02a3]-nio2-thread-2" #3629 [693142] daemon prio=5 os_prio=0 cpu=5.29ms elapsed=1001.81s tid=0x00007fc6904fd9e0 nid=693142 waiting on condition [0x00007fc63a8fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[100a02a3]-nio2-thread-3" #3630 [693143] daemon prio=5 os_prio=0 cpu=1.39ms elapsed=1001.81s tid=0x00007fc68514d300 nid=693143 waiting on condition [0x00007fc63a7fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[3940b428]-timer-thread-1" #3674 [693936] daemon prio=5 os_prio=0 cpu=17.17ms elapsed=401.09s tid=0x00007fc6983fd9e0 nid=693936 runnable [0x00007fc63a5fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"Thread-148" #3675 [693937] daemon prio=5 os_prio=0 cpu=1.45ms elapsed=401.09s tid=0x00007fc6984132b0 nid=693937 runnable [0x00007fc63fafe000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPoll.wait(java.base@21.0.4/Native Method) at sun.nio.ch.EPollPort$EventHandlerTask.poll(java.base@21.0.4/EPollPort.java:199) at sun.nio.ch.EPollPort$EventHandlerTask.run(java.base@21.0.4/EPollPort.java:280) at java.lang.Thread.runWith(java.base@21.0.4/Thread.java:1596) at java.lang.Thread.run(java.base@21.0.4/Thread.java:1583)

"sshd-SshClient[3940b428]-nio2-resume-thread-1" #3676 [693938] daemon prio=5 os_prio=0 cpu=1.52ms elapsed=401.03s tid=0x00007fc698368340 nid=693938 waiting on condition [0x00007fc63a6fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[3940b428]-nio2-thread-1" #3677 [693939] daemon prio=5 os_prio=0 cpu=1.63ms elapsed=401.02s tid=0x00007fc68cdd6810 nid=693939 waiting on condition [0x00007fc63a2fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[3940b428]-nio2-thread-2" #3678 [693940] daemon prio=5 os_prio=0 cpu=6.40ms elapsed=401.02s tid=0x00007fc68cd90820 nid=693940 waiting on condition [0x00007fc63a1fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[3940b428]-nio2-thread-3" #3679 [693941] daemon prio=5 os_prio=0 cpu=9.22ms elapsed=401.01s tid=0x00007fc6904ee540 nid=693941 waiting on condition [0x00007fc63a4fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[30922ac9]-timer-thread-1" #3680 [693944] daemon prio=5 os_prio=0 cpu=16.17ms elapsed=400.27s tid=0x00007fc6983fe690 nid=693944 waiting on condition [0x00007fc63a0fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"Thread-149" #3681 [693945] daemon prio=5 os_prio=0 cpu=1.33ms elapsed=400.27s tid=0x00007fc6983c6860 nid=693945 runnable [0x00007fc639ffe000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPoll.wait(java.base@21.0.4/Native Method) at sun.nio.ch.EPollPort$EventHandlerTask.poll(java.base@21.0.4/EPollPort.java:199) at sun.nio.ch.EPollPort$EventHandlerTask.run(java.base@21.0.4/EPollPort.java:280) at java.lang.Thread.runWith(java.base@21.0.4/Thread.java:1596) at java.lang.Thread.run(java.base@21.0.4/Thread.java:1583)

"sshd-SshClient[30922ac9]-nio2-resume-thread-1" #3682 [693946] daemon prio=5 os_prio=0 cpu=1.26ms elapsed=400.27s tid=0x00007fc69836a0a0 nid=693946 waiting on condition [0x00007fc639efe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[30922ac9]-nio2-thread-1" #3683 [693947] daemon prio=5 os_prio=0 cpu=5.23ms elapsed=400.26s tid=0x00007fc6f417d0f0 nid=693947 waiting on condition [0x00007fc639dfe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[30922ac9]-nio2-thread-2" #3684 [693948] daemon prio=5 os_prio=0 cpu=1.20ms elapsed=400.26s tid=0x00007fc6f416cdd0 nid=693948 waiting on condition [0x00007fc639cfe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"sshd-SshClient[30922ac9]-nio2-thread-3" #3685 [693949] daemon prio=5 os_prio=0 cpu=2.67ms elapsed=400.25s tid=0x00007fc6e830af20 nid=693949 waiting on condition [0x00007fc639bfe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-751" #3707 [694143] prio=5 os_prio=0 cpu=3.34ms elapsed=170.26s tid=0x00007fc688582550 nid=694143 waiting on condition [0x00007fc6429fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-752" #3708 [694144] prio=5 os_prio=0 cpu=7.25ms elapsed=170.20s tid=0x00007fc688582c60 nid=694144 waiting on condition [0x00007fc6485fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-753" #3709 [694145] prio=5 os_prio=0 cpu=5.08ms elapsed=170.11s tid=0x00007fc6884775b0 nid=694145 waiting on condition [0x00007fc648bfe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-754" #3710 [694146] prio=5 os_prio=0 cpu=6.06ms elapsed=170.07s tid=0x00007fc6885ba460 nid=694146 waiting on condition [0x00007fc648dfe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-755" #3711 [694147] prio=5 os_prio=0 cpu=1.25ms elapsed=170.07s tid=0x00007fc6804e1e60 nid=694147 waiting on condition [0x00007fc65d2fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-756" #3712 [694148] prio=5 os_prio=0 cpu=1.20ms elapsed=170.06s tid=0x00007fc6885bb350 nid=694148 waiting on condition [0x00007fc63a3fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-757" #3713 [694151] prio=5 os_prio=0 cpu=3.89ms elapsed=166.01s tid=0x00007fc6884759b0 nid=694151 waiting on condition [0x00007fc6443fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"task-758" #3714 [694152] prio=5 os_prio=0 cpu=4.50ms elapsed=166.00s tid=0x00007fc6812f4e10 nid=694152 waiting on condition [0x00007fc63e2fe000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)

"GC Thread#1" os_prio=0 cpu=11063.12ms elapsed=41099.28s tid=0x00007fc694006310 nid=657467 runnable

"VM Thread" os_prio=0 cpu=3038.21ms elapsed=41101.83s tid=0x00007fc71c29e010 nid=657428 runnable

"VM Periodic Task Thread" os_prio=0 cpu=26401.58ms elapsed=41101.84s tid=0x00007fc71c284fe0 nid=657427 waiting on condition

"G1 Service" os_prio=0 cpu=1779.66ms elapsed=41101.88s tid=0x00007fc71c2744c0 nid=657426 runnable

"G1 Refine#0" os_prio=0 cpu=15093.16ms elapsed=41101.88s tid=0x00007fc71c2734f0 nid=657425 runnable

"G1 Conc#0" os_prio=0 cpu=56208.98ms elapsed=41101.88s tid=0x00007fc71c0a12c0 nid=657424 runnable

"G1 Main Marker" os_prio=0 cpu=94.64ms elapsed=41101.89s tid=0x00007fc71c0a0300 nid=657423 runnable

"GC Thread#0" os_prio=0 cpu=11148.85ms elapsed=41101.89s tid=0x00007fc71c095020 nid=657422 runnable `

roma2341 commented 2 months ago

Maybe @Scheduled can be a reason, i often had issues with different libraries e.g google libraries when calling in @Scheduled.

artembilan commented 2 months ago

You thread dump shows several SshClient instances. Is that expected?

I don't see any dead-locks in your dump. Mostly they are in parking state:

java.lang.Thread.State: TIMED_WAITING (parking)

apparently waiting for tasks to arrive into thread pool queue. Some of them are timed, others not:

            // Are workers subject to culling?
            boolean timed = allowCoreThreadTimeOut || wc > corePoolSize;
...
                Runnable r = timed ?
                    workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) :
                    workQueue.take();

So, apparently these threads are not guilty (and even not suffering). You probably just really use too much memory.

All of these are just my assumptions. So, I'd like to learn more about your application and help you to figure out what and where should be fixed (or just adjusted).

roma2341 commented 2 months ago

I'll try to create a project over the weekend that reproduces this problem. We have @Scheduled that download files from remote SFTP server and process them. Our application uses SftpRemoteFileTemplate.list() to get a list of files in a remote folder. and then we download this files using SftpRemoteFileTemplate.get(). And then delete them using SftpRemoteFileTemplate.remove(). Each @Scheduled call creates new SftpRemoteFileTemplate object and new DefaultSftpSessionFactory() so I think that it's normal that there are many SshClient instances, but old ones have to be destroyed automatically, isn't it ?

roma2341 commented 2 months ago

I don't understand why these threads aren't released, I use DefaultSftpSessionFactory, not CachingSessionFactory.

artembilan commented 2 months ago

If you do new DefaultSftpSessionFactory() manually (perhaps there is a reason), that's where all those SshClient instances are created with their own thread pools. But if that (a DefaultSftpSessionFactory) is not a bean, then you have to call its destroy() manually as well. They are destroyed automatically only by Spring's ApplicationContext when it is closed.

As an alternative you may take a look into a shared SshClient:

    /**
     * Instantiate based on the provided {@link SshClient}, e.g. some extension for HTTP/SOCKS.
     * @param sshClient the {@link SshClient} instance.
     * @param isSharedSession true if the session is to be shared.
     */
    public DefaultSftpSessionFactory(SshClient sshClient, boolean isSharedSession) {
roma2341 commented 2 months ago

Thank you very much, I will try to use a single DefaultSftpSessionFactory instance, didn't know about this nuance. And didn't know that DefaultSftpSessionFactory is thread-safe, this behavior is not documented and how people can know that they have to destroy something, I spent several hours searching destroy/delete/release methods everywhere, but it's hard to find if you don't know what to look.

artembilan commented 2 months ago

You use Spring library, so it is expected that everything should be based on the Spring dependency injection container. Therefore in most cases the component you are going to use supposed to be registered into an ApplicationContenxt as a bean. And of course used from that context. If it is not state otherwise, it is assumed that Spring beans are singletons and therefore thread-safe. Another assumption we suggestion: if you use Spring library, then you are familiar what is Spring Framework and how it works and how can be used. The mentioned DefaultSftpSessionFactory implements DisposableBean which is a part of ApplicationContenxt lifecycle. Plus it is really a factory pattern implementation, so by definition of that pattern it is thread-safe.

roma2341 commented 2 months ago

I initially focused on the SftpRemoteFileTemplate without considering its connection to the Spring context. Moving forward, I'll make sure to check dependencies as well. I assumed that the file template would automatically handle resource cleanup and checked if it had methods for releasing resources. However, now I discovered that I needed to manually pass the factory as a bean or destroy it manually. I didn't expect this because I thought an SFTP connection would require multiple instances to connect to different servers depending on the requirements, and therefore wouldn't be used as a bean. I also can't assume that something is thread-safe based on its relation to spring. I had threading issues too often and now don't use classes that aren't marked as thread-safe in comments. I hope this ticket will help other people to not make my mistake. Thank you again!

artembilan commented 2 months ago

The SftpRemoteFileTemplate is also supposed to be a bean. And it is also thread-safe. As I said: in general in Spring if in class Javadoc we don't say otherwise, the class is thread-safe. Just because most of beans managed by the dependency injection container are singletons. So, they are created only once per application lifecycle. And therefore that leads us to an assumption that those beans are thread-safe. I suggestion you to take a look into Spring Framework docs for more info: https://docs.spring.io/spring-framework/reference/core/beans.html Spring Integration is fully based on the Framework foundation and almost everything assumes that it comes from the dependency injection container.

Yes, SftpRemoteFileTemplate logic is based on the SftpSession coming from the mentioned factory and this one is not thread-safe and therefore we don't provide a top-level API to deal with that class. That's where this template comes to the stage: to be thread-safe and handle those sessions for us during its operations.

Can we agree that there is nothing to do to from this project perspective and close the issue? I'm sorry to hear that you have had so hard experience with these classes. We will think with the team what can be done to avoid such a confusion in the future.

Thank you for your patience!

roma2341 commented 2 months ago

I checked again, issue was solved after I added destroy() call. It's also interesting that DefaultFtpSessionFactory doesn't need to be destroyed.

artembilan commented 2 months ago

Would you mind to clarify what you mean with the doesn't need to be destroyed ? You probably mean something outside of the mentioned Spring's destroy() hook. But then Java does not have such a hook and object is garbage-collected when reference to the object is lost.

artembilan commented 2 months ago

We added some paragraph into doc to explain the situation like this.

roma2341 commented 2 months ago

Would you mind to clarify what you mean with the doesn't need to be destroyed ? You probably mean something outside of the mentioned Spring's destroy() hook. But then Java does not have such a hook and the object is garbage-collected when reference to the object is lost.

I mean inconsistency: DefaultFtpSessionFactory - not managed by Spring (and I don't have to call destroy() on it) DefaultSftpSessionFactory - managed by Spring(and I have to call destroy() on it if I don't want to make it a bean)

public class DefaultSftpSessionFactory implements SessionFactory<SftpClient.DirEntry>, SharedSessionCapable, DisposableBean

public class DefaultFtpSessionFactory extends AbstractFtpSessionFactory<FTPClient>

I think this inconsistency is the main reason why we made this mistake, to add sftp support we just copy-pasted already implemented ftp logic, but instead of spring-integration-ftp we used spring-integration-sftp and replaced appropriated classes. As a programmer, we often don't look at class implementations and it's easy to skip that some factories need to be destroyed

artembilan commented 2 months ago

Yeah... That's just coincident that FTPClient is not thread-safe and don't manage any state in the DefaultFtpSessionFactory, so no need to clean it up in the end. "Managed" and " non-managed by Spring" is your choice. We recommend to use instances of those classes from the application context, even if there is no any lifecycle hooks in the class. Plus, it is really better to deal with singletons.

roma2341 commented 2 months ago

Yeah... That's just coincident that FTPClient is not thread-safe and don't manage any state in the DefaultFtpSessionFactory, so no need to clean it up in the end. "Managed" and " non-managed by Spring" is your choice. We recommend to use instances of those classes from the application context, even if there is no any lifecycle hooks in the class. Plus, it is really better to deal with singletons.

I always try to do it in the Spring way, but DefaultSftpSessionFactory contains details of one host, but we need to use different hosts based on yaml files settings that can be changed when the application running.

artembilan commented 2 months ago

Sounds like you need a @RefreshScope: https://docs.spring.io/spring-cloud-commons/reference/spring-cloud-commons/application-context-services.html#refresh-scope. So, it is still kinda singleton, but can be recreated when configuration is changed.