Closed: ThePumpingLemma closed this issue 12 months ago
Hi @kohlschuetter, would it be possible to look into this issue?
Your commands to reproduce may not work on all platforms (e.g., Alpine Linux) and may return the wrong process ("grep" itself on macOS).
I'm trying to verify with
mvn clean compile exec:java -Dexec.mainClass="org.example.Main"
and
sudo apk add lsof
lsof -p $(jps -v | grep jetty-concurrency | cut -f1 -d' ') | grep junix-ingress
The commit you recommend to reproduce does not work for me. When I manually remove the monkey-patched files, everything still works OK on macOS, but I see connection errors on Alpine Linux. The monkey-patched code (assuming it's the same as in the PR) indeed seems to fix that bug, but I'm not seeing the timeout error messages.
> The commit you recommend to reproduce does not work for me. When I manually remove the monkey-patched files, everything still works OK on macOS, but I see connection errors on Alpine Linux. The monkey-patched code (assuming it's the same as in the PR) indeed seems to fix that bug, but I'm not seeing the timeout error messages.
I didn't test on Alpine; @kevink-sq, were you able to reproduce on Linux?
Not sure if macOS version matters in this case:
Darwin redacted.local 23.1.0 Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000 arm64
@ThePumpingLemma :
lsof -p $(ps aux | grep "org.example.Main" | awk '{print $2}' | head -n 1) | grep junix-ingress.sock | wc -l
may actually get the PID of the grep command, which the jps approach prevents. You could also add another grep for "java", and remove the head -n 1 to raise a red flag should there still be more than one line:
lsof -p $(ps aux | grep "org.example.Main" | grep java | awk '{print $2}') | grep junix-ingress.sock | wc -l
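To illustrate why the extra grep matters, here is a small sketch using a fabricated ps-style listing (the PIDs and lines below are made up for demonstration; this is not real ps output):

```shell
# Fabricated `ps aux`-style listing: the Java process plus the grep that is
# searching for it (grep's own command line also contains the pattern).
ps_output='user  101  0.0  java org.example.Main
user  202  0.0  grep org.example.Main'

# A bare grep matches both lines, including grep itself:
echo "$ps_output" | grep "org.example.Main" | wc -l

# Adding a second grep for "java" filters out the grep entry:
echo "$ps_output" | grep "org.example.Main" | grep java | wc -l
```

The first pipeline reports 2 matches, the second only 1, which is why piping through `head -n 1` on the unfiltered output can silently hand you the wrong PID.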
Also beware that lsof on Alpine is busybox-based, so it will actually show all PIDs unless we install the proper binary via sudo apk add lsof.
I've merged the PR, though I'm not really happy that we don't have a proper test case. That said, I can see these errors in curl go away with the fix.
I'm able to reproduce the leaking junix file descriptor on Linux via multipass against this baseline branch (which is the same as the above 575469b3453e7b09d9d60f6460cdab8117b8e773, but with debug logging and other monkey-patch files removed).
Linux info:
Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-87-generic aarch64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Mon Nov 6 09:45:35 PST 2023
System load: 0.1953125
Usage of /: 10.9% of 24.05GB
Memory usage: 8%
Swap usage: 0%
Processes: 127
Users logged in: 1
In one terminal, I started monitoring the fd count of junix-ingress.sock:
while sleep 1; do echo $(lsof -p $(ps aux | grep "org.example.Main" | awk '{print $2}' | head -n 1) | grep junix-ingress.sock | wc -l); done
In another terminal, I started the server against the baseline branch:
mvn compile exec:java -Dexec.mainClass="org.example.Main"
In another terminal, I started calling the junix uds:
seq 1 300000 | xargs -P1500 -I{} curl --unix-socket /tmp/junix-ingress.sock http://localhost/process
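As an aside, on systems where a full lsof isn't available (e.g. stock Alpine before `apk add lsof`), a per-process fd count can be read straight from /proc. A minimal sketch, assuming Linux, using the current shell's PID as a stand-in for the Java PID:

```shell
# Count open file descriptors of a process via /proc (Linux only);
# here we inspect the current shell ($$) instead of the Java process.
pid=$$
fd_count=$(ls "/proc/$pid/fd" | wc -l)
echo "open fds for $pid: $fd_count"
```

Substituting the Java PID and wrapping this in the same `while sleep 1` loop gives an lsof-free monitor that is immune to the busybox quirks mentioned above.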
I could verify that the monitor count started climbing for the junix endpoint, but not for the native endpoint. Screenshots of the above process:
Server started with monitoring. Junix uds not being called yet:
Started calling junix uds
Running against native endpoint does not leak
Repeating the above with the monkey patched fix (baseline branch patched) results in junix behaving the same as native.
Can you please verify with the latest 2.8.3-SNAPSHOT, and all monkey-patched code removed?
Make sure the following section is in your POM:
<repositories>
  <repository>
    <id>snapshots-repo</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
I can verify the fix works against the snapshot with all of the monkey-patched code removed: baseline branch 2.8.3-SNAPSHOT
> I've merged the PR, though I'm not really happy that we don't have a proper test case. That said, I can see these errors in curl go away with the fix.
It's definitely possible to stress test this, so we could make integration tests to that effect.
However, I don't see why we couldn't write the code in such a way that these routines are unit-testable; then we could mock selection keys and see how the cleanup code handles them. I could take a pass at a refactor to facilitate unit testing, if you'd like?
Yes that would be great, thank you!
@kohlschuetter I think I want to approach it by first making some broader fixes before I break it down into unit testable functions. The first thing I want to attack is locking, where I noticed some inconsistencies in the way locks are managed. Here is a PR where I do some renaming followed by lock normalization: https://github.com/kohlschutter/junixsocket/pull/146
I would like to follow this up with reworking how PollFd objects are managed; I think it's a bit hard to follow, and I also think there is redundant code in initPollFd (which is the code we need to unit test).
Before we continue with improvements to AFSelector, please verify that the updated patch (5e8a6e6) resolves all concurrency issues / FileDescriptor leaks for you.
I reverted the original patch attempts because we really introduced the regression in 2.8.2, and even with the latest changes, a TIPC test would fail on Linux.
Please also verify by running the selftest jar on Linux with the TIPC kernel module loaded, and a bearer enabled:
sudo modprobe tipc
sudo tipc bearer enable media eth device eth0
java -ea -jar junixsocket-selftest-2.8.3-SNAPSHOT-jar-with-dependencies.jar
> Before we continue with improvements to AFSelector, please verify that the updated patch (5e8a6e6) resolves all concurrency issues / FileDescriptor leaks for you.
It seems to be working okay on macOS from a preliminary test.
However, this patch doesn't implement the locking in the way that the documentation suggests we should:
> A selector's selected-key set is not, in general, safe for use by multiple concurrent threads. If such a thread might modify the set directly then access should be controlled by synchronizing on the set itself. The iterators returned by the set's {@link java.util.Set#iterator() iterator} methods are fail-fast: If the set is modified after the iterator is created, in any way except by invoking the iterator's own {@link java.util.Iterator#remove() remove} method, then a {@link java.util.ConcurrentModificationException} will be thrown.
IMO, it would make sense to implement things as similarly to Java's standard library as possible, but I'll leave that up to you.
I don't have a Linux test setup handy, maybe @kevink-sq can try to verify the change?
Yeah, it's really odd. When you replace that ConcurrentHashMap with HashMap, jetty throws a ConcurrentModificationException, which means that either two threads are accessing the structure concurrently, or modifications happen after the iterator has been created. Maybe there's something else we're missing?
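For readers following along, the fail-fast behavior at issue can be reproduced in isolation with a plain HashSet (hypothetical key names; this is a minimal sketch, not jetty's actual code):

```java
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Demonstrates that a fail-fast iterator throws once the backing set is
// structurally modified after the iterator was created -- the situation
// jetty hits when the selected-key set changes underneath its iteration.
public class CmeDemo {
    static String clearDuringIteration() {
        Set<String> selectedKeys = new HashSet<>();
        selectedKeys.add("key1");
        selectedKeys.add("key2");
        Iterator<String> it = selectedKeys.iterator();
        selectedKeys.clear(); // modification after iterator creation
        try {
            it.next();
            return "no exception";
        } catch (ConcurrentModificationException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(clearDuringIteration());
    }
}
```

A ConcurrentHashMap-backed set never throws this exception, which is why swapping it for HashMap surfaces the underlying modification pattern.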
So I think it's the .clear() that's throwing off jetty. In the original SelectorImpl, they check if the key already exists (processReadyEvents):
if (selectedKeys.contains(ski)) {
    if (ski.translateAndUpdateReadyOps(rOps)) {
        return 1;
    }
} else {
    ski.translateAndSetReadyOps(rOps);
    if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
        selectedKeys.add(ski);
        return 1;
    }
}
@ThePumpingLemma I was able to remove the second ConcurrentHashMap with 38cda27. Can you please take another look?
(Fixed checkstyle warning)
@kohlschuetter are the latest changes (including 38cda270d3aa7a8d3b3cb1803b2c02e1d33c9101) on 2.8.3-SNAPSHOT?
@kevink-sq Uploaded just now! 20231109.194633-3
Be sure to run mvn with -U to update snapshots.
I reviewed the changes and have some thoughts:

- The locking now involves both this and selectedKeysPublic. My preference would be to keep the locking + data structures simple and try to mirror the reference implementation (i.e. Java standard library) to minimize surprises.
- MapValueSet could be thread-safe, but I'm not convinced without doing a really thorough analysis (I looked pretty closely at the iterator, but still need to chew on the rest). In contrast, it is really easy to look at synchronized blocks and convince yourself that all of the state mutation is under a lock and that there are no races.
- MapValueSet is designed to be very generic (e.g. removedSentinel is optional), but I'm not sure it's worth the additional complexity. If you decide to stick with this approach, consider simplifying where possible.

All in all, I think it is possible to prove your approach safe, but I'm not personally comfortable with it.
MapValueSet wraps the ConcurrentHashMap, so it is as safe as that, and may throw a ConcurrentModificationException only in certain narrow circumstances with iterator(), which is in line with the AFSelector documentation (selectedKeys() sets don't have to be thread-safe). I actually wanted to build this a long time ago, since having a single map for all keys simplifies things quite a bit, so I'm very comfortable with it.
Once we had the ConcurrentHashMap for registered keys, it was clear that simply ignoring the value was a waste. Adding another ConcurrentHashMap resulted in simpler code but higher memory requirements. I'm happy that we got that out of the way.
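The single-map idea can be sketched roughly as follows (class and method names here are illustrative only; this is not the actual MapValueSet implementation, just a simplified model of using the map's value as the "selected" flag):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Rough sketch: one ConcurrentHashMap holds all registered keys, and the
// map value doubles as the "selected" marker, so no second collection is
// allocated in the polling hot path. Sentinel objects stand in for the
// real per-key state.
public class SingleMapSketch {
    private static final Object SELECTED = new Object();
    private static final Object NOT_SELECTED = new Object();

    private final ConcurrentMap<String, Object> registeredKeys = new ConcurrentHashMap<>();

    void register(String key)  { registeredKeys.put(key, NOT_SELECTED); }
    void select(String key)    { registeredKeys.replace(key, SELECTED); }
    void deselect(String key)  { registeredKeys.replace(key, NOT_SELECTED); }

    // The selected-key "set" is derived from the single map on demand.
    Set<String> selectedKeys() {
        Set<String> out = new HashSet<>();
        for (Map.Entry<String, Object> e : registeredKeys.entrySet()) {
            if (e.getValue() == SELECTED) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

The real implementation returns a live view rather than a copy, which is where the iterator subtleties discussed above come in.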
As a next step, could you take a look at creating some unit tests around the selected keys behavior? Once we have good test coverage, we can make more confident optimizations without introducing further regressions.
Thanks for your help, @ThePumpingLemma!
Reran the above tests against the junixsocket-jetty-2.8.3-20231109.194633-3.jar artifact; the concurrency issue and leaking file descriptors are still fixed.
I'm also for @ThePumpingLemma's approach of mirroring the reference implementation in the Java standard library. Issues like the concurrency problem and the file descriptor leaks were resolved by following its examples; straying from them with novel implementations may cause setbacks simply because those implementations haven't been battle-tested as much. Unless there are clear benefits to the approach, I'm not sure what is gained.
Furthermore, JEP 380 implements kqueue/epoll under the hood, which gives it a performance edge over junix. If junix were to follow suit, I think it would be easier to make the changes if references were taken from the Java standard library beforehand.
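For context, the JEP 380 API (Java 16+) that this refers to looks like the following; a minimal round-trip sketch over a temporary socket path:

```java
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal JEP 380 round trip over an AF_UNIX socket: bind a server
// channel, connect a client, write "ping", and read it back.
public class Jep380Demo {
    static String roundTrip() throws Exception {
        Path path = Files.createTempDirectory("jep380").resolve("demo.sock");
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(path);
        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(addr);
            try (SocketChannel client = SocketChannel.open(addr);
                 SocketChannel accepted = server.accept()) {
                client.write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)));
                ByteBuffer buf = ByteBuffer.allocate(16);
                accepted.read(buf);
                buf.flip();
                return StandardCharsets.UTF_8.decode(buf).toString();
            }
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

Note that JEP 380 covers AF_UNIX stream sockets only; it does not address the other socket families (TIPC, VSOCK, etc.) that junixsocket supports.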
Once Java supports all sorts of obscure socket types like AF_UNIX, AF_TIPC, AF_VSOCK, etc., there will probably be little need for junixsocket, which would actually be a good thing.
Until then, junixsocket remains a standalone project, under a different license, so copying code from the standard library just isn't an easy option.
Moreover, unlike commercial endeavors like Square/Cash app, this is my personal project. I'm not being paid to develop it, which means I can deliberately spend some extra time on polishing bits that I would otherwise leave as-is.
I hear your concerns. I don't want to sound harsh. However, please accept my explanation, and decision.
In this instance, removing the second Set is, in my view, the right thing to do, as it removes unnecessary allocation operations in a hot path (polling sockets). It also obviously fixes the issue you initially reported, without introducing any obvious regressions (unlike the earlier attempts).
Since I see you care about improving junixsocket, I hope you continue to contribute by filing tickets and PRs. Abstract, high-level critique as in "not battle tested" or "what about kqueue/epoll" is not very constructive for this present ticket. Please be very specific with your concerns, and, ideally, provide solutions for these concerns. As I said, I'm not being paid for this project, so I have to manage my attention wisely.
If you find that junixsocket is slower/worse than some other approach, please file a bug with some data/code, ideally with a PR that you accept may end up not the way it started.
As I explained earlier, the best way to contribute to the very specific situation of AFSelector is to improve the unit test coverage around selectors. When we have good coverage, we can look into taking more inspiration from the standard library if and only if there is a measurable benefit. With good test coverage, the conversation automatically becomes more detail-oriented, more data-informed, and less opinion-based.
You're very welcome to use JEP-380, just as you're welcome to help improve junixsocket. For what it's worth, I'm seeing higher throughput with junixsocket, compared to JEP-380, and I hope you will find too that junixsocket is the better choice, at least for the foreseeable future.
2.8.3 has just been released, including this fix. Please verify and re-open this ticket if necessary. Thanks again for reporting and working with me on this issue.
Describe the bug
With our PoC at https://github.com/kevink-sq/jetty-concurrency-issue-poc/tree/575469b3453e7b09d9d60f6460cdab8117b8e773, we noticed that there is unbounded growth of file descriptors when running our simple load test, leading to timeouts from curl and timeout exceptions in jetty.

To Reproduce
Steps to reproduce the behavior (assuming macOS):
mvn compile exec:java -Dexec.mainClass="org.example.Main"
seq 1 15000 | xargs -P1500 -I{} curl --unix-socket ./junix-ingress.sock http://localhost/process

Expected behavior
All curl calls should succeed and there should be no timeouts in jetty.

Output/Screenshots
N/A

Please make sure to test the problem still occurs on the latest version of junixsocket
Tested on v2.8.2

Notes
Proposed fix: https://github.com/kohlschutter/junixsocket/pull/144