shadowsocks / shadowsocks-android

A shadowsocks client for Android

'Too many open files' on Android 6.0-8.1 #2796

Closed imReker closed 1 year ago

imReker commented 3 years ago

LocalDnsWorker.accept throws Broken pipe when UDP is filtered or the network is disconnected. If DNS queries keep arriving after that, the number of Unix socket handles in the VpnService process exceeds the handle limit, which is 1024 (32768 on Android 9.0 and newer), so the VpnService process ends up throwing Too many open files and Bad file descriptor everywhere. Meanwhile, because the UDP DNS query on the Java side times out, sslocal sends a TCP DNS query over a 'Java protected' socket, which creates the same number of socket handles in sslocal. (Why does sslocal make a TCP query again?) As a result, both VpnService and sslocal crash at random times.

Logs: org.shadowsocks.xx_issue_19bc73ad993aad4d5fe278892d584231_error_session_61279182004D00013C85A04AC568A81B_DNE_5_v2.log org.shadowsocks.xx_issue_274bc2d242720049275714683d3d4cc5_error_session_6127B31401C900010D2CF9C39D05D8E2_DNE_0_v2.log org.shadowsocks.xx_issue_2d5e1ddcbf72ff6f25953b540bd48ff5_error_session_6127C4D9026F00011AC0A04AC568A81B_DNE_0_v2.log

imReker commented 3 years ago

I dumped the file descriptor info. A single DNS query from an app makes shadowsocks create at least 2 handles in VpnService: the first is the local_dns_path socket for the UDP query, and the second is the protect_path socket for the TCP query.

fd list size = 928

fd list- 1bf: SOCK: socket:[23488393] UNIX / -- /
fd list- 1c0: SOCK: socket:[23472492] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1c2: SOCK: socket:[23490568] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1c3: SOCK: socket:[23478133] UNIX / -- /
fd list- 1c4: SOCK: socket:[23466596] UNIX / -- /
fd list- 1c6: SOCK: socket:[23478364] UNIX / -- /
fd list- 1c7: SOCK: socket:[23466600] UNIX / -- /
fd list- 1c8: SOCK: socket:[23468862] UNIX / -- /
fd list- 1c9: SOCK: socket:[23490569] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1ca: SOCK: socket:[23466603] UNIX / -- /
fd list- 1cb: SOCK: socket:[23478366] UNIX / -- /
fd list- 1cc: SOCK: socket:[23488399] UNIX / -- /
fd list- 1ce: SOCK: socket:[23466607] UNIX / -- /
fd list- 1cf: SOCK: socket:[23484823] UNIX / -- /
fd list- 1d1: SOCK: socket:[23476731] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1d2: SOCK: socket:[23490570] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1d3: SOCK: socket:[23467117] UNIX / -- /
fd list- 1d5: SOCK: socket:[23476736] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1d6: SOCK: socket:[23488418] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1d7: SOCK: socket:[23480411] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1d8: SOCK: socket:[23479978] UNIX / -- /
fd list- 1d9: SOCK: socket:[23461630] UNIX / -- /
fd list- 1da: SOCK: socket:[23467746] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1db: SOCK: socket:[23479988] UNIX / -- /
fd list- 1dc: SOCK: socket:[23467121] UNIX / -- /
fd list- 1dd: SOCK: socket:[23467126] UNIX / -- /
fd list- 1de: SOCK: socket:[23467748] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1df: SOCK: socket:[23480413] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e0: SOCK: socket:[23466611] UNIX / -- /
fd list- 1e1: SOCK: socket:[23479119] UNIX / -- /
fd list- 1e2: SOCK: socket:[23480797] UNIX / -- /
fd list- 1e3: SOCK: socket:[23478626] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e4: SOCK: socket:[23480440] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e5: SOCK: socket:[23466617] UNIX / -- /
fd list- 1e6: SOCK: socket:[23477117] UNIX / -- /
fd list- 1e7: SOCK: socket:[23432212] UNIX / -- /
fd list- 1e8: SOCK: socket:[23480445] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e9: SOCK: socket:[23480447] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
.................
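
To tally a dump like the one above by socket path, a short script can help. This is a minimal sketch in Python, assuming every entry follows the `fd list- <hex fd>: SOCK: socket:[inode] UNIX <local path> -- <peer path>` layout shown above; the name `tally_fd_dump` is made up for illustration and is not part of shadowsocks.

```python
import re
from collections import Counter

# Matches entries such as:
# fd list- 1c0: SOCK: socket:[23472492] UNIX /data/.../local_dns_path -- /dev/socket/dnsproxyd
LINE_RE = re.compile(
    r"^fd list- (?P<fd>[0-9a-f]+): SOCK: socket:\[(?P<inode>\d+)\] "
    r"UNIX (?P<local>\S+) -- (?P<peer>\S+)$"
)


def tally_fd_dump(text):
    """Count UNIX socket fds in a dump, grouped by the last component of the local path."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip the size header, blank lines and the elision marker
        local = match.group("local")
        # A bare "/" has an empty last component, so report it as unnamed.
        name = local.rsplit("/", 1)[-1] or "(unnamed)"
        counts[name] += 1
    return counts
```

Run over the 38 entries visible in the excerpt above, this would report 23 unnamed sockets, 11 `local_dns_path` sockets and 4 `protect_path` sockets.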

Mygod commented 3 years ago

These seem normal. These fds are not closed? Does your server connection work properly?

imReker commented 3 years ago

> These seem normal. These fds are not closed? Does your server connection work properly?

Most of them are eventually closed, unless the process crashes first with Too many open files. In a bad network environment, the UDP packet loss rate can be very high, and that is what triggers this issue.

The key point of this issue is the different timeouts on the Java and Rust sides: LocalDnsWorker uses Java's getAllByName, whose timeout is defined by the system, usually 90 seconds. On the Rust side, the timeout is 5 seconds. So when the network is very slow or UDP is filtered, a DNS query sent to the Java side waits on system I/O until the 90s timeout, while the Rust side fails after 5s and returns the failure to the app that made the DNS request. The app then requests again; if it retries without any interval or retry limit, LocalDnsWorker.accept will create thousands of socket FDs within those 90 seconds.
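
A rough back-of-the-envelope model of this mismatch, as a sketch only. It assumes a fixed number of lookups are in flight at once, that each one is retried immediately when the Rust side gives up, and that every attempt holds its fds on the Java side until the system timeout fires. The function name, the concurrency figures and the 2-fds-per-query default are illustrative assumptions; the 5s and 90s values are the ones discussed here.

```python
def pending_java_fds(concurrent_lookups, rust_timeout_s=5, java_timeout_s=90, fds_per_query=2):
    """Estimate fds held on the Java side once retries reach a steady state.

    Each in-flight lookup is retried every rust_timeout_s seconds, while every
    attempt keeps its fds for java_timeout_s seconds, so the number of attempts
    outstanding at once is (retry rate) * (attempt lifetime).
    """
    return concurrent_lookups * java_timeout_s * fds_per_query // rust_timeout_s


for lookups in (1, 10, 30):
    print(lookups, pending_java_fds(lookups))
# prints:
# 1 36
# 10 360
# 30 1080
```

Under these assumptions each concurrent lookup pins about 36 fds, so roughly 29 lookups retrying in a tight loop would already be enough to pass the 1024 limit.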

But I still don't know why the protect_path sockets are leaked too.

Mygod commented 3 years ago

It is technically not a leak if they are eventually closed? Although I am down to tweak timeouts. Where did you find the 90s timeout?

imReker commented 3 years ago

The 90s timeout is an empirical value taken from the logs; it's not exact. It's not a traditional 'leak', but I think it is still an issue because of the mismatched timeouts and the thousands of DNS retries they cause. Maybe 'denial of service' is more accurate? For now, to work around this issue, I added a counter in LocalDnsWorker.accept: when more than 200 DNS queries are pending, accept just returns an empty response to sslocal (this limit could be enforced in sslocal as well). I think the correct way to fix this issue is to replace getAllByName with dnsjava, which can set a timeout on a query. But we would need to modify it so that the caller can set a Network for it to create sockets on.
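
The counter workaround described above could look roughly like this. It is a sketch in Python rather than the Kotlin of LocalDnsWorker, and `PendingLimiter`, `handle_query`, `resolve` and `EMPTY_RESPONSE` are hypothetical names for illustration; only the limit of 200 comes from the comment above.

```python
import threading


class PendingLimiter:
    """Tracks in-flight queries and refuses new ones once a fixed cap is reached."""

    def __init__(self, max_pending=200):
        self._max_pending = max_pending
        self._pending = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Reserve a slot; return False when the cap is already reached."""
        with self._lock:
            if self._pending >= self._max_pending:
                return False
            self._pending += 1
            return True

    def release(self):
        """Give back a slot previously reserved with try_acquire."""
        with self._lock:
            self._pending -= 1

    @property
    def pending(self):
        with self._lock:
            return self._pending


# Placeholder for whatever an "empty response" looks like on the wire (assumption).
EMPTY_RESPONSE = b""


def handle_query(limiter, query, resolve):
    """Resolve a query unless too many are pending; in that case answer empty at once."""
    if not limiter.try_acquire():
        return EMPTY_RESPONSE
    try:
        return resolve(query)
    finally:
        limiter.release()
```

Rejecting early keeps the fd count bounded by the cap, at the cost of failing some lookups that might have succeeded; it does not shorten the slow lookups themselves.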

Mygod commented 3 years ago

Sounds good. I will take a look sometime.

Does this issue go away if you use the "All" Route?

imReker commented 3 years ago

Currently I use an ACL route with Bypass LAN. I don't think this issue exists with the 'All' route, since DNS queries are not passed to the Java side (so no extra FDs are created) and the query has a 5s timeout.


Also, maybe Unix socket connection reuse (ref #2751) is still needed? Because Rust makes 2-3 DNS queries per connection, there is still a small chance of creating over 1000 FDs before the 5s timeout.

imReker commented 3 years ago

@Mygod I modified a bit of dnsjava's code (mainly Network-related work and Java 8/Android adaptation) and it works! The only downside is that dnsjava 3.4.1 doesn't support Android 6.x because of Java NIO. (Older versions support Android 6.x, but they use blocking sockets, so they may still cause the same issue.) I'll run a stress test again tomorrow.

Mygod commented 1 year ago

Closing as Android versions too old.