quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Host system UDP malfunctions #1216

Closed Archieeeeee closed 2 years ago

Archieeeeee commented 3 years ago

Hello, I have several app instances (written with quinn) running on my server. The apps worked fine at first; I picked one instance to handle jobs and left the others idle. The issue appeared hours later: 1) none of the instances can be connected to; every connection attempt times out 2) the app handling jobs uses 20% of the system memory

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                       
 9604 root      20   0  407348 205836   6648 S   0.0 20.4   1:26.61 quicserver 

3) my server also seems to have a DNS issue. For example, I cannot download files from GitHub because the hostname 'github-releases.githubusercontent.com' cannot be resolved. I have tried several other hostnames: some work fine (google.com), but some do not (mirrors.fedoraproject.org). I am not sure whether this DNS issue already existed before the problem described here:

#yum update -y
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Could not get metalink https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64&infra=stock&content=centos error was
14: curl#6 - "Could not resolve host: mirrors.fedoraproject.org; Unknown error"
 * base: mirrors.cat.pdx.edu
 * elrepo: elrepo.org
 * epel: mirrors.nipa.cloud
 * extras: mirror.sfo12.us.leaseweb.net
 * updates: mirrors.xtom.com
https://download.copr.fedorainfracloud.org/results/ibotty/prometheus-exporters/epel-7-x86_64/repodata/repomd.xml: [Errno 14] curl#6 - "Could not resolve host: download.copr.fedorainfracloud.org; Unknown error"
Trying other mirror.
https://download.copr.fedorainfracloud.org/results/librehat/shadowsocks/epel-7-x86_64/repodata/repomd.xml: [Errno 14] curl#6 - "Could not resolve host: download.copr.fedorainfracloud.org; Unknown error"
Trying other mirror.
No packages marked for update

But if I check the hostname directly, it resolves:

#  host download.copr.fedorainfracloud.org
download.copr.fedorainfracloud.org is an alias for d1nld9ovj32u75.cloudfront.net.
d1nld9ovj32u75.cloudfront.net has address 65.8.158.75
d1nld9ovj32u75.cloudfront.net has address 65.8.158.79
d1nld9ovj32u75.cloudfront.net has address 65.8.158.28
d1nld9ovj32u75.cloudfront.net has address 65.8.158.27
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:400:4:bbc1:1840:93a1
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:2e00:4:bbc1:1840:93a1
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:ae00:4:bbc1:1840:93a1
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:9400:4:bbc1:1840:93a1
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:f600:4:bbc1:1840:93a1
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:7c00:4:bbc1:1840:93a1
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:d400:4:bbc1:1840:93a1
d1nld9ovj32u75.cloudfront.net has IPv6 address 2600:9000:2146:9200:4:bbc1:1840:93a1
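
One possible reason for the discrepancy above: `host` sends DNS queries itself, while yum and curl typically go through the system resolver (`getaddrinfo`), which also depends on settings such as `/etc/nsswitch.conf`. The sketch below is only a suggestion for where to look, not a diagnosis. It exercises the system resolver path from Python; `resolve_via_libc` is a hypothetical helper name, and port 443 is an arbitrary choice.

```python
import socket


def resolve_via_libc(hostname):
    """Resolve a name through getaddrinfo, the path most applications use.

    Returns (sorted list of addresses, error message or None).
    """
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # The resolver's own error text is what matters when comparing with `host`.
        return [], str(exc)
    # info[4] is the socket address tuple; its first element is the IP string.
    return sorted({info[4][0] for info in infos}), None
```

Calling `resolve_via_libc("download.copr.fedorainfracloud.org")` on the affected machine and comparing the result with the `host` output would show whether only the system resolver path is failing.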

Then I tried to debug the issue. First I checked the server state: load is very low (0.1), CPU usage is 1%, memory usage is 20%, and disk usage is okay too. Then I checked the UDP state:

#cat /proc/net/udp   
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
   19: 00000000:698A 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17803236 2 ffff9ee4cd7f1980 95     
   20: 00000000:698B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17805492 2 ffff9ee4c81eb740 40     
   21: 00000000:698C 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17805843 2 ffff9ee4c81e9100 90     
   22: 00000000:698D 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17806188 2 ffff9ee4c81e8440 600    
   23: 00000000:698E 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17805104 2 ffff9ee4cd7f3740 972  
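
The ports and queue sizes in the table above are hexadecimal, which makes it hard to read at a glance. As a minimal sketch (the helper name `parse_udp_row` is hypothetical, and the column positions are assumed to match the header shown above), the rows can be decoded like this:

```python
SAMPLE = """
   19: 00000000:698A 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17803236 2 ffff9ee4cd7f1980 95
   20: 00000000:698B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17805492 2 ffff9ee4c81eb740 40
   21: 00000000:698C 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17805843 2 ffff9ee4c81e9100 90
   22: 00000000:698D 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17806188 2 ffff9ee4c81e8440 600
   23: 00000000:698E 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 17805104 2 ffff9ee4cd7f3740 972
"""


def parse_udp_row(line):
    """Decode one data row of /proc/net/udp into a small dict."""
    fields = line.split()
    _, local_port_hex = fields[1].split(":")   # local_address is ip:port in hex
    tx_hex, rx_hex = fields[4].split(":")      # tx_queue:rx_queue in hex
    return {
        "local_port": int(local_port_hex, 16),
        "tx_queue": int(tx_hex, 16),
        "rx_queue": int(rx_hex, 16),
        "inode": int(fields[9]),
        "drops": int(fields[12]),              # last column, decimal
    }


rows = [parse_udp_row(line) for line in SAMPLE.strip().splitlines()]
for row in rows:
    print(row["local_port"], row["rx_queue"], row["drops"])
print("total drops:", sum(row["drops"] for row in rows))  # total drops: 1797
```

Decoded this way, the five sockets listen on ports 27018 to 27022, their send and receive queues are empty in this snapshot, and every socket has a non-zero drop counter.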

Network memory settings:

net.core.rmem_default = 104857600
net.core.rmem_max = 104857600
net.core.wmem_default = 104857600
net.core.wmem_max = 104857600
net.ipv4.tcp_rmem = 10240       87380   12582912
net.ipv4.tcp_wmem = 10240       87380   12582912
net.ipv4.udp_mem = 21639        28853   43278
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096

I'm wondering how one app can affect the other apps; maybe it could cause a system-wide issue. I'm quite a newbie here and I need suggestions, thanks.

Ralith commented 3 years ago

Quinn is not involved with, and cannot interfere with, DNS. That is likely a system configuration or network issue.

If connection attempts are failing, please share trace-level logs and packet captures from both sides.

Ralith commented 3 years ago

Were you able to find any more information? If both QUIC and DNS are having trouble you may simply have something wrong with UDP or networking in general, on that machine.

Matthias247 commented 3 years ago

Running quinn with trace logging enabled for a while, and checking whether packets are still received and processed once the bad state has been entered, would help narrow this down.

Likely unrelated, but:

net.core.rmem_default = 104857600
net.core.wmem_default = 104857600

100 MB socket buffers as the default? That seems at least an order of magnitude too large. It will mostly lead to high queuing latency rather than good performance: when the queues fill up, the software stack just ends up processing stale packets.

Maybe the QUIC stack has eaten all OS socket buffers. But even if that's temporarily the case, the buffers should be freed once packets are processed (if the server is idle).
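
To put the reported settings in perspective: `net.ipv4.udp_mem` is expressed in pages and applies to all UDP sockets together, while `net.core.rmem_default` is in bytes and applies per socket. The sketch below converts both to MiB. It assumes a 4096-byte page size, which is typical for x86_64 but was not stated in the report (`getconf PAGESIZE` shows the real value).

```python
PAGE_SIZE = 4096  # assumption: typical x86_64 page size, not confirmed in the report
MIB = 1024 * 1024


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count (as used by udp_mem) to MiB."""
    return pages * page_size / MIB


udp_mem_pages = (21639, 28853, 43278)  # min, pressure, max from the report
rmem_default_bytes = 104857600         # per-socket default from the report

udp_mem_mib = [round(pages_to_mib(p), 1) for p in udp_mem_pages]
rmem_default_mib = rmem_default_bytes / MIB

print("udp_mem (MiB):", udp_mem_mib)            # [84.5, 112.7, 169.1]
print("rmem_default (MiB):", rmem_default_mib)  # 100.0
```

Under that assumption, the 100 MiB per-socket default is larger than the first system-wide `udp_mem` threshold of roughly 84.5 MiB. Whether this contributed to the reported behaviour is not established in this thread.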

Ralith commented 3 years ago

Maybe the QUIC stack has eaten all OS socket buffers.

Intuitively I'd think this is impossible if you're not going out of your way, but maybe the rules are relaxed when running as root (don't run as root).

Ralith commented 2 years ago

Closing for lack of engagement from reporter, and because it is unlikely that quinn is breaking the host UDP stack. Feel free to reopen if more information can be supplied.

Archieeeeee commented 2 years ago

Hello, sorry for the late reply. The issue did happen on one of my servers during development, but I have not seen it again since I reinstalled the OS (CentOS 7/8) on that server, and it never happened on my other servers, so this is not an issue in Quinn.

Ralith commented 2 years ago

Thanks for the update!