sfackler / r2d2

A generic connection pool for Rust

Failed connection attempts result in leaked file descriptors #100

Closed by kevinpoitra 4 years ago

kevinpoitra commented 4 years ago

To preface: I'm using r2d2 within a web service, running on a freshly installed CentOS 8 server. r2d2 is imported into my project via Diesel's r2d2 feature. The databases that r2d2 talks to are on separate servers that share a network with the service.
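For reference, the pool is built roughly like this (a simplified sketch rather than my exact code; the pool size and the way the URL is passed in are just placeholders):

```rust
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

// Sketch of a typical Diesel + r2d2 pool setup; values here are placeholders.
fn build_pool(database_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    let manager = ConnectionManager::<PgConnection>::new(database_url);
    Pool::builder()
        .max_size(10) // placeholder pool size
        .build(manager)
        .expect("failed to build the connection pool")
}
```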

Recently, one of my services suffered a network outage that cut off the program using r2d2 from the database. When the network came back online, the program still couldn't respond to requests: its log was full of Too many open files errors because it had hit the system's default soft limit of 1024 open files, and every further network request failed with the same 'too many open files' error until I manually restarted the process.

From my testing, any failed connection attempt within r2d2 causes this behavior. To test, I first took a baseline of the program's open file count while the network was working:

Then, I severed the connection between the program and the PostgreSQL database it talks to. Anything that causes r2d2 to return errors seems to trigger the leak; in my case I removed PostgreSQL's entry from the database server's FirewallD zone, which made the program immediately start logging "No route to host" errors from r2d2.

Using that same lsof command every few minutes, I monitored the open file count, and it seems to go up by one per failed connection attempt, starting from 92 open files right when the firewall was modified:

Inspecting the output of lsof at this point shows many entries that are nearly identical to the following:

COMMAND     PID     USER   FD      TYPE     DEVICE SIZE/OFF     NODE    NAME
xx          25972    xx   245u      sock      0,9      0t0     406107 protocol: TCP

At this point, I re-added PostgreSQL's port to the firewall, which allowed r2d2 to create connections successfully. I ran the lsof command again after waiting 20 minutes, and the count hadn't changed from 627. From what I saw after the real network outage, those open files are never released: I have about six hours' worth of logs of nothing but r2d2 throwing 'too many open files' errors on every connection attempt, every few seconds, even though the network came back online about two hours into those logs.
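As an aside, the same count can also be watched from inside the process on Linux by listing /proc/self/fd, which is roughly what lsof is reporting. A quick sketch:

```rust
use std::fs;

// Count this process's open file descriptors by listing /proc/self/fd
// (Linux-only; the read_dir iterator itself briefly holds one extra fd).
fn open_fd_count() -> std::io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}

fn main() -> std::io::Result<()> {
    println!("open file descriptors: {}", open_fd_count()?);
    Ok(())
}
```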

Some additional potentially helpful information:

sfackler commented 4 years ago

It seems like the problem would have to be with Diesel rather than r2d2. If Diesel fails to connect to the database, there's nothing for the pool to leak.
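To illustrate with a toy manager (a stand-in whose connect() always fails, not Diesel's actual one): the pool only ever sees the Err value, so it has no handle, and no file descriptor, that it could fail to close.

```rust
// Toy illustration only: a stand-in ManageConnection whose connect() always
// fails, the way a real manager does when the database is unreachable.
// (Assumes r2d2 = "0.8"; this is not Diesel's actual manager.)
use std::io;
use std::time::Duration;

#[derive(Debug)]
struct AlwaysFails;

impl r2d2::ManageConnection for AlwaysFails {
    // The pool never receives a connection, so there is no handle (and no
    // file descriptor) for it to hold on to or forget to close.
    type Connection = ();
    type Error = io::Error;

    fn connect(&self) -> Result<Self::Connection, Self::Error> {
        Err(io::Error::new(io::ErrorKind::Other, "no route to host"))
    }

    fn is_valid(&self, _: &mut Self::Connection) -> Result<(), Self::Error> {
        Ok(())
    }

    fn has_broken(&self, _: &mut Self::Connection) -> bool {
        false
    }
}

fn main() {
    let pool = r2d2::Pool::builder()
        .max_size(4)
        .min_idle(Some(0)) // don't open connections in the background
        .connection_timeout(Duration::from_secs(1))
        .build_unchecked(AlwaysFails);

    // Checkouts fail, but the pool owns nothing: any descriptor leaked during
    // a failed attempt has to be leaked inside connect() itself.
    assert!(pool.get().is_err());
}
```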

kevinpoitra commented 4 years ago

Thanks for the quick response!

You're right. I guess I was thrown off because my logs show r2d2 as the module logging these errors, like so:

ERROR r2d2 - could not create socket: Too many open files

When in reality, r2d2 just catches the Errs returned when it tries to connect and logs them itself, here. Should've looked into it a bit more!
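For anyone else thrown off by that log target: those lines come from r2d2's default LoggingErrorHandler, which simply logs whatever error the connection manager's connect() returned. A rough sketch of swapping in a custom handler (the handler name and log target here are made up) makes that explicit:

```rust
// Assumes the `log` and `r2d2` crates; Diesel's r2d2 feature pulls in the
// same r2d2 crate.
use log::error;

#[derive(Debug)]
struct TaggedErrorHandler;

impl<E: std::error::Error> r2d2::HandleError<E> for TaggedErrorHandler {
    fn handle_error(&self, err: E) {
        // `err` is exactly what the connection manager's connect() returned.
        error!(target: "db_pool", "could not open connection: {}", err);
    }
}

// Hypothetical usage:
// let pool = r2d2::Pool::builder()
//     .error_handler(Box::new(TaggedErrorHandler))
//     .build(manager)?;
```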

Closing this as I've moved the issue over to Diesel's repo.