pool running out (GetTimeout) on nightly

mikedilger commented 8 years ago

I'm getting some very strange behavior. My code works fine on a particular commit (pool used thousands of times, reclaims resources as connections fall out of scope). Then when I make the minimal changes necessary to upgrade to rust nightly (from 6-14 to 8-10) the (nearly) same code runs out of database connections quickly, even if I jack up the number of connections from my default 5 to any larger number, it will eventually consume them all.

The only changes I made to cause this to happen are:

upgraded rust from 6-14 to 8-10 (also tried tonight's nightly)
upgraded serde, serde_macros, serde_json (from 7 to 8)
upgraded regex_macros (0.1.37 to 0.1.38)
upgraded maud_macros (some commit to 0.9.2)
upgraded bincode (0.5 to 0.6)
upgraded handlebars (0.17 to 0.20)
minor method renames related to (2) above
a manual serialize/deserialize rewrite related to (2) above [that should not be in use]

Notice, no upgrades to r2d2, postgres, or r2d2-postgres

I've spent about 6 hours on this today. Tomorrow I'll see if a simple example exhibits the same behavior, then try bisecting rust-nightly versions.

sfackler commented 8 years ago

This sounds pretty bad - does it seem like the connections are never returned to the pool at all? The Debug implementation for Pool should print the number of connections it thinks it has - what does it say?

It could also be that something broke in Condvar::wait_timeout - does it seem like Pool::get is blocking for as long as you'd expect it to?

mikedilger commented 8 years ago

Pool::get is blocking for the 5 seconds or so before it gives the GetTimeout error.

How do I get the debug output? I'm compiling and running the debug binary, and also setting RUST_LOG=r2d2=trace,r2d2_postgres=trace but I'm not seeing anything.

sfackler commented 8 years ago

Sorry, debug as in println!("{:?}", pool).

sfackler commented 8 years ago

Could you also try running the r2d2 tests on nightly? They pass for me, and it'd be good to see if that's the case for you as well.

mikedilger commented 8 years ago

Yeah, the idle_connections keep dropping. Not every time though.

Tests pass on my machine too.

mikedilger commented 8 years ago

Also I watch my threads return and they definitely complete.

mikedilger commented 8 years ago

if I count the postgres processes from ps output that say "idle", they remain at 30 (what I started with), e.g. 30 like this:

postgres 12178  2780  0 16:01 ?        00:00:00 postgres: psf_php UPSF_core ::1(58506) idle

mikedilger commented 8 years ago

This is hard to reproduce. I've got a simpler version of my program (a project it forked from) compiled on nightly where this does not happen. So I can narrow this down by moving those projects towards each other. Still might be my code somewhere somehow, so don't stress about this yet.

mikedilger commented 8 years ago

Ok here is what I have found out so far.

If I do a let conn = try2!(row.get()); I never seem to have an issue.
If I instead do let conn = &*try2!(row.get()); then some functions seem to fail to drop the resource properly when they return.... not all functions, and I haven't determined what conditions exactly differentiate these cases.

Here is an example function that was causing the problem:

fn accessible_sites(app: &App, psuser: &Psuser) -> Result<Vec<i32>, Error>
{
    let sql = "SELECT site,name FROM person_site JOIN site ON person_site.site=site.id \
               JOIN person ON person_site.person=person.id \
               WHERE person_site.status='active' AND person.user=$1 \
               AND person_site.site IN (SELECT id FROM site WHERE status='active') \
               AND person_site.approved=true \
               UNION \
               SELECT id,name FROM site WHERE \"user\"=$2 AND status='active' \
               ORDER BY name";
    debug!("CONN +d");
    let conn = &*try2!(app.conn());
    let stmt = try2!(conn.prepare( sql ));
    let rows = try2!(stmt.query( &[&psuser.id, &psuser.id] ));
    let mut sites: Vec<i32> = Vec::new();
    for row in &rows {
        if row.len() < 1 { continue; }
        sites.push( row_get!(row, 0) );
    };
    debug!("CONN -d");
    return Ok(sites);
}

app.conn() is defined as

    fn conn(&self) -> Result<PgConn, GetTimeout> {
        println!("POOL: {:?}", self.pool);
        self.pool.get()
    }

where PgConn was defined as

// Shorthand for ridiculously long types
pub type PgPool = Pool<PostgresConnectionManager>;
pub type PgConn = PooledConnection<PostgresConnectionManager>;

I can work around this issue, but this appears to be a compiler bug.

sfackler commented 8 years ago

Yep, that seems like a codegen bug. I wonder if it's caused by the MIR switch?

sfackler commented 8 years ago

How is try2 defined?

mikedilger commented 8 years ago

Pretty sure the same thing would happen with try!(), but here is how it is defined:

/// A macro for trying an operation, and on failure, extending the error with file,
/// line and message information
///
/// Call with try2!( expression-yielding-result , ErrorType, "message")
#[macro_export]
macro_rules! try2 {
    ($expr:expr, $msg:expr ) => (
        match $expr {
            Ok(val) => val,
            Err(err) => fail2!(err, $msg)
        }
    );
    ($expr:expr) => (
        try2!($expr, "")
    )
}

/// A macro for extending an error with file, line, and message
///
/// Call with fail2!(e, ErrorType, "message")
#[macro_export]
macro_rules! fail2 {
    ($err:expr, $msg:expr) => ({
        let e: ::joist_error::Error = From::from($err);
        return Err( e.file(file!()).line(line!()).prefix_message($msg) )
    });
    ($err:expr) => (
        fail2!($err, "")
    )
}

mikedilger commented 8 years ago

Yep, verified with try!(), issue remains.

If you know which version of rust introduced the MIR switch, I can test the version prior to that.

sfackler commented 8 years ago

I'm not sure off the top of my head, but I think it was some time in the last month or so?

Since this seems like a codegen issue, we should probably move it to rust-lang/rust where the compiler folks will see it.

sfackler commented 8 years ago

I believe the upstream miscompilation issue has been fixed.

sfackler / r2d2

pool running out (GetTimeout) on nightly #25