trouble limiting backoff

neonstalwart commented 12 years ago

i'd like to have a way to limit the backoff logic.

by default, opts.backoff is -1. when backoff is called, it will return a number for the setTimeout for reconnecting - first time it will return 0, next time 1000 then 2000 until backoff reaches backoffmax (which by default is 30000) and at that point, the value returned to setTimeout will be 30000000 and then it will return that number until a connection is successful.

my use case is simply to use findandbind to authenticate users. i open a connection, then in the callback i call findandbind and then in that callback i close the connection. if the LDAP server is not available to open, my callback is called with an error and i close the connection - at that point i don't want to keep trying reconnects.

    var LDAP = require('LDAP'),
        ldap = new LDAP({
            uri: 'ldap://ldap.foo.com/',
            backoff: 1,
            backoffmax: 1
        });

    ldap.open(function (err) {
        if (err) {
            console.log('error opening ldap', err);
            return ldap.close();
        }

        ldap.findandbind({
            base: 'dc=foo,dc=com',
            filter: '(&(objectClass=*)(uid=SOMEUSERID))',
            scope: ldap.sub,
            attrs: 'dn',
            password: 'SOMEPASSWORD'
        },function (err, user) {
            if (err) {
                console.log('findandbind error', err);
                return ldap.close();
            }

            ldap.search({
                base: 'cn=everyone,ou=whatever,dc=foo,dc=com',
                filter: '(objectClass=groupOfNames)',
                attrs: 'member'
            }, function (err, groups) {
                var group = groups && groups[0];

                if (err || !group) {
                    console.log('ldap search error', err || !group);
                    return ldap.close();
                }

                console.log('ldap auth passed:', group.member.indexOf(user.dn) > -1);
            });
        });
    });

using node-inspector (or by adding some debugging to node-LDAP) i can see that the backoff logic still applies even after i close the connection. is there a way to stop it? am i doing something wrong? i've tried to limit the backoff by using small numbers but it still keeps going.

i see this when developing locally because i don't have access to the LDAP server. you could probably see this issue by trying to open a nonsense url. it doesn't look like you have any tests that cover this kind of reconnecting.

neonstalwart commented 12 years ago

the problem is also compounded by the call to binding.simpleBind blocking until it fails. maybe https://github.com/joewalnes/node-ldapauth/blob/master/ldapauth.cc can shed some light on how to do this without blocking?

jeremycx commented 12 years ago

ldapauth.cc handles this using a second thread, which is do-able, for sure.

Yeah, um, well the backoffmax * 1000 is a pretty massive bug. Easily fixed, thanks.

There likely is another bug regarding reconnect and close() - can you explain what you're seeing a bit better?

I'm very interested in fixing the problems you are seeing, but I'd like to point out that you can findandbind on an already opened (and bound) connection - connect anonymously on startup, and then just call findandbind as your requests come in - it's way more efficient than opening a new TCP session with each request.

neonstalwart commented 12 years ago

ldapauth.cc handles this using a second thread, which is do-able, for sure.

that would be nice. currently, since the call to binding.simpleBind is sync, i'm being blocked waiting for the underlying connection attempt to timeout/fail. this seems like about 30 seconds (maybe more) and i'd rather not have to wait so long to find out that the connection is going to fail. if the bind was done in a separate thread, i would think that the JS could have it's own timeout to invoke the callback with an error sooner than when the connection attempt failed.

Yeah, um, well the backoffmax * 1000 is a pretty massive bug. Easily fixed, thanks.

thanks - i see the fix and i'll try it out.

There likely is another bug regarding reconnect and close() - can you explain what you're seeing a bit better?

i'll try to explain what i'm seeing. following the code in LDAP.js for when the connection is down here's what i think i'm seeing...

when i call ldap.open, it will call binding.open (this seems to have no problems) and then it tries to do an anonymous bind.
in doing the anonymous bind the following line is what blocks for ~30 seconds

 msgid = binding.simpleBind();

since there is an issue trying to connect, the next thing to happen (before the rest of the bind code continues) is that the disconnected callback (added via the call to binding.setcallbacks) is invoked (this is as a result of EMITDISCONNECT in the C++ code) and we are not currently reconnecting, so there is a call to reconnect.
reconnect closes the binding and then adds a call to setTimeout that involves the backoff logic (this first timeout is for 0ms using the default values).
once reconnect finishes, then the binding.simpleBind call mentioned above is now complete and msgid is -1.
the bind function then continues to update the stats and registers a callback via setCallback(msgid, bind, arguments, fn) (keeping in mind that msgid is -1).
setCallback then sees that the msgid is not >=1 and then invokes my callback with an error.
my callback sees an error and then tries to cleanup by closing the connection (it looks like this isn't needed, but i guess it isn't hurting anything).
shortly after this, the callback to setTimeout (added during reconnect) is invoked

var res = open(function(err) {
    if (err) {
        reconnect();
    } else {
        stats.reconnects++;
        opts.backoff = -1;
        replayCallbacks();
        reconnecting = false;

        if (syncopts) {
            sync(syncopts);
        }
    }
});

this immediately calls open again which results in blocking the thread again for ~30 seconds.
open eventually finds that there is no connection to the server and so the callback is called with an error.
reconnect is called again and now we are in a loop... reconnect -> open -> fail -> reconnect

I'm very interested in fixing the problems you are seeing, but I'd like to point out that you can findandbind on an already opened (and bound) connection - connect anonymously on startup, and then just call findandbind as your requests come in - it's way more efficient than opening a new TCP session with each request.

thank you... i may have a misunderstanding - is ldap.open a sync call? if not, how do you handle knowing when the connection is ready as requests come in? i can see that i could use process.on('exit', ... ) to close the connection without needing to know when it was ready but i'm not sure how i can handle knowing when the connection is ready to use without using the callback? is it safe to call ldap.open multiple times without calling ldap.close?

jeremycx commented 12 years ago

Excellent analysis! I've been looking for that bug for a really long time (have not been able to reproduce in captivity), but that sounds exactly what I've been looking for. I'll look at it forst thing tomorrow and let you know what I find.

I'll also write tomorrow regarding the proper use of open() et al.

neonstalwart commented 12 years ago

thank you. I also just realized (as I was falling asleep), the calls to reconnect will grow exponentially because during the course of calling open (from inside reconnect), the disconnect callback will call reconnect and so will the callback provided to open.

neonstalwart commented 12 years ago

I also just realized (as I was falling asleep), the calls to reconnect will grow exponentially...

the reconnecting flag actually guards against this happening. the disconnect callback won't call reconnect if it is already reconnecting...

jeremycx commented 12 years ago

I found an async connect option that I totally missed somehow - I'm working on getting open() to be fully async, which should solve a ton of problems. Give me a few hours.

neonstalwart commented 12 years ago

woo!

jeremycx commented 12 years ago

Haven't forgotten about this - just got really busy this week. Stay tuned.

neonstalwart commented 12 years ago

@jeremycx any progress on this? is there anything i could do to help?

i understand that open source projects are usually developed in people's spare time so i hate to apply pressure but i need to decide if i should be moving to another library in order to keep my project moving forward.

neonstalwart commented 11 years ago

hey @jeremycx is there any chance you might look at this again? i was hoping you might get around to taking a look at this and i'm actually using your library (and just hoping that the connection to the ldap server doesn't fail and cause my server to block waiting for a timeout). it sounded like you had found something that would fix this.

jeremycx commented 11 years ago

Sorry for not keeping you up to date.

I did look at it - for 3-4 evenings. What a mess (on the libldap side). Callbacks that never get called back. Weirdness abundant!

I'm currently looking at the reconnect logic again, which is, unsuprisingly, horrible broken if you have a syncrepl cannel activated. Sigh.

So, unless I get really weird and connect in a different thread (which has it's own share of considerations), I think we may have to live with block-on-connect for now. How long a block could you actually handle without too much trouble? I think I could tighten up the connect timeout to a small number of seconds which should keep you out of trouble.

neonstalwart commented 11 years ago

thanks for the response - i think currently it's blocking for somewhere in the order of 30 seconds before it has a timeout. i'm using this as part of the login process for users of a web application. so if the LDAP server ever happened to be down, users will be waiting about 30 seconds before the webserver provides a response to the browser. given that in my case the LDAP server is currently on the same infrastructure as the webserver i figure it's not too big of a problem - if one's down they're probably both down. however, if for some reason only the ldap server was down (or when we move to making them more independent of each other) then i think somewhere in the range of 2 - 5 seconds might be a better experience for the user.

jeremycx commented 11 years ago

The connect timeout has been added 1.1.2.

Also, lots of reconnect fixes are complete.

jeremycx / node-LDAP

trouble limiting backoff #23