kohana / database

A Kohana module for database interactions, building queries, and prepared statements
http://kohanaframework.org/documentation

Session trouble with search bots and Session::write() (especially with Database_Session) #92

Closed ghost closed 9 years ago

ghost commented 9 years ago

Two weeks ago I launched a new site and took some steps to promote it in Google, Yandex, and other search engines. But... after that good work, I found more than 100k rows in my "sessions" table. These were sessions for bots (Google, Yandex, etc.). It happens because bots don't accept cookies, so Kohana creates a new row in the sessions table for every request. Of course that's not OK, and garbage collection can't solve it. I worked around it by adding the following code at line 138 of modules/database/classes/Kohana/Session/Database.php (class Kohana_Session_Database), inside the _write() function:

// Skip writing a session row for requests that identify as a robot
if (Request::user_agent('robot') != '')
{
    return TRUE;
}

(I also added a few new bot user agent strings to user_agent.php; you can see them at http://pastebin.com/kpJdKz1D)

That works! But editing system modules/files is bad practice, so I moved the override into my helper module at "myhelpermodule/classes/Session.php":

// Transparent extension: skips session writes for robot user agents
abstract class Session extends Kohana_Session
{
    public function write()
    {
        if (Request::user_agent('robot') != '')
        {
            // Report success without storing anything
            return TRUE;
        }

        return parent::write();
    }
}
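
This works thanks to Kohana's cascading filesystem: a classes/Session.php in a module is found before the one in system/, so the transparent extension replaces the stock class everywhere without my touching system files. As far as I can tell it also covers the automatic end-of-request write, since Session::instance() registers write() as a shutdown function.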

That works too, but it feels like a hack. Could you please comment on my problem, my workaround, and this situation in general?

(And for more information, here is my session config file: http://pastebin.com/rVBGTzgD)

enov commented 9 years ago

I think your solution is good.

ghost commented 9 years ago

OK, thanks. But what can you say about the underlying problem? How does the community deal with it, i.e. with a session row being created for every bot request?

enov commented 9 years ago

@coll3ctor there seems to be a garbage collector that runs, on average, once every 500 requests.

https://github.com/kohana/database/blob/3.3/master/classes/Kohana/Session/Database.php#L37

Once your sessions are considered expired, the GC will clean them up.
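
Concretely, the check in the linked constructor is roughly this (paraphrased; $_gc defaults to 500, see the source for the exact code):

if (mt_rand(0, $this->_gc) === $this->_gc)
{
    // Run garbage collection on roughly 1 out of every $_gc requests
    $this->_gc();
}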

ghost commented 9 years ago

@enov My server receives 3-4 requests per second, which means 3-4 new rows in the sessions table every second. The garbage collector can't help me: even if my session lifetime were only Date::DAY (for example; and that is already a very short period), that makes numberOfNodesPerDay = 24 * 60 * 60 * 3.5 ≈ 300k rows. With that many rows, my site can't work correctly.

Or did I misunderstand your answer?

enov commented 9 years ago

If robots are piling up ~300k rows in your database, then you need to do something about it :)

If we were to do something about it in the framework, then the same would apply to native sessions and to the other session drivers as well. After all, native sessions are saved as files by default, so your case would pile up 300k files somewhere in session.save_path.

Now, I see that your solution is good and applies to the abstract Session class, which fixes every driver at once, but I'm still not 100% convinced that we should go for it. Are there any robots that store sessions? Do we break anything else if we don't write the session for robots?

I wonder what others think about this.

ghost commented 9 years ago

OK, many thanks, I understand!

You mean that I could also (perhaps) split sessions across directories if I use the 'native' driver, and then I wouldn't be overloading anything (except my hard disk, of course)?

From what I've seen, the Yandex, Google, Yahoo, MailRu and Rambler bots all refuse cookies, so every one of them fills up my database :( I'm quite sure my solution won't break anything (except for someone spoofing a bot's user agent; but such a visitor looks suspicious to me anyway, and it's fine if he gets no cookie, no session row, and no authorization on my site). And I'm quite sure that trying to save sessions for bots is useless and even harmful.

Also, about the bot check: it's fairly strict, and I think that's OK.

One more question: should I override Session_Database instead of Session? I only work with sessions through the 'database' driver (yes, I saw your advice about native sessions).

acoulton commented 9 years ago

I think this is something that needs to be fixed on a site-specific basis. In my experience it's highly unusual to get that volume of crawlers (on an ongoing basis) unless the site itself is also taking huge amounts of traffic (in which case you need to architect session storage for scale anyway).

I'm not aware of any other framework that automatically disables sessions for bots, so it would be quite unexpected. Some website monitoring services and potentially some desired crawlers will need to authenticate and use sessions - so this could break things for people.

I wouldn't object to having an option for this behaviour, but it shouldn't be default.
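
Purely as a hypothetical sketch of what an opt-in could look like, written here as an application-level override (the ignore_robots config key is invented; nothing like it exists in the module today):

class Session_Database extends Kohana_Session_Database {

    // Opt-in flag, read from the session config group
    protected $_ignore_robots = FALSE;

    public function __construct(array $config = NULL, $id = NULL)
    {
        if (isset($config['ignore_robots']))
        {
            $this->_ignore_robots = (bool) $config['ignore_robots'];
        }

        parent::__construct($config, $id);
    }

    protected function _write()
    {
        if ($this->_ignore_robots AND Request::user_agent('robot') != '')
        {
            // Report success without creating or updating a row
            return TRUE;
        }

        return parent::_write();
    }
}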

enov commented 9 years ago

:+1: for a configurable option, @acoulton, and yes, it should not be the default.

@coll3ctor no need to redeclare Session_Database. Your Session extension looks good.

loonies commented 9 years ago

I really don't see what robot filtering has to do with sessions. The user agent can be spoofed very easily. With no practical benefit and added maintenance overhead, I think this should not be included at all.

enov commented 9 years ago

@loonies

except for someone spoofing a bot's user agent; but such a visitor looks suspicious to me anyway, and it's fine if he gets no cookie, no session row, and no authorization on my site

@coll3ctor is arguing that he does not want to serve cookies to whoever is spoofing the user agent, either.

loonies commented 9 years ago

Regardless of spoofing, this is not reliable. I could set my browser's UA to a robot string, or vice-versa, for whatever reason; then my browsing would break, or a robot crawler would slip through.

ghost commented 9 years ago

@acoulton OK, I understand you, and I agree with you (and my site really does receive a large volume of traffic; it's a long story). @enov OK, nice. Could you also tell me which method I should override: the public write() or the protected _write()? @loonies And what solution would you offer?

loonies commented 9 years ago

Handle bots at the application level. Think of it as a CAPTCHA challenge.

acoulton commented 9 years ago

@loonies I don't think a CAPTCHA works for this issue - a GET to any page will cause a session to be created. That said, you're right: user agent sniffing is probably the least good option.

@coll3ctor do you actually need to create sessions for people just browsing? Could you instead make it so only logged in users have a session? If you're only storing a small amount of data you could also use cookie sessions (or cookie sessions for guests and database sessions for logged in users if you need to store more about them) which wouldn't need any storage on your side.

The more I think about it, the more I think we shouldn't fix this in core (even as an option). Would it help if we made it so the session is only written if there is data in it?

loonies commented 9 years ago

My point was that @coll3ctor should think of solving this problem the same way you would solve the problem with bots commenting, creating accounts, etc. on your website, hence mention of CAPTCHA challenge.

acoulton commented 9 years ago

@loonies I understand, but I genuinely have no idea how you would identify a bot vs a human on a GET request (other than by user agent). CAPTCHA/honeypot/etc. are only useful on form submissions or other interactions, as you mention.

loonies commented 9 years ago

Well, yeah, a UA check, like @coll3ctor did in the first post.

Just thinking out loud... If this is ever going to be implemented in the Session class, then we could add a session filter. Something like

$agents = array('Google', 'Yahoo');
$filter = new UASessionFilter($agents);
$session = Session::instance()->add_filter($filter);

Maybe split add_filter into add_read_filter and add_write_filter.
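
A minimal sketch of what that could look like (purely hypothetical: neither Session_Filter nor add_filter() exists in Kohana today):

interface Session_Filter {

    // Return FALSE to veto the session operation for this request
    public function apply();
}

class UASessionFilter implements Session_Filter {

    protected $_agents;

    public function __construct(array $agents)
    {
        $this->_agents = $agents;
    }

    public function apply()
    {
        foreach ($this->_agents as $agent)
        {
            if (stripos(Request::$user_agent, $agent) !== FALSE)
            {
                return FALSE; // matched a bot signature, skip the write
            }
        }

        return TRUE;
    }
}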

Again, I think it's only maintenance overhead. This is the first such use case I can remember since I've been using Kohana, and I'm not sure how many people would find this feature useful.

ghost commented 9 years ago

@loonies, @acoulton I think this is about CAPTCHA at the conceptual level, right? @acoulton

  1. No, I don't need sessions for all users, just for logged-in ones;
  2. "or cookie sessions for guests and database sessions for logged in users if you need to store more about them" - could you please tell me how I can do this? :)
  3. "Would it help if we made it so the session is only written if there is data in it?" - I think yes.

@loonies Did you mean that UASessionFilter could inherit from a Session_Filter, for example (i.e., do it at the Kohana level)? But isn't building classes and so on overkill for this problem?

PS @acoulton What do you think about adding a user_id to the session, at the Kohana level (not just in my app)? That way we could easily count online users, for example, or check whether a specific user is online. Yes, I've read the discussions about this in some Stack Overflow topics and on the official Kohana forum.

loonies commented 9 years ago

@loonies, @acoulton I think this is about CAPTCHA at the conceptual level, right?

Yes

@loonies Did you mean that UASessionFilter could inherit from a Session_Filter, for example (i.e., do it at the Kohana level)?

Yes, extend an abstract class or implement an interface.

But isn't building classes and so on overkill for this problem?

I don't get the question. What is difficult?

acoulton commented 9 years ago

I understood that @loonies meant CAPTCHA as a concept; I was just curious whether he had a suggestion for identifying a bot on the first GET request to the site, as I couldn't think of anything (except a user agent check).

Cookie sessions for guests and database sessions for logged in users if you need to store more about them

The simplest way is probably in your bootstrap:

// Pick the session driver per request, based on a flag cookie
// that you set at login and clear at logout
if (Cookie::get('authenticated')) {
  Session::$default = 'database';
} else {
  Session::$default = 'cookie';
}

And then when a user logs in or out, call Cookie::set('authenticated', TRUE|FALSE);.

A better approach would probably be a custom session driver that internally uses a cookie driver and a database driver to load merged session data, and then splits writes between the two types of storage as required - only writing the database session if there's something in it.
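
As a rough sketch of the routing idea only (Hybrid_Session is an invented name, and a real driver would plug into the abstract Session API and also handle regenerate/destroy across both stores):

class Hybrid_Session {

    protected $_cookie;
    protected $_database;

    public function __construct()
    {
        // Cookie sessions live entirely client-side: no server storage
        $this->_cookie = Session::instance('cookie');

        if (Cookie::get('authenticated'))
        {
            // Logged-in users get the database-backed store as well
            $this->_database = Session::instance('database');
        }
    }

    public function get($key, $default = NULL)
    {
        // Database values win over cookie values when both stores exist
        if ($this->_database AND ($value = $this->_database->get($key)) !== NULL)
        {
            return $value;
        }

        return $this->_cookie->get($key, $default);
    }

    public function set($key, $value)
    {
        // Route writes to the database when it is attached, else to the
        // cookie, so guests never create a sessions row
        $target = ($this->_database !== NULL) ? $this->_database : $this->_cookie;
        $target->set($key, $value);

        return $this;
    }
}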

What do you think about adding a user_id to the session, at the Kohana level (not just in my app)?

I can see the use case, but user_id is quite an application-specific concept, so there could be issues trying to implement something that suits all applications. Also, if you want to e.g. load sessions by user ID, that means adding a method to the database driver that can't be implemented in cookie sessions at all and would be difficult in native sessions.

IMHO it would be better to publish a separate package with a custom database session driver that stores/retrieves by user ID and so on, rather than adding that to core.
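
A hypothetical sketch of such a package's driver (the class name, the Auth usage and the extra nullable user_id column are all assumptions):

class Session_Database_User extends Session_Database {

    protected function _write()
    {
        // Persist the session exactly as the stock driver does
        $status = parent::_write();

        // Then tag the row with the current user, if someone is logged in
        if ($status AND $user = Auth::instance()->get_user())
        {
            DB::update($this->_table)
                ->set(array('user_id' => $user->id))
                ->where($this->_columns['session_id'], '=', $this->_session_id)
                ->execute($this->_db);
        }

        return $status;
    }
}

Counting online users would then just be a query over user_id and last_active.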

ghost commented 9 years ago

I don't get the question. What is difficult?

@loonies It's that this is a very simple, small problem, yet for it we are considering a whole module, several classes, inheritance, and so on. The problem is just an overflowing database table, but we are discussing so much work! :) (Maybe I didn't put this very well in English.)

@acoulton OK, thanks, I understand. So can we (the community) expect something from Kohana after this discussion? :)

acoulton commented 9 years ago

@coll3ctor you can wait if you like, but given that there are very few of us with very little time, and that nobody else is reporting real-world problems with this, you are likely to be waiting a long time...

I would suggest building a custom session driver yourself and releasing it as a kohana module.

ghost commented 9 years ago

@acoulton The current version of Kohana and the current set of Kohana modules are pretty good, so of course we'll wait :)

The main thing is that we solved the problem. Albeit with a bit of a crutch, but solved!

So, thanks everyone, and I'm closing this issue!