owncloud / core

:cloud: ownCloud web server core (Files, DAV, etc.)
https://owncloud.com
GNU Affero General Public License v3.0
8.38k stars 2.05k forks

updateLastLoginTimestamp implementation yields extremely bad performance #15455

Closed jcfischer closed 8 years ago

jcfischer commented 9 years ago

The updateLastLoginTimestamp function that is defined here: https://github.com/owncloud/core/blob/master/lib/private/user/user.php#L162 uses an incredibly inefficient way of storing the data.

We are running 7.0.5

Our Postgres 9.3 server (currently completely overloaded due to insufficient disk IOPS) shows me thousands of entries like this:

--
2015-04-08 08:19:49 UTC LOG:  duration: 3163.420 ms  execute pdo_stmt_0000001b: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:19:49 UTC DETAIL:  parameters: $1 = '1428481185', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
2015-04-08 08:19:49 UTC LOG:  duration: 1227.816 ms  execute pdo_stmt_0000001b: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:19:49 UTC DETAIL:  parameters: $1 = '1428481187', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:19:51 UTC LOG:  duration: 2615.555 ms  execute pdo_stmt_0000001a: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:19:51 UTC DETAIL:  parameters: $1 = '1428481189', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:19:51 UTC LOG:  duration: 2615.530 ms  execute pdo_stmt_0000001a: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:19:51 UTC DETAIL:  parameters: $1 = '1428481189', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:19:51 UTC LOG:  duration: 2588.061 ms  execute pdo_stmt_0000001b: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:19:51 UTC DETAIL:  parameters: $1 = '1428481189', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:20:31 UTC LOG:  duration: 37500.068 ms  execute pdo_stmt_0000001b: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:20:31 UTC DETAIL:  parameters: $1 = '1428481193', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:20:31 UTC LOG:  duration: 37462.066 ms  execute pdo_stmt_0000001a: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:20:31 UTC DETAIL:  parameters: $1 = '1428481193', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:20:31 UTC LOG:  duration: 37452.476 ms  execute pdo_stmt_0000001a: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:20:31 UTC DETAIL:  parameters: $1 = '1428481193', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:20:31 UTC LOG:  duration: 20325.126 ms  execute pdo_stmt_0000001a: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:20:31 UTC DETAIL:  parameters: $1 = '1428481210', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'
--
2015-04-08 08:21:33 UTC LOG:  duration: 22488.990 ms  execute pdo_stmt_0000001b: UPDATE "oc_preferences" SET "configvalue" = $1 WHERE "userid" = $2 AND "appid" = $3 AND "configkey" = $4
2015-04-08 08:21:33 UTC DETAIL:  parameters: $1 = '1428481270', $2 = 'xxx', $3 = 'login', $4 = 'lastLogin'

Misusing the oc_preferences table to store this information seems like a really bad idea: it forces our database to constantly clean up the table as it accumulates a lot of dead tuples. Can this be moved to another table, or to another (in-memory) database (redis comes to mind)?
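One common mitigation is to throttle the write: only persist lastLogin when the stored value is older than some interval, so most logins skip the UPDATE entirely. A minimal sketch in Python (the `get_pref`/`set_pref` accessors and the one-hour policy are hypothetical, not ownCloud's actual API):

```python
import time

UPDATE_INTERVAL = 3600  # hypothetical policy: persist lastLogin at most once per hour


def maybe_update_last_login(get_pref, set_pref, user_id, now=None):
    """Write the lastLogin preference only if the stored value is stale.

    get_pref/set_pref are hypothetical accessors for the oc_preferences
    table; this is a sketch of write-throttling, not ownCloud's code.
    Returns True if a write was issued, False if it was skipped.
    """
    now = int(now if now is not None else time.time())
    stored = int(get_pref(user_id, "login", "lastLogin") or 0)
    if now - stored < UPDATE_INTERVAL:
        return False  # recent enough: skip the expensive UPDATE
    set_pref(user_id, "login", "lastLogin", str(now))
    return True
```

With the 3,705 lastLogin rows reported later in this thread, that would cap the write rate for this key at a few thousand per hour, regardless of how often users re-authenticate.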

LukasReschke commented 9 years ago

Just for the record: @jcfischer you forgot to censor an email address; I replaced it with xxx as well.

MorrisJobke commented 9 years ago

@blizzz You introduced this

jcfischer commented 9 years ago

@LukasReschke - thanks!

We have disabled the update call and have seen a massive improvement in our slow queries, as shown by the timestamps of our last few slow-query logs (I disabled it around 08:20):

-rw------- 1 postgres postgres 10485790 Apr  8 07:11 postgresql-2015-04-08_070849.log
-rw------- 1 postgres postgres 10485859 Apr  8 07:13 postgresql-2015-04-08_071140.log
-rw------- 1 postgres postgres 10486039 Apr  8 07:43 postgresql-2015-04-08_071356.log
-rw------- 1 postgres postgres 10486038 Apr  8 07:51 postgresql-2015-04-08_074339.log
-rw------- 1 postgres postgres 10485891 Apr  8 08:01 postgresql-2015-04-08_075134.log
-rw------- 1 postgres postgres 10489582 Apr  8 08:11 postgresql-2015-04-08_080148.log
-rw------- 1 postgres postgres 10486199 Apr  8 08:18 postgresql-2015-04-08_081103.log
-rw------- 1 postgres postgres  3372146 Apr  8 08:36 postgresql-2015-04-08_081824.log

MorrisJobke commented 9 years ago

Weird... there is even an index on this table, and it's just one update per user per login:

https://github.com/owncloud/core/blob/6c327f8331617652ef0b268d51edc6a23624e33c/db_structure.xml#L607-L623

@butonic You may know the DB internals better. But having a simple update take anywhere from over 1 second up to over 37 seconds is really, really bad.

jcfischer commented 9 years ago

@MorrisJobke the index actually makes the update more expensive, because it too needs to be updated. Also: one update per login adds another n*2 updates per minute, where n is the number of active users... We currently have 8100 users (granted, not all of them active). Our DB server is struggling to keep up with the regular updates anyway, so any additional load kills performance further.

MorrisJobke commented 9 years ago

@MorrisJobke the index actually makes the update more expensive because it too needs to be updated.

I just noticed on a MySQL DB that adding an index before an update reduced the time needed, because the lookup of the corresponding entry was faster. But maybe this is different for Postgres.
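Both observations can be true: an index on the WHERE columns lets the UPDATE find the row without a full-table scan (Morris's point), while whether the write must also maintain the index depends on the engine (Postgres, for instance, only skips index maintenance for heap-only-tuple updates). The lookup side is easy to demonstrate; the sketch below uses SQLite rather than Postgres, so planner details differ, and the composite index columns are assumed from the WHERE clause of the queries quoted above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oc_preferences "
             "(userid TEXT, appid TEXT, configkey TEXT, configvalue TEXT)")
# Composite index; the column set is assumed from the UPDATE's WHERE clause.
conn.execute("CREATE INDEX preferences_index "
             "ON oc_preferences (userid, appid, configkey)")
conn.execute("INSERT INTO oc_preferences "
             "VALUES ('xxx', 'login', 'lastLogin', '0')")

# Ask the planner how it would locate the row for the UPDATE.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "UPDATE oc_preferences SET configvalue = ? "
    "WHERE userid = ? AND appid = ? AND configkey = ?",
    ("1428481185", "xxx", "login", "lastLogin")).fetchall()
# The detail column should report an index search, not a full scan.
print(plan[0][-1])
```

With the index in place the plan reports a SEARCH using the index; dropping the index turns it into a SCAN of the whole table.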

nickvergessen commented 9 years ago

@jcfischer so you log out your users after 30 seconds? Note: login does not mean connect(ion)

Also, can we have some numbers for this table? Like total entries, number of entries grouped by user name, and number of entries grouped by appid = 'login' and configkey = 'lastLogin'. That would be awesome.

jcfischer commented 9 years ago

@nickvergessen - I just counted 53k (slow) updates on this table today alone. We separate web clients and desktop clients; I don't know if the desktop clients stay logged in all the time:

$ grep  lastLogin postgresql-2015-04-07_*.log | wc -l
77262
$ grep  lastLogin postgresql-2015-04-08_*.log | wc -l
53569

Note that yesterday we removed the PHP session file locking because it took out our application servers one by one (they were stuck in requests that were waiting for the session file to be released). That might explain the much lower number yesterday (when our users basically couldn't use the system).

            relname            | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | last_vacuum |         last_analyze          |        last_autovacuum        |       last_autoanalyze
-------------------------------+-----------+-----------+-----------+------------+------------+-------------+-------------------------------+-------------------------------+-------------------------------
 oc_preferences                |      3355 |   2992451 |      1560 |      77105 |         28 |             | 2015-04-08 08:04:49.472805+00 | 2015-04-08 08:51:02.523624+00 | 2015-04-08 08:49:59.028155+00
owncloud=# select count(*) from oc_preferences;
 count
-------
 77398

owncloud=# select appid, count(*)  from oc_preferences group by appid;
     appid      | count
----------------+-------
 settings       |  8121
 contacts       |     7
 user_ldap      | 33667
 documents      |     1
 owncloud       |     1
 core           |   213
 login_token    |  6601
 firstrunwizard |  6875
 files          | 15532
 calendar       |     3
 login          |  3705
 files_sharing  |  2672
(12 rows)

owncloud=# select count(*)  from oc_preferences where appid = 'login' and configkey = 'lastLogin';
 count
-------
  3705
(1 row)
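Those numbers make the churn concrete: pg_stat reports roughly 2.99M updates against only ~77k live rows, of which just 3,705 are lastLogin rows. As a rough upper bound (assuming, plausibly given the slow-query log, that most of those updates hit the lastLogin rows), each such row has been rewritten hundreds of times, and in Postgres every rewrite leaves a dead tuple for (auto)vacuum to reclaim:

```python
# Figures copied from the pg_stat / count(*) output above.
total_updates = 2_992_451    # n_tup_upd for oc_preferences
last_login_rows = 3_705      # rows with appid = 'login', configkey = 'lastLogin'

# Upper bound: if all updates hit lastLogin rows, each row was rewritten
# about this many times, each rewrite leaving one dead tuple behind.
rewrites_per_row = total_updates / last_login_rows
print(f"<= {rewrites_per_row:.0f} rewrites per lastLogin row")
```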

MorrisJobke commented 9 years ago

Note that yesterday we removed the php session file locking because it took out our application server one by one (they were stuck in requests that were waiting for the session file to be released).

This is something we should look into too. Something seems to block the sessions.

MorrisJobke commented 9 years ago

Note that yesterday we removed the php session file locking because it took out our application servers one by one (they were stuck in requests that were waiting for the session file to be released).

@jcfischer You mean that you removed the (previously added) comment marker here: https://github.com/owncloud/core/blob/v7.0.5/lib/private/session/internal.php#L52 so that the line is active again?

Or am I wrong?

DeepDiver1975 commented 9 years ago

We separate web clients and desktop clients - I don't know if the desktop clients keep being logged in all the time:

So are these 53k calls from the web, the desktop client, or both?

It would be great to get values for each of them. The desktop client should not be responsible, because the client reuses the session cookie.

nickvergessen commented 9 years ago

Well, if you had problems with sessions (meaning you killed them and required everyone to log in again), that would explain why so many updates happened; however, it does not really explain why they are so slow.

MorrisJobke commented 9 years ago

Well, if you had problems with sessions (meaning you killed them and required everyone to log in again), that would explain why so many updates happened; however, it does not really explain why they are so slow.

I would say that this is just a misunderstanding. We rolled out the "don't close session early" change to prevent too many parallel requests and to avoid the possibility of our race conditions with lost fileids. @jcfischer can you confirm that this is the case?

jcfischer commented 9 years ago

@MorrisJobke - yes, that is exactly what we did. Our application server processes were starving, waiting for the session file to become unlocked. It seems that a few clients doing that can completely shut down an app server with 60 PHP processes.

@nickvergessen - as Morris said, we have just implemented the recommended changes (i.e. keeping the desktop clients on one application server (sticky) instead of distributing them among all our servers via round robin from the load balancer, and uncommenting the "don't close sessions early" "fix"). As I said, that kills our complete system on a regular basis: the app servers get stuck waiting for a request to clear the session lock, the load balancer removes those servers from rotation, and the remaining servers get overloaded.

Whoever thought that it was a great idea to write data to the PHP session should come here for a day or two and try to keep the system running.
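For context on the starvation described here: PHP's default files session handler takes an exclusive lock on the session file when a request starts and, unless session_write_close() is called early, holds it until the request ends, so concurrent requests for the same session are fully serialized. A minimal Python sketch of the same locking primitive (flock, Unix-only; the file is a stand-in for a PHP session file) shows the effect:

```python
import fcntl
import tempfile

# Stand-in for a PHP session file (temporary path, not ownCloud's layout).
sess = tempfile.NamedTemporaryFile()

# Request 1 opens the session and takes the exclusive lock for the
# whole request, which is PHP's default behaviour.
req1 = open(sess.name)
fcntl.flock(req1, fcntl.LOCK_EX)

# Request 2 for the same session tries to take the lock and would block;
# with LOCK_NB it fails immediately instead, which is how requests pile up.
req2 = open(sess.name)
blocked = False
try:
    fcntl.flock(req2, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    blocked = True
print("second request blocked:", blocked)

# Releasing early (the effect of PHP's session_write_close()) lets
# request 2 proceed at once.
fcntl.flock(req1, fcntl.LOCK_UN)
fcntl.flock(req2, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now succeeds
```

With 60 PHP workers per server and a handful of clients issuing many parallel requests under one session, every worker can end up parked on such a lock, which matches the starvation pattern described above.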

jcfischer commented 9 years ago

@nickvergessen - just for perspective: our two web servers (web GUI) handle around 1-2 requests per second, if at all. Our 4 sync servers (which handle the desktop clients) currently handle 150-200 requests per second.

Also, we have huge problems with the database server (for once independent of ownCloud) because we are using a distributed storage system (Ceph) for the volume the database resides on. This storage has limited IOPS (around 100) and sometimes loads itself down with its own maintenance. At this moment we are racking up new servers for the DB that will use battery-powered SSDs. We expect to have them up and running tomorrow.

butonic commented 9 years ago

@jcfischer where are you storing the session files? In the GPFS filesystem? I strongly recommend using a memcached or (starting with OC8) redis server for this. Don't even consider storing them in the database. That is a bad idea in the first place, and it gets worse considering your DB storage resides on a Ceph device (RBD or FS?).

jcfischer commented 9 years ago

@butonic - they are stored in tmp files on a ramdisk on the server. According to @karlitschek that was the fastest method; a central memcached adds another 1-2 ms for access to the data.

MorrisJobke commented 8 years ago

Logins now shouldn't happen that often anymore, because the sync clients properly reuse the session and don't re-authenticate anymore (which is what caused the new login timestamps).

Also, the last time we talked, this no longer seemed to be a performance issue.

I will therefore close this ticket. If this is still an issue for you, please report back.

Thanks

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.