theCrag / website

theCrag.com: Add your voice and help guide the development of the world's largest collaborative rock climbing & bouldering platform
https://www.thecrag.com/

Star schema changes for efficient faceted and leaderboard search #1168

Closed: scd closed this issue 11 years ago

scd commented 11 years ago

Lots of details to fill in here (eg delete, merge & reparent nodes).

scd commented 11 years ago

On 9/08/2013 6:28 PM, Brendan Heywood wrote:

On Fri, Aug 9, 2013 at 4:13 PM, Simon Dale scd@thecrag.com wrote:

On 9/08/2013 2:32 PM, Brendan Heywood wrote:
I think it's pretty important that the new server is up to date, prod's age seems to be holding back a lot of possibilities.
I know. To do this safely it means we have to have another server. Remember how it took 3 days to get your dev operational again when we upgraded.
Basically I had expected to have done this by now, but I also expected the iphone app to have been out and the facebook stuff to have been done months ago.

I think I have found a more cost effective provider in Germany.
These are definitely low hanging fruit, but I'd also like to know roughly where the gap between 4.5 seconds and 13.8 seconds goes in apache land. Perl land shouldn't be doing that much processing, so I'm guessing the rest is also in the DB, but probably split between a bunch of smaller queries. In prod do you know what the background load was? ie do you think that 13.8 seconds is due to load from others, or load from the same page doing other stuff? I think the key here is setting the slow log threshold on prod to something like 200ms and collecting it for a while. It's also useful to set it to zero and take a black box approach to seeing what a page load actually does; often you find a lot of redundant stuff that can be culled back.
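A minimal sketch of those slow-log settings, assuming MySQL (threshold values illustrative):

    -- Log any statement slower than 200ms while collecting data.
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 0.2;  -- seconds

    -- Black-box mode: log every query to see what a page load actually does.
    SET GLOBAL long_query_time = 0;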
Still, even with all of this it's a long way from snappy. Ideally I'd want to keep focusing on making the initial query faster rather than just looking into more caching options. One possibility is having TLC-level stats, ie each TLC has a tick count for each climber. Below TLC level we use the normal query, which seems to run in quite acceptable times. Then above TLC we only join the collector table down to the TLC's and then join to the summary stat. It would be good to get a feel for how many nodes and how many ticks that query can handle before it gets above say 200ms; it might be better to collect roughly at bottom-level crags instead of TLC's. Or instead of linking it to node type, collect it whenever a node gets to X ticks, which would be more efficient and robust and independent of type changes.
If we are going to do TLC stats we might as well roll it up to all areas. On average, for every route there are about 8 collector nodes. I already have a mechanism to manage this with the statistics package; it will be able to handle incremental updates to account/node stats. There is a bit of configuring to do. We would also have to consider how we would segment - eg ticks, onsights, boulders, etc - as each could be a leaderboard in its own right.

If that's easier then great; I was thinking in terms of the most bang for buck vs DB size. For now, pragmatically, the most important page is the climbers tab as it has to be super quick and we're only interested in totals, not the details. Facets are a different matter, though I'm sure they'd benefit heaps from having just that total stat first, as you could filter on it and then filter that result set to get, say, just onsights.

We are really talking star schema design here which I think leads to a very involved discussion. Let's have this discussion sometime after the next release.
The other thought is to add an index order to the collector nodes so that the join can be much quicker. Ie give every leaf node a globally ordered number, and each node above stores the range of numbers below it. The numbering would be fairly sparse, so reparents often don't need major re-allocations. This could also be stacked on top of the previous speedup.
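A sketch of how that range lookup might replace the collector join (all table and column names hypothetical, and assuming an index on Route.LeafOrder):

    -- Every leaf gets a sparse global order number; each area stores the
    -- range its descendant leaves fall into.
    CREATE TABLE AreaRange (
        NodeID  INT PRIMARY KEY,
        LeafMin INT NOT NULL,
        LeafMax INT NOT NULL
    );

    -- Routes under area 1234: one indexed range predicate instead of an
    -- O(n) collector join.
    SELECT R.ID
    FROM Route R JOIN AreaRange A ON A.NodeID = 1234
    WHERE R.LeafOrder BETWEEN A.LeafMin AND A.LeafMax;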

You can skip most of this if you want and jump to the last bit; most of it is my brain dump as I read and explored...

There are about 3-4 major ways of working with tree data in SQL (adjacency list, Modified Preorder Tree Traversal, etc) and they have different trade-offs for insert vs read vs reparent. Oracle has a built-in hierarchical query and index type which is super fast; I'm not sure which model it uses internally. As a bit of history, when I built my own tree node DB for the first Beulah using Cake I used the TreeBehaviour, which implements the MPTT one: http://book.cakephp.org/2.0/en/core-libraries/behaviors/tree.html

What I had in mind is similar but more flexible: instead of the sets being continuous they are sparse, so it is much easier to work with reparents. Imagine we had a simple business rule like no node could have more than 256 children. Then we could represent any node as a string of bytes from the world down to it:

    Australia was the first in the world, so it could be: 1
    NSW might be 2nd in Aus, so it is: 12
    New England could be 7th in NSW, so it is: 127
    A boulder in Gara could be: 127432
    And the actual route could be: 1274325

Now to find all routes under Gara it is simply a "like '1274%'", which using a B-tree index is super fast. It completely removes the whole collector table and join, which is the expensive one. The key thing is that the collector table isn't ordered, so it has to use the index O(n) times, while the path method is O(1).

Reparents would mostly involve just swapping out the substring on a set of routes, which is pretty quick. If this is slightly out of date it's not the end of the world; there is currently a bit of lag of weird stats after big reparents and we can live with that. By sparse I mean that we purposefully leave lots of gaps so that most reparents don't require changing any other routes or nodes. So in a cliff with only 20 routes their numbers could be 1, 4, 7, 10, etc, so you can insert new routes without changing the others, meaning most inserts will be instant. If you do insert say 5 routes it's not too much juggling to push the adjacent ones over. We could even have a process that comes along and respaces them as needed to keep them balanced.

Routes could be sparse vertically too, so we could set some conservative rules around region and TLC depths, for example:

    World
    Region level 1
    Region level 2
    Country
    Region level 3
    Region level 4
    Region level 5
    TLC
    Whatever

Let's say Asia is '3' and Thailand is its 5th child; instead of being 35 it would be 305. Because we always know which slot a country is in we can speed up certain country-level lookups, and the same applies to TLC's. But the real flexibility here comes from a country being able to have quite a lot of internal region creation and reparenting done with less impact on its siblings and cousins.

After a bit of digging I found an Oracle article which calls this idea a Materialized Path; I think it's also called path enumeration. But I think my idea is slightly different, as I was thinking of using the order value at each position rather than the id of the ancestor node. To me the whole point of the path is to directly index it using a B-tree, and you can't do that if you use the id method. From a maths point of view I think my string/path method is actually very close to a nested set, if you think of each char in the path as another decimal place of precision. We'd just be using a decimal number instead of a rational fraction.
This is the best summary of the big-picture options I've found so far: http://stackoverflow.com/questions/4048151/what-are-the-options-for-storing-hierarchical-data-in-a-relational-database So your collector table is called a 'closure table' in some of the places I've been reading. This slide deck gives a pretty good comparison of them all, especially slide 69: http://www.slideshare.net/billkarwin/models-for-hierarchical-data

... God, I've just read a paper by some dude at Oracle on Nested Interval Trees with Continued Fractions. That is NOT something we want to hand roll. I guess one great thing about Oracle and friends is that they can swap these out for increasingly faster implementations and we just don't know and get the benefit for free. So I'm still thinking that something like the path is the simplest to understand and has the most power.

If we reparent a tree, the really critical thing is that afterwards we also propagate the new path values to the ascents table and the other tables we have facet queries for. This way the ascents facet query won't need any joins at all to do the /at/1234/ filtering, which is what's killing us.
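A minimal sketch of the path idea in SQL terms (table and column names hypothetical, not the real schema):

    -- Each node stores its sparse materialized path from the world down.
    CREATE TABLE PathNode (
        ID   INT PRIMARY KEY,
        Path VARCHAR(64) NOT NULL,  -- eg '1274' for Gara, '1274325' for a route
        KEY PathIndex (Path)        -- ordinary B-tree index
    );

    -- Everything under Gara: a single index range scan, no collector join.
    SELECT ID FROM PathNode WHERE Path LIKE '1274%';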

Can't picture what you mean here for leaderboards, but maybe useful for faceted search. A leaderboard is essentially an ordered list of accounts; if we have some star schema using some sort of node/account table then there is little to join. If we then have an index on each of the leaderboard stats this should be fairly snappy. The table may look something like:

AccountNodeStatistics:

    Account
    Node
    Ascents
    Karma

I think stopping here is enough, rather than also adding:

    Onsights
    Trads
    Boulders

The problem with this approach is that we have to predefine all the account/node statistics we are rolling up.
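A minimal sketch of that table, assuming MySQL (names follow the list above; the per-stat indexes are what let a leaderboard read only its first few entries):

    CREATE TABLE AccountNodeStatistics (
        Account INT NOT NULL,             -- climber's account id
        Node    INT NOT NULL,             -- index node the stats roll up to
        Ascents INT NOT NULL DEFAULT 0,
        Karma   INT NOT NULL DEFAULT 0,
        PRIMARY KEY (Account, Node),
        KEY NodeAscents (Node, Ascents),  -- one index per leaderboard stat
        KEY NodeKarma   (Node, Karma)
    );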

For ascent and route faceted search I would suggest a different approach, where each route and ascent has say 16 ancestor nodes as separate fields on each record. This would mean there is no collector join for the facets, and hence they'd be much faster. For example a route search in Australia would trigger a search on the level 1 field (World is level 0, and Australia is level 1).

Ok, only just got down to reading this. Essentially this solves exactly the same problem as the path above, and in almost the same way. It means we'd introduce a hard limit on the depth of the node tree, but that's ok. I like that we've both more or less come up with the same solution; to me that's a signal we're on the right track. I don't think there is much between them, and both would take about the same space. I prefer your solution because it's also got referential integrity, so it would be easier to find all grade 6 climbs under Australia and then find their TLC; my way would have a slightly awkward join on a substring. For this reason I think we should also combine this with the predefined-slots idea, so country is always slot 2 and TLC is always slot 6, with unneeded slots left null as needed.

For both faceted search and leaderboard search the bottleneck is the AncestorCollector join. Maybe this is a common problem and there is some sort of index I am not doing right. This seems to be a common theme that comes up:

http://en.wikipedia.org/wiki/Nested_set_model

The nested interval model may actually be even more helpful.

http://en.wikipedia.org/wiki/Nested_intervals

Whichever way we go there is potentially a lot of work, so we really need to nut it out and agree on a path forward, then test and demonstrate the improvement. The last thing I want is for one of us to do heaps of work and the other to ask why did you do it that way.


scd commented 11 years ago

For leaderboard search we are going to add a Node Account statistics table, which will have a baseline and incremental mode and change triggers for all of the system processes like reparent. Outwardly this table should have karma and ascent statistics on it. Karma is pretty straightforward, but the ascent stuff is a little more tricky because of gym ascents, hits and dogs.

We need to decide on the exact business rules for ascents:

brendanheywood commented 11 years ago

Yeah, makes sense. I'd suggest that the stat totals for tick types be mutually exclusive so they can be separated and/or added up more easily downstream.

ie instead of having an 'all ticks' total, a 'ticks excluding dogs' total and a 'ticks excluding targets' total, we'd have 'pure ticks', 'dog ticks' and 'target ticks'.

scd commented 11 years ago

Ahh, but there is something else we have got to think about. One of the reasons why we are doing this is speed. If I create the right indexes on the stat (a Node, Account, AscentCount index) then theoretically the leaderboard should be practically instantaneous, only requiring a look at the first 10 records no matter how many people are in the table. If we are looking at the sum of multiple statistics then we lose the super fast index, resulting in the whole set of node records being looked at. Don't forget that a leaderboard is based on an ordered list. Indexes can handle this, but as soon as you add a function MySQL has to assess each record.

If we are motivated by speed, which is the only motivation of this table, then we should build the fastest possible architecture.

So I would be happy to do individual tick type counts as well, but we should also do our default aggregated count.
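To illustrate the trade-off (a sketch against the hypothetical AccountNodeStatistics layout above, assuming the individual tick-type columns existed; ids made up):

    -- Fast: the (Node, Karma) index is walked in reverse and only the
    -- first 10 entries are read, no matter how many climbers there are.
    SELECT Account, Karma FROM AccountNodeStatistics
    WHERE Node = 11740915 ORDER BY Karma DESC LIMIT 10;

    -- Slow: the computed sum defeats any index, so every record for the
    -- node is read and sorted before the top 10 can be returned.
    SELECT Account, Onsights + Trads + Boulders AS Ticks
    FROM AccountNodeStatistics
    WHERE Node = 11740915 ORDER BY Ticks DESC LIMIT 10;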

This means we still have to make a business decision on what that default will be.

Does this make sense?

brendanheywood commented 11 years ago

Yes, perfect sense. So the real question in turn becomes what the leaderboard measure(s) are. We currently have 5, so it should be exactly those 5 that we store at a minimum (ascents, unique ascents, karma, distance, comp points).

scd commented 11 years ago

Yup, good point. Still, we never really had the discussion about what should and should not be included in the 'ascents' leaderboard, as the leaderboard stuff was really a suck-it-and-see model of development. It currently includes hits, ghosts and dogs, which I imagine we will get shit for pretty quickly if the leaderboard becomes an important tool.

What is going to motivate me to get started on this is getting your two lists in the climbers tab working quickly. Other leaderboard stuff is less important and in the noise. Karma is well defined, and I think ascents is not correctly defined at the moment.

I have just looked at the ascent count presented at the avatar level in the system and it includes dogs, ghosts and hits.

I think leaderboard ascent stats should align with the avatar ascent stat. This means we should either do a simple ascent count in the leaderboard or we should exclude hits, ghosts and dogs in both leaderboard and avatar.

brendanheywood commented 11 years ago

Ok, but there are two issues that we could separate and handle at different times. Whatever business logic we have to calc 'ascents' and 'karma' in the leaderboard should be the same for the climbers tab. After this work is done they should both read using the same function that calls the same query. How that data gets populated isn't critical right now; if we change the business rules for ascents to not include dogs, then that should apply across the board, and those queries won't change at all, just the recalculated totals in this table.

The 'ascents' and 'karma' etc which are on the user profile should use the same logic, and in fact even the same table and query. Did you plan to aggregate up to the world node, or just under it? ie could you ask 'what is Brendan's karma at the world node' to get my total? This would make things really consistent everywhere, and we could potentially even ditch some of the existing stats as they would be duplicated.
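For example, reading one climber's karma at the world node could be a single-row lookup on the same table the leaderboard reads (ids illustrative; table and column names as in the test query further down):

    SELECT TotalKarma
    FROM RollupAccountNodeStatistic
    WHERE Account = 8262313 AND NetworkNode = 11740915;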

scd commented 11 years ago

Now we are talking implementation details, and yes, I agree that once we have decided two things are linked they should be sourced from the same function. However, there are some complexities that you should be aware of:

The statistics packages are pretty old code which is well overdue for some refactoring. I cannot do what you have specified without a certain amount of refactoring, and if we tie refactoring the statistics package to this issue then I will just put this issue off for a long time.

There is very little duplication of statistics and where there is duplication it is because we are not using the full potential of the system. For example the only reason why account ascents stats and the new world account ascents stats may appear duplicated is that there is only one world root node. Conceptually we may have multiple root nodes, or private networks.

You are right about using the world account statistics for the avatar, not the account statistics. The avatar should be associated with the stuff done on the world network, so once we have built the new node account statistics we can cut over. Problem disappears.

No scope to ditch the existing stats entirely though, because we may one day implement alternate private networks.

scd commented 11 years ago

OK, I have configured the stats table to store stats for the account/node dimension. I have also put an index on each node/stat pair so that any leaderboard should be blindingly fast. And it is pretty good. The new table has one million records :)

time mysql --password= --socket=/var/run/mysqld/mysqld-cids-development.sock -e "SELECT SQL_NO_CACHE S.Account AS ID, S.TotalKarma AS K FROM RollupAccountNodeStatistic AS S WHERE S.NetworkNode=11740915 ORDER BY S.TotalKarma DESC LIMIT 0,20;" cids
+-----------+-------+
| ID        | K     |
+-----------+-------+
|   8262313 | 58580 | 
|  10721869 | 57246 | 
|  11183449 | 39976 | 
|   9441283 |  2743 | 
|   9623443 |  1551 | 
|   7783117 |  1255 | 
|  10765333 |  1101 | 
|   9280489 |  1023 | 
|   7727431 |   943 | 
|  10190965 |   856 | 
|   7945213 |   854 | 
|   9786175 |   844 | 
| 228084030 |   792 | 
|  10945087 |   762 | 
|   8870047 |   715 | 
|   8971927 |   627 | 
| 212853261 |   545 | 
|  10446043 |   528 | 
|  10825177 |   527 | 
|  10523707 |   500 | 
+-----------+-------+

real    0m0.021s
user    0m0.000s
sys 0m0.016s

There are other queries which will also now benefit from this table.

I still have to integrate this into the Perl code, but it is looking very promising. It pretty much convinces me we have to bite the bullet on the other star schema improvement we have identified. I will put forward the exact proposal sometime later.

scd commented 11 years ago

Inspired by my last piece of work, I want to press on with the other star schema improvement. The biggest win here will be area ascent and activity feeds, so that will be the prime focus (ie the activity and ascent tables). That being said, it is exactly the same work for photos, topos and discussions, so I will make sure we can just plug these tables in as well if it works.

Basically the deepest index node is 11 levels deep (root being level 0). There are only 191 of these guys. The histogram is something like:

    Level 11: 191
    Level 10: 3,448
    Level 9:  22,016
    Level 8:  61,343
    Level 7:  104,627
    Level 6:  76,238

... and it gets less and less below that.

An ascent/activity is already associated with a node, so we don't have to include the node itself. The histogram is actually a little bit deceiving though, because most of the level 7 nodes will actually be routes.

This means that if we have N0, N1, ... N9 fields on the ascent table then this would cover 99.9% of the index. Going down to N8 would cover 99% of the index, and N7 90%.

Any query at level 9 or level 10 should be fairly fast anyway with our existing AncestorCollector table join.

So I propose we store 9 extra fields in the ascent and activity tables (N0,...N8).

I am not going to worry about whether or not the node is a TLC, as this puts arbitrary constraints on the thing.

Something that is already working in our favor is that every node is already storing a level and that we only have one ancestor path to the world node.

This means if I am doing a query on Arapiles (level 4) then I do a join with N4. Fast.

I also make sure that I index N0, .. N8 each with create date. Time based feeds, super fast.

If we query on a node below level 8 then we just use the current method. Not so fast, but not too bad at the deep ends of the index.

So this is the easy bit. The hard stuff is then managing the levels on reparent etc.

I need to think about the following processes:

  • adding an ascent: create the N0, ... Nx fields.
  • logging activity: ditto
  • reparent area: every descendant ascent & activity will need their Nx fields redone. Note we also must redo directly associated activity nodes
  • reparent route: every associated ascent & activity redone
  • merging: brain seizure
  • deleting: by definition we cannot delete nodes with ascents or activity.

Does this make sense or have I missed something????
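A minimal sketch of those fields and indexes, assuming MySQL (table name, column names and ids are illustrative):

    -- Denormalised ancestor columns on the ascent table; N0 is the world.
    ALTER TABLE Ascent
        ADD COLUMN N0 INT NULL, ADD COLUMN N1 INT NULL, ADD COLUMN N2 INT NULL,
        ADD COLUMN N3 INT NULL, ADD COLUMN N4 INT NULL, ADD COLUMN N5 INT NULL,
        ADD COLUMN N6 INT NULL, ADD COLUMN N7 INT NULL, ADD COLUMN N8 INT NULL;

    -- One (Nx, CreateDate) index per level so time-based feeds walk an index.
    CREATE INDEX AscentN4Date ON Ascent (N4, CreateDate);

    -- Arapiles is level 4, so its ascent feed filters on N4 alone, no join.
    SELECT * FROM Ascent
    WHERE N4 = 12345
    ORDER BY CreateDate DESC LIMIT 20;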

brendanheywood commented 11 years ago

On Thu, Aug 22, 2013 at 8:20 PM, Simon Dale notifications@github.com wrote:

> So I propose we store 9 extra fields in the ascent and activity tables (N0,...N8).

Cool. I'll let you make the call about which level; it's really space vs speed, so I'd err towards higher if we've got the disk.

> I am not going to worry about whether or not the node is a TLC as this puts arbitrary constraints on the thing.
>
> Something that is already working in our favor is that every node is already storing a level and that we only have one ancestor path to the world node.

How do the linked nodes near borders work? We're not using them much, or at all, so we could drop that if needed.

> So this is the easy bit. The hard stuff is then managing the levels on reparent etc.
>
> Does this make sense or have I missed something?

Yeah, this seems pretty solid. The reparenting stuff I'm guessing is a background process that churns through, so there could be 10 seconds after a restructure where searches will be misaligned. I don't think that's an issue. It would be nice if we could set a flag somewhere so we can show in the template that stuff is going on, but I'm not sure how we can add and remove this flag any faster than just processing the data itself.



scd commented 11 years ago

Sounds like a plan.

Not using the border link stuff yet, which I am glad about, because it would ruin this solution.

My thoughts exactly about the flag, because it would really have to apply to every descendant.

Maybe we could just show an indication that there is something in the queue being processed, without worrying about whether it applies to this search or not.

brendanheywood commented 11 years ago

Sweet, lock it in.


scd commented 11 years ago

Ticklist for Simon:

scd commented 11 years ago

I think this issue is not done. Two other issues forked off from this are #1211 and #1209