mstein / elasticsearch-grails-plugin

ElasticSearch grails plugin
Based on Graeme Rocher's initial stub. Note that it is still at an early stage.
Other
62 stars 164 forks

A selection of minor changes, including coping with larger data sets #3

Open spither opened 13 years ago

spither commented 13 years ago

Hi,

These commits cover several little problems I found and fixed, including adding support for a cluster.name config option and allowing mass-index operations to survive larger data sets (e.g. 50,000+ rows on a 20-field table).

Simon

mstein commented 13 years ago

Hi spither, sorry for the very late answer, but I was (and still am) kind of on vacation, so I haven't really looked into the project for some time.

I took a look at your changes and noticed that some of them are already done on HEAD, like the cluster.name config (it was committed just a few days before your pull request, actually). I'll accept the 2a6b6fb commit at least, "force per-domain methods to always set domain based filters".

For your Hibernate session split (which is a good idea), there's just one thing that bothers me: you're assuming that the id field of the domain instances is a numeric value (max() on the id, sort on the id), which may not always be the case.

spither commented 13 years ago

Sorry, I didn't notice the cluster.name additions - I'll rebase and merge things in my fork soon.

If you've got any suggestions on changing the Hibernate session split, I'd be happy to put in a little extra work to improve it?

mstein commented 13 years ago

Using an offset (firstResult/maxResults) instead of "findAllByIdGreaterThan" is probably the first step to look into.

spither commented 13 years ago

Unfortunately using an offset isn't suitable as performance of offsets in MySQL (and possibly others?) sucks:

http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
http://forums.mysql.com/read.php?20,428637,428771

Which means that approach definitely isn't going to be suitable for anyone using MySQL, which includes me in this case.

Perhaps a config option to switch between offset and numeric id? It could default to offset so that it works out of the box, but to make it perform well for MySQL, the config option could be set (assuming the user has a numeric id)?
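To make the proposal concrete, here is a minimal sketch of such a switch. It is plain Python over an in-memory list, not plugin code: the config key `bulkIndexPaging`, the mode names `offset` and `numericId`, and every function name are hypothetical, chosen only for illustration.

```python
# In-memory stand-in for a table, sorted by id; ids have gaps on purpose.
ROWS = [{"id": i, "name": f"row {i}"} for i in range(1, 51) if i % 5]


def page_by_offset(batch_size):
    """Generic mode: works for any id type, but re-skips `offset` rows each time."""
    offset = 0
    while True:
        batch = ROWS[offset:offset + batch_size]
        if not batch:
            return
        yield batch
        offset += len(batch)


def page_by_numeric_id(batch_size):
    """Opt-in mode: resumes after the last id seen, so it assumes numeric ids."""
    last_id = 0
    while True:
        batch = [row for row in ROWS if row["id"] > last_id][:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]


# Hypothetical mode names mapped to the paging strategies above.
PAGERS = {"offset": page_by_offset, "numericId": page_by_numeric_id}


def pager_from_config(config):
    """Pick a strategy; default to 'offset' so it works out of the box."""
    mode = config.get("bulkIndexPaging", "offset")
    if mode not in PAGERS:
        raise ValueError(f"unknown paging mode: {mode!r}")
    return PAGERS[mode]


default_pager = pager_from_config({})
mysql_pager = pager_from_config({"bulkIndexPaging": "numericId"})

default_ids = [row["id"] for batch in default_pager(7) for row in batch]
mysql_ids = [row["id"] for batch in mysql_pager(7) for row in batch]
print(default_ids == mysql_ids)
```

Both modes visit the same rows in the same order; only the way each batch is located differs, which is why the choice could be left to the user.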

mstein commented 12 years ago

A small update about this issue (yeah I know, it was about time... sorry about that): I've implemented the solution with offsets (which should be generic), and I'd like to run a few benchmarks with MySQL to see whether performance is really that bad and compare it with the numeric id-based solution. If there is a noticeable difference, then I'm OK with including the id-based implementation as a user-configurable mode. Have you run a benchmark yourself and noticed that much of a difference in performance?

spither commented 12 years ago

These are some very quick numbers (from direct MySQL queries, not indexing code) on a fairly small testing table. They were taken on an unloaded system, and while I'm only going to paste one example of each command, I've run them several times and the timings were very consistent. I've omitted the actual data returned.

Just to size the table:

mysql> select count(*), max(id) from hit_info;
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|  1689490 | 1690211 |
+----------+---------+

mysql> select * from hit_info limit 0, 1;
1 row in set (0.00 sec)

mysql> select * from hit_info limit 1000, 1;
1 row in set (0.00 sec)

mysql> select * from hit_info limit 100000, 1;
1 row in set (0.03 sec)

mysql> select * from hit_info limit 1000000, 1;
1 row in set (0.32 sec)

mysql> select * from hit_info limit 1500000, 1;
1 row in set (0.49 sec)

So indexing row 1 million to row 1.5 million, 1000 rows at a time, would take 500 queries averaging about 0.4 seconds each, which is 200 seconds or about 3.3 minutes. (My tests on some wide tables needed far fewer rows at a time to avoid memory exhaustion, but 1000 makes the maths easy.)
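The estimate can be reproduced in a few lines. This only redoes the arithmetic from the timings quoted above; treating 0.4 s as the average of the 0.32 s and 0.49 s measurements is the same simplification made in the text.

```python
# Back-of-the-envelope cost of LIMIT-based paging over rows 1.0M to 1.5M.
first_row = 1_000_000
last_row = 1_500_000
batch_size = 1000

queries = (last_row - first_row) // batch_size
avg_seconds = 0.4  # rough midpoint of the 0.32 s and 0.49 s timings

total_seconds = queries * avg_seconds
total_minutes = total_seconds / 60

print(f"{queries} queries, about {total_seconds:.0f} s ({total_minutes:.1f} min)")
# prints: 500 queries, about 200 s (3.3 min)
```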

However an equivalent query using IDs:

mysql> select * from hit_info where id > 1500000 limit 1;
1 row in set (0.00 sec)

...is too fast to measure in such a simple test. Quite clearly it's going to be an awful lot faster than the LIMIT based approach though.

Sorry the numbers are direct SQL queries rather than running test code but hopefully they illustrate the problem (with MySQL!).
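For anyone who wants to try the two paging styles side by side, here is a small self-contained sketch. It makes several assumptions: it uses Python's built-in sqlite3 module rather than MySQL, a tiny in-memory table, and a made-up `payload` column, so it shows the shape of the two query patterns rather than the MySQL timings above. It also adds an explicit ORDER BY id, which the quick queries above leave out.

```python
import sqlite3


def batches_by_offset(conn, batch_size):
    """Page with LIMIT/OFFSET; the database has to skip `offset` rows each time."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM hit_info ORDER BY id LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        offset += len(rows)


def batches_by_id(conn, batch_size):
    """Page by the last id seen; each query can seek directly via the primary key."""
    last_id = 0  # assumes numeric, positive ids
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM hit_info WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hit_info (id INTEGER PRIMARY KEY, payload TEXT)")
# Leave gaps in the ids, as in the real table where max(id) > count(*).
conn.executemany(
    "INSERT INTO hit_info (id, payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 5001) if i % 7 != 0],
)

offset_ids = [row[0] for batch in batches_by_offset(conn, 500) for row in batch]
keyset_ids = [row[0] for batch in batches_by_id(conn, 500) for row in batch]
print(len(keyset_ids), offset_ids == keyset_ids)
```

The id-based loop does not care about gaps in the id sequence, since it only ever asks for "the next rows after the last one I saw".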

confile commented 11 years ago

@spither do you have an updated version of the elastic search plugin for version 0.90.3?