ncolomer / elasticsearch-osmosis-plugin

An Osmosis plugin that indexes OpenStreetMap data into elasticsearch
Apache License 2.0

Data indexing improvements #6

Closed ncolomer closed 11 years ago

ncolomer commented 11 years ago

This PR batches several performance improvements to OSM data indexing.

apavillet commented 11 years ago

This is 10-15 times faster, thanks :)

ncolomer commented 11 years ago

Glad to hear! What about the response time of your query? Is it better too? It should be, as I deactivated the _all field and the analyzer for the tags field :)
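For reference, a sketch of what those two settings look like in a pre-1.x mapping (the type and field names here are illustrative, not necessarily the plugin's actual mapping):

```json
{
    "node": {
        "_all": { "enabled": false },
        "properties": {
            "tags": {
                "type": "string",
                "index": "not_analyzed"
            }
        }
    }
}
```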

apavillet commented 11 years ago

@ncolomer As for the response time, I'm afraid it's still not good: the first query can take up to 3 minutes. Through testing I've narrowed it down to the "bounding box" query, which is responsible for taking so long. I don't really know what to do about it; I'm testing different configurations right now.

apavillet commented 11 years ago

@ncolomer I read somewhere that using "type": "indexed" on those queries can help performance, but for that:

Use optimize_bbox set to indexed (the default is memory). For that though, you will need to enable specific lat lon indexing in the geo_type mapping (and reindex).

Do you know how to enable indexing on lat/lon?

Thanks

ncolomer commented 11 years ago

Hi @apavillet,

I just indexed a part of the Ile-de-France extract (ended up with ~16.5m docs). Then, I executed the query you provided several times. The first query took about 11s to return (ES was probably refreshing or warming up its geo indexes), but the following ones, with varying locations, were all sub-second (actually about 200ms each).

What OSM extract are you indexing? France? Ile-de-France? Do you filter the extract using Osmosis?

You may also consider filtering data with Osmosis prior to indexing it in elasticsearch: it provides a lot of parameters that allow you to filter entities by tag values (refer to the Osmosis advanced usage page). This could greatly reduce the total number of docs indexed and thus improve query performance.
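For example, a hypothetical Osmosis invocation using the tag-filter (`--tf`) task to keep only highway ways and the nodes they reference (file names and the tag to keep are illustrative; adapt them to your use case):

```shell
# Keep only highway ways (plus the nodes they use) from a PBF extract.
osmosis --read-pbf file=ile-de-france.pbf \
        --tf accept-ways highway=* \
        --tf reject-relations \
        --used-node \
        --write-pbf file=ile-de-france-highways.pbf
```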

I read somewhere that using "type": "indexed" on those queries can help performances

You are probably referring to this post on the es mailing list. The associated documentation can be found on the geo distance filter page. You should be able to activate the options with a query like this:

{
    "filtered": {
        "query": {
            ...
        },
        "filter": {
            "geo_distance": {
                "distance": "0.3km",
                "distance_type": "plane",
                "optimize_bbox": "indexed",
                "location": [
                    2.330646514892578,
                    48.84768833471799
                ]
            }
        }
    }
}

I'm curious to know the impact on your side. Quick tries show me that "distance_type": "plane" makes the query lose precision but does not decrease response time, and "optimize_bbox": "indexed" throws a 500, probably because I currently index entities' locations as GeoJSON points (i.e. lon first, then lat):

Can also have values of indexed to use indexed value check (make sure the geo_point type index lat lon in this case)

But actually, I'm not sure I understand this part of the documentation :)

apavillet commented 11 years ago

Hello,

Yes, I did start with the same (Ile-de-France) and it was fine, but then in production I indexed the whole france.pbf, which is around 150 million points. It's true I don't filter with Osmosis and could definitely do that to reduce the number of points.

For the queries, I created the mapping myself and could try the "optimize_bbox": "indexed" query, which is a little better but not quite fast enough (you need to set "lat_lon": true on geo_point fields in the mapping for this query to work).
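For reference, a sketch of that option in a pre-1.x mapping (the type and field names are illustrative; `lat_lon` enables separate indexing of the latitude and longitude values of a geo_point):

```json
{
    "node": {
        "properties": {
            "location": {
                "type": "geo_point",
                "lat_lon": true
            }
        }
    }
}
```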

I'm currently trying to upgrade my servers with more heap space and will run some other tests tomorrow; I'll keep you posted (we also created a thread on the ES groups to see if anyone has a lead or a similar issue).

Edit: I also think that ES is trying to load everything in RAM before querying, which causes problems for a 150M+ point index. I don't know if you have a way to reduce the index size?

ncolomer commented 11 years ago

I'm currently trying to upgrade my servers with more heap space and will run some other tests tomorrow; I'll keep you posted (we also created a thread on the ES groups to see if anyone has a lead or a similar issue).

Yeah, I'm very interested in that, thanks for keeping me posted ;)

One clarification about Sébastien's post nonetheless: this plugin actually aims to take advantage of the elasticsearch geo_shape feature. For now, it first builds and maintains a raw index (the one you are using) containing entities exactly as they are provided by Osmosis. This means that, for ways for instance, I only have node ids but not the related locations (relational data). I did this as a bootstrap to build other "specialized indexes" where locations are indexed as geo_shape this time (as for the highway index builder).

But wait... writing these lines made me think of something else: if we can assume Osmosis always provides entities in order (bound, node, way and relation), which I'm verifying now, we will be able to directly build the final indices without any temporary index (at the cost of indexing time, since ways will require retrieving all their nodes from elasticsearch). This is a great occasion to use the geo_shape feature directly on each entity. What do you think?
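A minimal sketch of that single-pass idea (hypothetical, in Python rather than the plugin's Java, with an in-memory dict standing in for the "retrieve nodes from elasticsearch" step):

```python
# Hypothetical single-pass index builder, assuming Osmosis emits
# entities in order: bound, then nodes, then ways. Every node is seen
# before any way that references it, so way geometries can be resolved
# immediately.
node_locations = {}

def index_node(node_id, lon, lat, tags):
    # Remember the location so later ways can resolve their node refs
    node_locations[node_id] = [lon, lat]
    return {"shape": {"type": "point", "coordinates": [lon, lat]},
            "tags": tags}

def index_way(way_id, node_refs, tags):
    # All referenced nodes were indexed earlier, so locations are known
    coords = [node_locations[ref] for ref in node_refs]
    return {"shape": {"type": "linestring", "coordinates": coords},
            "tags": tags}
```

The trade-off mentioned above is visible here: nodes must be kept addressable (in elasticsearch, or in memory as in this sketch) until every way that references them has been processed.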

There are still some pending questions about the geo_shape feature (performance, precision, usage). I already tried to reach Chris Male (the author of this feature) about these but have had no answer yet.

Edit: I also think that ES is trying to load everything in RAM before querying, which causes problems for a 150M+ point index. I don't know if you have a way to reduce the index size?

As far as I know, to reduce index size, you can:
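One of the options discussed elsewhere in this thread is compressing the stored _source field; as a sketch, in a pre-1.x mapping that would look like this (the type name is illustrative):

```json
{
    "node": {
        "_source": { "compress": true }
    }
}
```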

apavillet commented 11 years ago

I thought the exact same thing; the only problems I see are what you mentioned ("Osmosis always provides entities ordered") and the performance of geo shapes, which from what I read is not as good for points as geo_point. But this would definitely help and make things a lot easier.

An update: I filtered the osm.pbf file, keeping only nodes with "names", and the resulting file is far lighter (around 450,000 nodes), which also makes it easier to search.

As for compressing the source field, I did that and it did not change much.

ncolomer commented 11 years ago

and the performance of geo shapes, which from what I read is not as good for points as geo_point

Can you give me some pointers to that? I need to know whether I should invest time in this or not :)

An update: I filtered the osm.pbf file, keeping only nodes with "names", and the resulting file is far lighter (around 450,000 nodes), which also makes it easier to search.

Great! Are the metrics satisfying now, i.e. NRT queries (sub-second)?

As for compressing the source field, I did that and it did not change much.

I think the gain is more about size on disk; is that what you observed?

apavillet commented 11 years ago

One other problem, but one which I think will be solved in future patches (maybe already in 0.90?), is false positives: (https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/geo$20shape$20point/elasticsearch/kPIGzVWgJAM/YSEnLAlmm1wJ and https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/geo$20shape$20point/elasticsearch/J5EkhNnjMSw/K2CEco1-oEcJ)

I can't find the thread on the Google groups about performance, but it makes sense, since geo_point is designed for points while shapes can be anything from a point to a polygon, so the parser has to perform more operations. What I don't know is how much impact it has (could be minor).

Metrics are not yet satisfying, but I'm working on that and they should be soon ;) As for the size on disk, I must confess I don't really have a problem with that right now, so I didn't pay attention :)

ncolomer commented 11 years ago

Thanks for these valuable links covering both precision and performance! I'll read them with close attention ;)

I can't find the thread on the Google groups about performance, but it makes sense, since geo_point is designed for points while shapes can be anything from a point to a polygon, so the parser has to perform more operations. What I don't know is how much impact it has (could be minor).

Yep, I need to test that. I already started a new branch and will work on this as soon as possible.

ncolomer commented 11 years ago

Ok, so according to what I read, here is the situation:

Here are some links with all the details:

apavillet commented 11 years ago

Okay, sounds good, except that I think intersects only works for boundaries that cross each other, right? Which means I'm not sure you could, say, find all subways or restaurants of a city?

ncolomer commented 11 years ago

Okay, sounds good, except that I think intersects only works for boundaries that cross each other, right?

I had the same concern but after re-reading the doc, not necessarily:

intersects – Finds those indexed shapes which intersect with the filter shape. Intersection occurs when the two shapes have at least one shared grid hash. Because of current limitations of the algorithm, very large indexed shapes are not deemed to intersect with very small filter shapes. However, smaller index shapes will intersect with larger filter shapes.

My understanding of that (combined with what I read) is the following: if you use a polygon as your shape filter, its area will be converted into a set of grid hashes (hashes are actually Lucene terms). The query will thus search for features that share at least one grid hash. The within relation is stricter in that every grid hash must be contained, instead of only one matching (probably more expensive).

The limitations come when you index very large geo objects (in terms of distance). Moreover, there are also false positives due to the precision of the grid hash (this will be reduced with the upcoming RecursivePrefixTreeStrategy).
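For illustration, an intersects query of that kind might look like the following (the "location" field name and the envelope coordinates are made up, and intersects was the default relation of the geo_shape filter at the time):

```json
{
    "filtered": {
        "query": {
            "match_all": {}
        },
        "filter": {
            "geo_shape": {
                "location": {
                    "shape": {
                        "type": "envelope",
                        "coordinates": [
                            [2.25, 48.90],
                            [2.42, 48.81]
                        ]
                    }
                }
            }
        }
    }
}
```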

Anyway, we need to check this, and future improvements, via further tests over OSM extracts.

In the meantime, we can still match nodes and ways using a geo_point (or array of geo_point) mapping and a distance filter... but as you already mentioned, your benchmarks highlighted performance issues with a big dataset (the grid hash approach is more scalable in such use cases).

apavillet commented 11 years ago

Hi @ncolomer! I just ran across this (https://github.com/jillesvangurp/osm2geojson) and thought maybe it could help you in some way. Hope it does! :)

ncolomer commented 11 years ago

Hey @apavillet ! Thanks for pointing this to me, I'll take a close look ;)