quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.91k stars 666 forks source link

Geo-search #44

Open fulmicoton opened 7 years ago

fulmicoton commented 7 years ago

This would require a longlat field type... (and coords if we want simple 2D as well

Then the user would want to perform queries such as

Fiedzia commented 7 years ago

Two more practical requirements that I would love to have:

  1. Store multiple values
  2. Return which value was found

those two things have a range of usecases that are very practical for me

hwchen commented 5 years ago

Curious about the status of this issue. Is this still a desired feature?

I have a need for geo queries in a future project, and would be willing to help implement, although I might need some guidance.

fulmicoton commented 5 years ago

@hwchen Sorry about the late answer

This is a desired feature but this is also a lot of work. I'm thinking of following Lucene idea of using Bkd Tree as it fits well the search indexing scheme.

You can work on it if you are confident, but be careful... This is one of the hardest ticket around and I won't be able to give much guidance.

hwchen commented 5 years ago

Thanks, I'll keep this in mind. Maybe in a year or so I'll pick this up if nobody else has :)

natew commented 4 years ago

Plus one on this feature, very relevant to any local app - Uber Eats, Yelp, Maps, Delivery.

shuoli84 commented 3 years ago

FYI: did a quick search, found one nice bkd tree impl https://github.com/peermaps/eyros

PSeitz commented 3 years ago

Nice, rucene also has a bkd implementation https://github.com/zhihu/rucene/blob/master/src/core/util/bkd/bkd_writer.rs

shuoli84 commented 3 years ago

FYI: did a quick search, found one nice bkd tree impl https://github.com/peermaps/eyros

Hmm, seems the license of eyros is not compatible with MIT.

shuoli84 commented 3 years ago

Another FYI.

From https://www.elastic.co/blog/lucene-points-6.0, the blog which announced lucene spatial support, there is a quote on future work:

Currently, dimensional values must be single points, but some geo-spatial use cases require indexing shapes as well, which can be done with a generalization of the k-d tree called an R-Tree, for example, and I expect that will be a future improvement to Lucene (patches welcome again!).

And there is a rtree crate called rstar, which provides a good in mem rtree. https://crates.io/crates/rstar

Jure-BB commented 2 years ago

Point search specific methods are easier to implement since they don't require specialized trees and they also perform better. It would make sense to implement point search first as it is probably most commonly used, and implement indexing of shapes at a latter time.

Steps to index geo coordinates (points):

  1. Convert coordinates form double floating points to integers. If they are converted to 32bit (by multiplying with 10,000,000) we get resolution of about 1cm. OSM stores coordinates as 32 bit integers and Solr does too.
  2. Bit-interleave latitude and longitude into a single 64 bit integer. This value is the Morton code of a Z-order curve. (There are other space filling curves, like Hilbert curve, that give better spatial locality, but due to additional computational complexity it is questionable, if they would provide better performance overall. Morton code calculations are simpler and can also be accelerated with SIMD.)
  3. Index that 64-bit integer to any tree that can perform 1D range search (eg Btree)

Steps to query are a bit more complex:

  1. Get AABB (or any other shape) of the area we want to query
  2. Calculate where the Z-curve intersects with the AABB edges to get ranges we need to query. This is the complex part. Some names of these functions are BigMin/LitMax, NextJumpIn, GetNextZ-Address. In a more recent paper[1] I saw a method of using quadtrees to calculate these regions, and if I remember correctly, it also performed a bit better.
  3. Merge ranges into longer continuous ranges where possible.
  4. Query ranges to get the Morton codes of the points inside that area.
  5. Since, ranges intersecting with the query-shape can cover a bit larger area that we need (in order too keep the number of ranges to a sensible amount), we need to filter resulting points, to exclude points that falls outside of exact query-area. Morton codes can be converted back to latitude and longitude coordinates in order to perform that filtering.

[1] I can look-up the name of paper using quadtrees to calculate ranges, if anyone is interested.

umutbasal commented 1 year ago

+1

some resources other engines uses

adamreichold commented 1 year ago

See https://github.com/quickwit-oss/tantivy/pull/2130 for a version using in-memory rstar indexes and warmers spelled out. If there is sufficient interested, I would be willing to start an add-on crate using that approach.

(I am also working on flat K-D trees and R trees which could be memory-mapped directly from their on-disk representation avoiding the in-memory requirement. They do come from a simulation background though, i.e. generic N dimensional geometry. I guess one would want to specialize on Hilbert-packed two-dimensional trees for geographic search.)

nolotz commented 2 months ago

Hi everyone,

@adamreichold , have you or anyone else made any progress on this topic?

Swoorup commented 1 month ago

Point search specific methods are easier to implement since they don't require specialized trees and they also perform better. It would make sense to implement point search first as it is probably most commonly used, and implement indexing of shapes at a latter time.

Steps to index geo coordinates (points):

  1. Convert coordinates form double floating points to integers. If they are converted to 32bit (by multiplying with 10,000,000) we get resolution of about 1cm. OSM stores coordinates as 32 bit integers and Solr does too.
  2. Bit-interleave latitude and longitude into a single 64 bit integer. This value is the Morton code of a Z-order curve. (There are other space filling curves, like Hilbert curve, that give better spatial locality, but due to additional computational complexity it is questionable, if they would provide better performance overall. Morton code calculations are simpler and can also be accelerated with SIMD.)
  3. Index that 64-bit integer to any tree that can perform 1D range search (eg Btree)

Steps to query are a bit more complex:

  1. Get AABB (or any other shape) of the area we want to query
  2. Calculate where the Z-curve intersects with the AABB edges to get ranges we need to query. This is the complex part. Some names of these functions are BigMin/LitMax, NextJumpIn, GetNextZ-Address. In a more recent paper[1] I saw a method of using quadtrees to calculate these regions, and if I remember correctly, it also performed a bit better.
  3. Merge ranges into longer continuous ranges where possible.
  4. Query ranges to get the Morton codes of the points inside that area.
  5. Since, ranges intersecting with the query-shape can cover a bit larger area that we need (in order too keep the number of ranges to a sensible amount), we need to filter resulting points, to exclude points that falls outside of exact query-area. Morton codes can be converted back to latitude and longitude coordinates in order to perform that filtering.

[1] I can look-up the name of paper using quadtrees to calculate ranges, if anyone is interested.

what is the name of the paper, I am interested.

Jure-BB commented 1 month ago

@Swoorup

It's Using a Space Filling Curve for the Management of Dynamic Point Cloud Data in a Relational DBMS (2016)

There's also a source code: https://github.com/stpsomad/DynamicPCDMS

https://isprs-annals.copernicus.org/articles/IV-2-W1/107/2016/isprs-annals-IV-2-W1-107-2016.pdf

Slides: https://pdfs.semanticscholar.org/287b/2fdad23e6110d4a524efcd87c70f8696a15a.pdf

Another paper I also found interesting/helpful is: Towards a Painless Index for Spatial Objects

Swoorup commented 1 month ago

@Swoorup

It's Using a Space Filling Curve for the Management of Dynamic Point Cloud Data in a Relational DBMS (2016)

There's also a source code: https://github.com/stpsomad/DynamicPCDMS

https://isprs-annals.copernicus.org/articles/IV-2-W1/107/2016/isprs-annals-IV-2-W1-107-2016.pdf

Slides: https://pdfs.semanticscholar.org/287b/2fdad23e6110d4a524efcd87c70f8696a15a.pdf

Another paper I also found interesting/helpful is: Towards a Painless Index for Spatial Objects

Thanks. One thing that is not immediately obvious to me, is storing not just points but objects with AABB. The literature around appears to be either storing all the morton points that intersects against the AABB perhaps directly as points or morton ranges.