masumsoft / express-cassandra

Cassandra ORM/ODM/OGM for NodeJS with support for Apache Cassandra, ScyllaDB, Datastax Enterprise, Elassandra & JanusGraph.
http://express-cassandra.readthedocs.io
GNU Lesser General Public License v3.0
232 stars 68 forks source link

multiple gte/lte query #144

Closed measwel closed 6 years ago

measwel commented 6 years ago

Dear Masum,

I have reached a very interesting point in the development, namely doing geospatial search based on integer geohashes. The geohashes are 52 bit integers ( max of javascript ). For a given center point and radius I can calculate the following ranges:

[ 3672437556248576, 3672506275725312 ],
[ 3672574995202048, 3672918592585728 ],
[ 3672987312062464, 3673056031539200 ],
[ 3673193470492672, 3673330909446144 ]

The center point is: 3672837726667608

To find points within these boundaries, all that is needed is to find geohash locations within the array lower and upper bounds for all 4 entries.

My code:

    for (var i = 0; i < ranges.length; i++) {
      const min = database.models.datatypes.Long.fromString(ranges[i][0].toString());
      const max = database.models.datatypes.Long.fromString(ranges[i][1].toString());
      entity_data.query['geohash'] = {'$gte': min, '$lte': max};
    }

UPDATE 1

The integer geohash query does seem to work! I am storing geohashes as bigint in cassandra and using sasi indexes on relevant fields:

  custom_indexes: [
    {// enable search by radius and geohash
      on: 'geohash',
      using: 'org.apache.cassandra.index.sasi.SASIIndex',
      options: {
        'mode': 'SPARSE'
      }
    },
    {// enable search on category
      on: 'category',
      using: 'org.apache.cassandra.index.sasi.SASIIndex',
      options: {
        'mode': 'PREFIX'
      }
    },
    {// enable full text search sasi index on description field
      on: 'description_en',
      using: 'org.apache.cassandra.index.sasi.SASIIndex',
      options: {
        'mode': 'CONTAINS',
        'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
        'analyzed': 'true',
        'tokenization_enable_stemming': 'true',
        'tokenization_locale': 'en',
        'tokenization_skip_stop_words': 'true',
        'tokenization_normalize_lowercase': 'true'
      }
    },
]

I am executing multiple queries, namely: ranges * categories. Ranges is mostly 4, sometimes it can be 5. Categories is from 1 to 9 currently. After all queries are ready, I merge the results and return them.

Execution seems to be fast enough to make it practical. Searching through 100.000 posts for all 9 categories, aggregating and returning 507 results takes about 290 ms. Returning 1000 results takes about 600 ms. The query becomes faster when I set a lower limit for the number of max results per query. I do not need more than 200 results, so this seems a workable solution. :)

UPDATE 2

This is not a viable solution after all. There seem to be 2 problems:

  1. Sparse index can index up to 5 records per unique value. However, I did not notice this problem in practice and could index and return much more records per indexed value.

  2. The real blocker is that performance degrades when the search radius becomes smaller. With sufficiently small radius, the db starts to break down with errors. I have no explanation for this behaviour, but the conclusion is clear; sparse indexing of geohash bigints does not work reliably.