masumsoft / express-cassandra

Cassandra ORM/ODM/OGM for NodeJS with support for Apache Cassandra, ScyllaDB, Datastax Enterprise, Elassandra & JanusGraph.
http://express-cassandra.readthedocs.io
GNU Lesser General Public License v3.0

geo_point mapping #135

Closed measwel closed 6 years ago

measwel commented 6 years ago

In the docs for elassandra, the following is mentioned for mapping a geo_point:

| Elasticsearch Types | CQL Types | Comment |
| --- | --- | --- |
| geo_point | UDT geo_point or text | Built-In User Defined Type (1) |

The references (1) and (2) in the documentation seem to be reversed. See: http://elassandra.readthedocs.io/en/latest/mapping.html?highlight=geo_point

For the time being, I have implemented the following:

    ormOptions: {
      // let express-cassandra create the elasticsearch index
      manageESIndex: true,
      // user defined data types
      udts: {
        geopoint: {
          'lat': 'float',
          'lon': 'float'
        }
      }
    }

And in my model I have:

  ....
    location: {
      type: 'frozen',
      typeDef: '<geopoint>'
    }
  },
  key: ["useruuid", "created"],
  clustering_order: {"created": "desc"},
  es_index_mapping: {
    discover: '.*',
    properties: {
      // make sure description is analyzed in the index
      "description": {
        "type": "text",
        "index": "analyzed"
      },
      "location": {
        "type": "geo_point"
      }
    }
  }

The index is created, but the mapping is not right. The description field in the index is mapped to a keyword instead of analyzed text. The location field is also not mapped to a geo_point as specified.

                    "description": {
                        "type": "keyword",
                        "cql_collection": "singleton"
                    },

UPDATE:

It seems that the mapping is created correctly if I omit the fields in the cassandra table and only specify them in the es index properties object. These fields then also get created in the cassandra table. This is confusing, as the documentation states:

To overwrite some properties, you could do the following:

This suggests that you can have fields specified for the cassandra table AND overwrite them in the properties specification. But in my case, the mapping only works when I specify the fields solely in the Elasticsearch properties.
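
For reference, a minimal sketch of the workaround described above: the description and location columns are left out of the Cassandra fields block and declared only in the es_index_mapping properties, so the columns get derived from the index mapping instead. Field names here mirror my model and are illustrative, not a confirmed recipe.

    // Sketch: columns omitted from `fields`, declared only in the ES index properties.
    module.exports = {
      fields: {
        useruuid: 'uuid',
        created: 'timestamp'
        // description and location intentionally omitted here;
        // they get created in the cassandra table from the mapping below
      },
      key: ['useruuid', 'created'],
      clustering_order: { created: 'desc' },
      es_index_mapping: {
        discover: '.*',
        properties: {
          description: {
            type: 'text',
            index: true
          },
          location: {
            type: 'geo_point'
          }
        }
      }
    };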

masumsoft commented 6 years ago

@measwel for me the index is actually created and I didn't face any errors while searching, though the location type went into the index as a nested object instead of a geo_point type. Maybe you are accessing the model instances before they get initialized. I'm sharing my code here. Though I'm using the Async-suffixed promise based functions, the callback based methods also work. But the geo_point type defined in the properties is not having any effect on the Elassandra index. It looks like an Elassandra issue, as the index properties are transferred to Elassandra as-is from express-cassandra.

index.js

var models = require('express-cassandra');

// Tell express-cassandra to use the models directory, and
// use bind() to load the models using cassandra configurations.
models.setDirectory(__dirname + '/models').bind(
    {
        clientOptions: {
            contactPoints: ['127.0.0.1'],
            protocolOptions: { port: 9042 },
            keyspace: 'mykeyspace',
            queryOptions: { consistency: models.consistencies.one },
            elasticsearch: {
                host: 'http://localhost:9200',
                apiVersion: '5.5',
                requestTimeout: 1000 * 60 * 60,
                sniffOnStart: true,
            }
        },
        ormOptions: {
            defaultReplicationStrategy: {
                class: 'NetworkTopologyStrategy',
                dc1: 1
            },
            migration: 'alter',
            manageESIndex: true,
            udts: {
                geo_point: {
                    'lat': 'float',
                    'lon': 'float'
                },
            }
        }
    },
    function (err) {
        if (err) throw err;

        console.log('model sync done!');
        var Person = models.instance['Person'];
        var person = new Person({
            useruuid: models.uuid(),
            name: 'Masum',
            age: 32,
            location: {
                lat: 23.233441,
                lon: -90.244535
            },
            created: { $db_function: 'toTimestamp(now())' }
        });
        person.saveAsync()
            .then(() => {
                return Person.searchAsync({ q: 'name: Masum' });
            })
            .then((response) => {
                console.log(response.hits.hits);
                models.close();
            })
            .catch((err) => {
                console.log(err);
                models.close();
            });
    }
);

models/PersonModel.js

module.exports = {
    fields: {
        useruuid: 'uuid',
        name: 'text',
        age: 'int',
        location: {
            type: 'frozen',
            typeDef: '<geo_point>'
        },
        created: 'timestamp'
    },
    key: ['name', 'created'],
    clustering_order: { created: 'desc' },
    es_index_mapping: {
        discover: '.*',
        properties: {
            name: {
                "type": "text",
                "index": true
            },
            location: {
                "type": "geo_point"
            }
        }
    }
}

measwel commented 6 years ago

The problem is:

  1. When you specify a field in the cassandra table, this field gets mapped to the index and ES properties are ignored! That is why our location field gets mapped as a nested object.

  2. When you omit the field in the cassandra table, the field gets mapped from the properties and a column is ALSO created in the cassandra table. Try omitting the location field in the table definition. Then you will get a proper geo_point field in the index, and in the table you will get: location list<frozen<geo_point>>!

I am trying to figure out how to deal with this. For the description field, if I omit it in the table definition, I get a description list column (a list of keywords, it seems). But I want to store the whole text here and have the keywords in ES only. On the other hand, if I include description in the table, then in the ES index it gets mapped as 'keyword' instead of analyzed text...

measwel commented 6 years ago

Another problem I am facing is that Elasticsearch keeps dropping connections: HEAD http://localhost:9200/myindex => socket hang up

If anybody knows how to remedy this problem, I would like to hear it.
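
Not a confirmed fix, but one thing that might be worth trying (an assumption on my part): the legacy elasticsearch JavaScript client accepts connection tuning options such as keepAlive, maxRetries and requestTimeout, and the config shown earlier in this thread already passes requestTimeout and sniffOnStart through the elasticsearch block of clientOptions, so the rest may be forwarded the same way.

    // Hypothetical tuning of the Elasticsearch client options, assuming
    // express-cassandra forwards them as-is to the elasticsearch client.
    clientOptions: {
        // ...cassandra options as before...
        elasticsearch: {
            host: 'http://localhost:9200',
            apiVersion: '5.5',
            requestTimeout: 1000 * 60,  // fail slow requests after 60s
            keepAlive: true,            // reuse sockets instead of reopening them
            maxRetries: 5,              // retry transient connection drops
            sniffOnStart: false         // sniffing can surface dead-node errors; try disabling
        }
    }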

measwel commented 6 years ago

I currently have a table in the db with the following fields:

CREATE TABLE arbotti.post (
   ...
    description text,
    descriptionkeywords list<text>,
    location list<frozen<geo_point>>,
   ...
    PRIMARY KEY (useruuid, created)
) ...

Which is the result of this mapping:

    properties: {
      "descriptionkeywords": {
        "type": "text",
        "index": true
      },
      "location": {
        "type": "geo_point"
      }
    }

The question is: how do I insert into these cassandra fields:

    descriptionkeywords list<text>
    location list<frozen<geo_point>>

This is the data I provide to be saved in the cassandra table:

{ ...
  description: 'Some text',
  descriptionkeywords: [ 'Some text' ], <-- array with 1 text value
  location: [ { lat: 52.582014799999996, lon: 13.4913122 } ] <-- array with 1 geopoint
}

But the fields descriptionkeywords and location are not saved in the cassandra table. They are also not saved in the ES index.

I believe that the mapping is correct. The location field has to be a true geo_point in the ES index, otherwise geospatial search will not work. The same is true for the description, which has to be analyzed text. The fields are there, but I cannot insert into them with the express-cassandra driver. I would like to avoid writing to the index directly if possible.
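
One possible direction, purely as a sketch and not a confirmed fix: if the generated columns are also declared in the model as list types, express-cassandra should accept plain JavaScript arrays for them on save. Whether this then clashes with the mapping behaviour described above (ES properties being ignored for declared columns) is exactly the open question here. The Post model name is illustrative.

    // Sketch: declaring the generated columns as list types in the model,
    // assuming express-cassandra then accepts arrays on save.
    module.exports = {
      fields: {
        useruuid: 'uuid',
        created: 'timestamp',
        description: 'text',
        descriptionkeywords: {
          type: 'list',
          typeDef: '<text>'
        },
        location: {
          type: 'list',
          typeDef: '<frozen<geo_point>>'
        }
      },
      key: ['useruuid', 'created'],
      clustering_order: { created: 'desc' }
    };

    // saving would then look like:
    var post = new models.instance.Post({
      useruuid: models.uuid(),
      created: { $db_function: 'toTimestamp(now())' },
      description: 'Some text',
      descriptionkeywords: ['Some text'],
      location: [{ lat: 52.5820148, lon: 13.4913122 }]
    });
    post.save(function (err) {
      if (err) console.log(err);
    });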

measwel commented 6 years ago

I have thought about all this and I am seriously considering not using Elassandra. I need to be able to perform full text search and geospatial search, and that is the only reason for wanting the ES index in the first place.

I am considering implementing full text search with a SASI index in cassandra and geospatial search using geohashes, also in cassandra. If I can do this, then I will have no need for ES and can use only the cassandra DB with the excellent express-cassandra driver (for which I thank you from my heart!).

UPDATE 1:

  const query = `CREATE CUSTOM INDEX post_description_idx ON arbotti.post(description)
            USING 'org.apache.cassandra.index.sasi.SASIIndex'
            WITH OPTIONS = {
                'mode': 'CONTAINS',
                'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
                'analyzed': 'true',
                'tokenization_enable_stemming': 'true',
                'tokenization_locale': 'en',
                'tokenization_skip_stop_words': 'true',
                'tokenization_normalize_lowercase': 'true'
            };`;

Works fine on the latest bitnami/cassandra image. :) Looks like full text search will be possible.
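
For completeness, a CONTAINS-mode SASI index with an analyzer should serve full text lookups through a LIKE query. A sketch of running such a raw query via express-cassandra's execute_query, assuming a Post model mapped to arbotti.post (the model name and search term are illustrative):

    // Sketch: LIKE query against the SASI-indexed description column
    const query = "SELECT * FROM arbotti.post WHERE description LIKE '%search term%';";
    models.instance.Post.execute_query(query, [], function (err, result) {
      if (err) return console.log(err);
      console.log(result.rows);
    });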

Now for the geospatial search...

measwel commented 6 years ago

Good day,

I am facing a problem when using the ORM express-cassandra. Each time I restart the server, the driver runs queries to create tables and indexes, even if the data model has not changed. These queries do not work if I do not drop the keyspace beforehand. Should the driver really execute create queries if the model has not changed? I have tried different values for migration, but they seem to have no effect. How can I make sure the queries do not run when not needed?

masumsoft commented 6 years ago

When you restart your app, the way for express-cassandra to know whether anything has changed is to query cassandra to read the current db state and match that state with your model schema. If it finds that nothing has changed, it doesn't do any migration. So it shouldn't run create queries if nothing has changed. It should only run some select queries against the system tables to read the current db state.

Could you share the case where it's happening? I mean the model schema and how you initialized it.

measwel commented 6 years ago

The problem could be in my code. I see now that my code is not executing, because I emit an event only when the database is updated. When the db is not updated, the event is not emitted and the rest of my code does not run. Is there a way to test whether the update has taken place or not?

The function I use looks like this:

module.exports.updateDatabase = function (messageBus) {
  logger.log('info', 'Starting arbotti database update');
  models.setDirectory(path.join(__dirname, 'models')).bind(
  { ... and so on ...
masumsoft commented 6 years ago

So you want to programmatically test whether any db schema was updated in the model sync operation? Right now express-cassandra does not pass this information into the schema sync callback, but it should be fairly straightforward to implement that in express-cassandra. I'll try to make the necessary changes and include the updates in the next release.

measwel commented 6 years ago

Come to think of it, it would be useful to know if the db schema was updated, but the foremost requirement for me is to know when the routine has finished executing (whether it updated the db or not). When I run this:

models.setDirectory(path.join(__dirname, 'models')).bind(
  {
    clientOptions: {
      contactPoints: [config.contactPoints],
      protocolOptions: {port: config.cassandraPort},
      keyspace: config.keyspace,
      queryOptions: {consistency: models.consistencies.one},
      authProvider: new models.driver.auth.PlainTextAuthProvider(config.cassandra_user, config.cassandra_password)
    },
    ormOptions: {
      // If your keyspace doesn't exist it will be created automatically
      // using the default replication strategy provided here.
      createKeyspace: true,
      defaultReplicationStrategy: {
        'class': 'NetworkTopologyStrategy',
        'DC1': 1
      },
      migration: 'alter'
    }
  },
  function (err) {
    if (err) {
      logger.log('error', 'Error updating database!', err);
    } else {
      messageBus.emit('database_updated'); // send message that db is ready
    }
  }
);

The event signalling the end of the db update operation is never emitted, because the whole migration routine is not run if there are no db changes. I wonder if there is another way to find out when the routine above has finished executing.
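
One workaround I can think of, only as an untested sketch: wrap the bind call in a Promise so the startup code can wait on the callback itself rather than on an event that is only emitted on a schema change. This assumes the bind callback fires whether or not a migration was needed, as described above; models, path and logger come from the earlier code.

    // Sketch of a promise-based wrapper around the model sync.
    // syncModels() resolves once express-cassandra has finished its checks,
    // regardless of whether a migration actually ran.
    function syncModels(options) {
      return new Promise(function (resolve, reject) {
        models.setDirectory(path.join(__dirname, 'models')).bind(options, function (err) {
          if (err) return reject(err);
          resolve();
        });
      });
    }

    // usage: startup code waits here instead of listening for an event
    // syncModels(bindOptions)
    //   .then(() => logger.log('info', 'model sync finished'))
    //   .catch((err) => logger.log('error', 'Error updating database!', err));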

measwel commented 6 years ago

Dear Masum,

Oftentimes when the schema has changed and the database update routine is executed I get the following error:

    stty: when specifying an output style, modes may not be set
    /node_modules/readline-sync/lib/read.sh: 49: /node_modules/readline-sync/lib/read.sh: cannot create /dev/tty: No such device or address
    stty: 'standard input': Inappropriate ioctl for device
    Error: The current environment doesn't support interactive reading from TTY.
    stty: when specifying an output style, modes may not be set

When faced with this error, I need to drop the schema and rerun the update routine. Is it known what causes this problem and is there a remedy?

masumsoft commented 6 years ago

Any schema migration that requires deletion of data asks for confirmation from the user in the terminal. This is an extra safety feature to avoid accidental deletion of data.

Note that some environments might not support a TTY console, so asking for confirmation in the terminal may throw errors. You can set the property disableTTYConfirmation: true in the ormOptions. This will run the migrations without asking you for confirmation.
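
For example, it would sit alongside the other ormOptions you already use (a sketch based on the options shown earlier in this thread):

    ormOptions: {
      createKeyspace: true,
      defaultReplicationStrategy: {
        'class': 'NetworkTopologyStrategy',
        'DC1': 1
      },
      migration: 'alter',
      // skip the interactive prompt in environments without a TTY
      disableTTYConfirmation: true
    }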

measwel commented 6 years ago

Hello Masum,

What would be the best way to implement transparent db field encryption in the free version of cassandra? Ideally, I would like to encrypt a field automatically when writing it to the db or querying for it and decrypt it automatically when reading it.

Should I do this on application level, or is there a better way?

masumsoft commented 6 years ago

To handle reads and writes you could use a virtual field that aliases your original db field and does the encryption work:

// PersonModel.js
module.exports = {
    fields: {
        name_encrypted: "varchar",
        name : {
            "type": "varchar",
            "virtual" : {
                get: function() {
                    return your_decryption_logic(this.name_encrypted);
                },
                set: function(name) {
                    this.name_encrypted = your_encryption_logic(name);
                }
            }
        }
    }
}

This setup lets your codebase access name as a virtual field. The virtual field reads from and writes to the underlying db field name_encrypted, so the value stays encrypted in the db while your code works with name as an alias that is never saved to the db itself.

To query against the name field you could do something like this:

models.instance.Person.find({ name_encrypted: your_encryption_logic(value) }, function(err, person){

});
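
The your_encryption_logic / your_decryption_logic helpers above are placeholders; here is one possible (hypothetical) implementation using Node's built-in crypto module. Note that equality lookups like the find above only work with deterministic encryption, so this sketch uses a fixed IV, which trades away some security (identical plaintexts produce identical ciphertexts). The ENCRYPTION_SECRET environment variable is illustrative.

    const crypto = require('crypto');

    // 32-byte key derived from a secret; a fixed 16-byte IV makes the output
    // deterministic, which equality lookups need but which leaks plaintext equality.
    const KEY = crypto.scryptSync(process.env.ENCRYPTION_SECRET || 'change-me', 'salt', 32);
    const IV = Buffer.alloc(16, 0);

    function your_encryption_logic(plaintext) {
      const cipher = crypto.createCipheriv('aes-256-cbc', KEY, IV);
      return cipher.update(String(plaintext), 'utf8', 'base64') + cipher.final('base64');
    }

    function your_decryption_logic(ciphertext) {
      const decipher = crypto.createDecipheriv('aes-256-cbc', KEY, IV);
      return decipher.update(ciphertext, 'base64', 'utf8') + decipher.final('utf8');
    }
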
measwel commented 6 years ago

Thank you very much Masum !

masumsoft commented 6 years ago

You're welcome. Closing this thread as it has started to contain discussions not directly related to the title. Let's create a new issue or Stack Overflow question to discuss other topics.

Extarys commented 6 years ago

I love this approach, thanks!