richardwilly98 / elasticsearch-river-mongodb

MongoDB River Plugin for ElasticSearch

mapping #75

Closed: andriuwe4ka closed this issue 11 years ago

andriuwe4ka commented 11 years ago

Is there a way to use my own mapping instead of the default one? For example: I have a collection with many fields per document, and I really only need two of them in the index, so I suppose indexing two fields will take less time than 15-20 fields.

I found a forked version of your river (https://github.com/gustavonalle/elasticsearch-river-mongodb), but I have little interest in the fork because of support, etc.

So is there a way to write my own mapping for the river?

I found issue #64 here, with a how-to (create the index in Elasticsearch first and create the river afterwards), but this does not appear to work, because the mapping shows all fields after the river is created =(

So, any comments?

andriuwe4ka commented 11 years ago

So, to give more detail. Creating the index:

curl -XPUT "localhost:9200/aaa" -d '{"settings":{"number_of_shards":1,"mapper":{"dynamic":false}},"mappings":{"main":{"properties":{"title":{"type":"string"}}}}}'

Creating the river:

curl -XPUT "localhost:9200/_river/aaa/_meta" -d '{"type":"mongodb","mongodb":{"db":"aaa_db","collection":"aaa_collection"},"index":{"name":"aaa","type":"main"}}'

After this, the mapping is what I want: only the title field. But after adding any record to aaa_collection (the river picks it up from the oplog and adds it to ES), the mapping already contains all fields =(

richardwilly98 commented 11 years ago

Hi,

Index and mapping can be created before the river. Does your custom mapping include all fields? Can you please provide a gist to reproduce your issue?

Thanks, Richard.

andriuwe4ka commented 11 years ago

I'll ask about providing one, but I think it's impossible =)

And to reproduce:

Mongo collection: {"title": string, "description": string, "timestamp": long, ...and as many as you wish}. I need to index only title and description (to begin with, just title), so (no indexes yet):

curl -XPUT "localhost:9200/aaa" -d '{"settings":{"number_of_shards":1,"mapper":{"dynamic":false}},"mappings":{"main":{"properties":{"title":{"type":"string"}}}}}'

Then it's time for the river:

curl -XPUT "localhost:9200/_river/aaa/_meta" -d '{"type":"mongodb","mongodb":{"db":"aaa_db","collection":"aaa_collection"},"index":{"name":"aaa","type":"main"}}'

If I get the mapping: curl -XGET "localhost:9200/aaa/main/_mapping?pretty=true"

It'll show: { "main" : { "properties" : { "title" : { "type" : "string" } } } }

Then I add something to the collection (all fields are non-empty)

and the index will use all of them, and the mapping changes to track all of the fields.

richardwilly98 commented 11 years ago

Hi,

One of my questions was: "Does your custom mapping include all fields?"

So in your scenario, we will need to remove the unwanted attributes using a script.

Look at the first example in the "Script Filters" section [1].

[1] - https://github.com/richardwilly98/elasticsearch-river-mongodb

Thanks, Richard.
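For reference, a river definition with such a script filter might look like the sketch below (reusing the index and collection names from earlier in this thread; only timestamp is deleted here, and further delete statements can be chained for any other unwanted field; see the plugin README for the exact option placement):

curl -XPUT "localhost:9200/_river/aaa/_meta" -d '{"type":"mongodb","mongodb":{"db":"aaa_db","collection":"aaa_collection","script":"delete ctx.document.timestamp;"},"index":{"name":"aaa","type":"main"}}'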

andriuwe4ka commented 11 years ago

Understood, thanks a lot :+1: Will try it now =)

andriuwe4ka commented 11 years ago

OK, the script is working, and the mapping contains only the fields I need =) But the _source in ES is smaller too (all the deleted fields are gone - nice).

So what should I do: get only the id, or is there some command (like delete or ignore) to store fields in ES without indexing them?

richardwilly98 commented 11 years ago

Look at custom mapping in Elasticsearch [1].

[1] - http://www.elasticsearch.org/guide/reference/mapping/source-field/
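If the goal is to keep every field retrievable from _source while indexing only a few of them, one option (a sketch reusing the index and type names from earlier in this thread; _source is enabled by default and is shown only for clarity) is to disable dynamic mapping at the type level, so that unmapped fields stay in _source but are not added to the mapping or indexed:

curl -XPUT "localhost:9200/aaa" -d '{"mappings":{"main":{"_source":{"enabled":true},"dynamic":false,"properties":{"title":{"type":"string"},"description":{"type":"string"}}}}}'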

andriuwe4ka commented 11 years ago

Yes, I saw it, but that's not what I meant: since the script contains "delete ctx.document.timestamp;", that field is no longer available in _source =(

richardwilly98 commented 11 years ago

Sorry, but I am not sure I understand your issue.

Thanks, Richard.

enrique-fernandez-polo commented 11 years ago

Hi

I am experiencing a similar problem. I have a field that I don't want to be analyzed, so I first create the index with the mapping information:

PUT http://localhost:9200/users

{
    "mappings" : {
        "default" : {
            "properties" : {
                "nickname" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}

And then I create the river:

PUT http://localhost:9200/_river/users/_meta

{
    "type": "mongodb",
    "mongodb": {
        "servers": [
            {
                "host": "127.0.0.1",
                "port": 27017
            }
        ],
        "options": {
            "secondary_read_preference": true,
            "drop_collection": true
        },
        "db": "users",
        "collection": "userApplication"
    },
    "index": {
        "name": "users",
        "type": "default"
    }
}

I am losing the mapping configuration, so the nickname field is analyzed and the search results are not the desired ones.

richardwilly98 commented 11 years ago

Hi,

Try disabling dynamic mapping [1]. See the example here [2].

Please let me know how it goes.

[1] - http://www.elasticsearch.org/guide/reference/mapping/dynamic-mapping/ [2] - https://gist.github.com/radu-gheorghe/4737210

Thanks, Richard.
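For reference, dynamic mapping can be disabled either per index through the index.mapper.dynamic setting or per type with "dynamic": false in the mapping itself. A sketch combining both for the users index above (the field list is shortened to the nickname field):

PUT http://localhost:9200/users

{
    "settings": {
        "index.mapper.dynamic": false
    },
    "mappings": {
        "default": {
            "dynamic": false,
            "properties": {
                "nickname": { "type": "string", "index": "not_analyzed" }
            }
        }
    }
}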

enrique-fernandez-polo commented 11 years ago

Hello again,

I am losing all of the index and mapping configuration when creating the river. Now I create the index like this:

PUT http://localhost:9200/users

{
    "settings" : {
        "mapper" : {
            "dynamic": false
        }
    },
    "mappings": {
        "default": {
            "properties": {
                "_class": {
                    "type": "string"
                },
                "applicationIdentifier": {
                    "type": "string"
                },
                "creationDate": {
                    "type": "date",
                    "format": "dateOptionalTime"
                },
                "nickname": {
                  "type": "string", "index":"not_analyzed"
                },
                "password": {
                    "type": "string"
                },
                "userCustomInformation": {
                    "type": "object"
                }
            }
        }
    }
}

To make sure the index was created correctly, I query the index settings:

GET http://localhost:9200/users/_settings

{
    "users": {
        "settings": {
            "index.number_of_shards": "5",
            "index.number_of_replicas": "1",
            "index.version.created": "900099",
            "index.mapper.dynamic": "false"
        }
    }
}

And after creating the river as in my last comment, the index configuration is lost:

{
    "users": {
        "settings": {
            "index.number_of_shards": "5",
            "index.number_of_replicas": "1",
            "index.version.created": "900099"
        }
    }
}

I've tried closing the index, disabling dynamic mapping, and reopening it, but I have the same issue. The mapping configuration is also lost.
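For reference, the close / update settings / reopen sequence mentioned above typically looks like the following sketch against the users index:

curl -XPOST "http://localhost:9200/users/_close"
curl -XPUT "http://localhost:9200/users/_settings" -d '{ "index.mapper.dynamic": false }'
curl -XPOST "http://localhost:9200/users/_open"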

richardwilly98 commented 11 years ago

Hi,

In the scenario above, the index settings return:

curl -XGET "http://localhost:9200/index75/_settings?pretty=true"
{
    "index75" : {
        "settings" : {
            "index.mapper.dynamic" : "false",
            "index.number_of_shards" : "5",
            "index.number_of_replicas" : "1",
            "index.version.created" : "900099"
        }
    }
}

richardwilly98 commented 11 years ago

Any update?

richardwilly98 commented 11 years ago

@andriuwe4ka I will close this issue due to inactivity. Please reopen it if needed.

abishekk92 commented 10 years ago

Hi @richardwilly98, I am facing a similar problem: I want to index only a subset of the fields and store the entire JSON using {_source : { enabled : true }}.

So I create my mapping with only the fields I want to index:

     {"mappings": {
             "facebook" : {
                 "dynamic" : "strict",
                "properties" : {
                    "post_id" : {
                        "type" : "string",
                        "store" : True,
                        "index" : "not_analyzed",
                        },
                    "text" : {
                        "type" : "string",
                        "store" : True,
                        "index" : "analyzed",},
                    "message" : {
                        "type" : "string",
                        "store" : True,
                        "index" : "analyzed",},
                    "brand_id" : {"type" : "integer"},
                    },
                },
            },
         }

And now, when I try to create the river with the following config:

    payload = {"type" : "mongodb",
               "mongodb": {
                   "db" : db,
                   "collection" : collection,
                   "secondary_read_preference" : True,
                   },
               "index" : {
                   "name" : index_name,
                   "type" : doc_type,
                   },
               }

The following is the stack trace I get while trying to create the river with the above config:

org.elasticsearch.index.mapper.StrictDynamicMappingException: mapping set to strict, dynamic introduction of [fb_post_type] within [facebook] is not allowed
        at org.elasticsearch.index.mapper.object.ObjectMapper.parseDynamicValue(ObjectMapper.java:628)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:618)
        at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:469)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:515)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:462)
        at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:392)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:394)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:153)

I tried using include_fields as suggested; however, that leads to the rest of the fields (i.e. the fields I don't have mappings for) not being stored as part of _source.
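For reference, include_fields is set in the river definition; a sketch, assuming it sits under mongodb.options like the other options shown in this thread (the river, db, collection, and index names below are placeholders):

curl -XPUT "localhost:9200/_river/facebook_river/_meta" -d '
{
    "type": "mongodb",
    "mongodb": {
        "db": "mydb",
        "collection": "mycollection",
        "options": {
            "include_fields": ["post_id", "text", "message", "brand_id"]
        }
    },
    "index": {
        "name": "myindex",
        "type": "facebook"
    }
}
'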

Also, going by applyAdvancedTransformation, the document with the transformation applied is sent to updateBulkRequest, so naturally those fields would be lost there too and can't be kept in _source, even if I delete the unnecessary fields using the script filter.

It would be great if you could suggest how to get mongodb-river to work so that I index only the fields I have mappings for and still store the remaining fields.

abishekk92 commented 10 years ago

Update: using {dynamic : false} in my mapping worked. Here is the final mapping I'm using:

    body = {"mappings": {
             "facebook" : {
                "dynamic" : False,
                "properties" : {
                    "post_id" : {
                        "type" : "string",
                        "store" : True,
                        "index" : "not_analyzed",
                        },
                    "text" : {
                        "type" : "string",
                        "store" : True,
                        "index" : "analyzed",},
                    "message" : {
                        "type" : "string",
                        "store" : True,
                        "index" : "analyzed",},
                    "brand_id" : {"type" : "integer"},
                    },
                },
            },
         }

mocheng commented 10 years ago

I met the same problem after I upgraded to mongodb-river 2.0.0. It is really frustrating!

After trying all the tricks mentioned in this thread, it still doesn't work.

For now, I have just switched back to version 1.6.8, which at least works well with MongoDB.

richardwilly98 commented 10 years ago

@mocheng, can you provide your configuration (river, index mapping)?

mocheng commented 10 years ago

@richardwilly98 My deployment versions are: MongoDB 2.4.3, Elasticsearch 1.0.0, Elasticsearch-MongoDB river 2.0.0.

The index mapping is created as:

curl -XPUT http://192.168.100.92:9200/beeper_v1 -d '
{
     "mappings": {
          "register": {
               "dynamic" : false,
               "properties": {
                    "score": {
                             "type": "integer"
                    },
                    "online": {
                              "type": "boolean"
                    },
                    "title": {
                             "type":"string",
                              "indexAnalyzer":"ik",
                              "searchAnalyzer":"ik"
                    },
                    "intro": {
                             "type":"string",
                              "indexAnalyzer":"ik",
                              "searchAnalyzer":"ik"
                    },
                    "area": {
                             "type":"string",
                              "indexAnalyzer":"ik",
                              "searchAnalyzer":"ik"
                    },
                    "loc" : {
                            "type" : "geo_point"
                    }
               }
          }
     }
}
'

The river is created as

curl -XPUT "192.168.100.92:9200/_river/beeper_river/_meta" -d '
{
     "type": "mongodb",
     "mongodb": {
          "servers": [
               { "host": "192.168.100.99", "port": 30000 }
          ],
          "options": { "secondary_read_preference": true },
          "db": "beeper",
          "collection": "register"
     },
     "index": {
          "name": "beeper",
          "type": "register"
     }
}
'

The MongoDB collection register has documents like the one below:

{
    "_id" : ObjectId("52d7a72dcaff4848e11200f5"),
    "av" : "100060000",
    "basic" : {
        "tel" : "13261805201",
        "pwd" : "111111",
        "pts" : 1389864749
    },
    "crc" : 1,
    "domain_name" : "6564491425",
    "hpts" : ISODate("2014-04-09T06:02:03.983Z"),
    "loc" : [
        "4.9E-324",
        "4.9E-324"
    ],
    "login" : true,
    "mid" : 2000000010,
    "model" : "GT-I9100",
    "oc" : 2,
    "ocre" : 1,
    "online" : false,
    "os" : "Android4.1.2",
    "personal" : {
        "name" : "Zhuzhu",
        "idcard" : "371421198609111760",
        "pts" : 1389864749
    },
    "score" : 1449332708.7142856,
    "src" : 2,
    "tpts" : 1390187366,
    "unread" : 24,
    "ur" : 4.428571428571429,
    "urc" : 7,
    "work" : {
        "area" : "Sheji",
        "intro" : "婚庆abc Xjjhvdfhnmndsshjnvzstjknbcsfjjjj",
        "pts" : 1389864749,
        "title" : "婚庆"
    }
}

After the river is created, the index mapping is changed to include all fields. Unfortunately, the IK analyzer is also replaced by the default string analyzer.

richardwilly98 commented 10 years ago

Did you try without the index alias?

mocheng commented 10 years ago

@richardwilly98 It works!!!

Originally, "beeper" was an alias of "beeper_v1". After changing "db" from "beeper" to "beeper_v1", it works!

Thank you so much!!!

curl -XPUT "192.168.100.92:9200/_river/beeper_river/_meta" -d '
{
     "type": "mongodb",
     "mongodb": {
          "servers": [
               { "host": "192.168.100.99", "port": 30000 }
          ],
          "options": { "secondary_read_preference": true },
          "db": "beeper_v1",
          "collection": "register"
     },
     "index": {
          "name": "beeper",
          "type": "register"
     }
}
'