mosuka / blast

Blast is a full text search and indexing server, written in Go, built on top of Bleve.
Apache License 2.0

ignore fields for index #24

Closed klausondrag closed 5 years ago

klausondrag commented 5 years ago

Hi,

In an attempt to reduce the index size, I preprocessed the data. However, when I use the API, I want to get the human-readable data back so I can show it to the user. Is there a way to exclude fields when building an index? I tried using "x" for the preprocessed field and "_x" for the original text. Unfortunately, this increased the index size by a lot, so I believe the field starting with "_" was not excluded. Is there a way to do this? My only other idea is to build a wrapper API that stores the text under an ID in a dictionary and then returns it. But this seems like something that should already be supported.

mosuka commented 5 years ago

@klausondrag I have returned to this project. Sorry for the late reply. Could you please give me more detailed examples, such as your current index mapping? Thanks,

klausondrag commented 5 years ago

Not a problem. My original issue was the following. This is an example result from the README:

{
  "_type": "enwiki",
  "contributor": "unknown",
  "text_en": "A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information which must be consulted, akin to other techniques for managing information overload. The most public, visible form of a search engine is a Web search engine which searches for information on the World Wide Web.",
  "timestamp": "2018-07-04T05:41:00Z",
  "title_en": "Search engine (computing)"
}

Because I want to decrease the index size and generalize better, I preprocess the text by removing stop words and stemming the remaining words. The resulting field "text_en" becomes:

search engin inform retriev system design help inform store comput system . The search result usual present list commonli hit . Search engin help minim requir inform amount inform consult , akin techniqu manag inform overload . The public , visibl search engin Web search engin search inform World Wide Web .

Originally, I wanted to expose Blast's REST API to the internet, which led to the following requirement: Blast must search in the processed field, BUT the original field must be returned, because this text will be displayed to the user. So I was looking for a way to add both texts to Blast but only add the processed field to the index. I know I could add both to the index and search in only one, but that increases the index size too much. Ideally, I wanted something like this:

{
  "_type": "enwiki",
  "contributor": "unknown",
  "_text_en": "A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information which must be consulted, akin to other techniques for managing information overload. The most public, visible form of a search engine is a Web search engine which searches for information on the World Wide Web.",
  "text_en": "search engin inform retriev system design help inform store comput system . The search result usual present list commonli hit . Search engin help minim requir inform amount inform consult , akin techniqu manag inform overload . The public , visibl search engin Web search engin search inform World Wide Web .",
  "timestamp": "2018-07-04T05:41:00Z",
  "title_en": "Search engine (computing)"
}

where the fields leading with an underscore ("_text_en") do not get added to the index but do get returned.

In the end, I solved my problem by adding only the preprocessed fields and setting up Redis to look up the original text by key. I created a wrapper API in Python that first queries Blast, then looks up the resulting keys in Redis, and returns the original text. This works fine because I need the Redis instance elsewhere as well.
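
A rough sketch of that wrapper flow, for illustration only: `search_blast` is a hypothetical stand-in for the HTTP request to Blast, and a plain dict stands in for Redis. Neither is taken from the actual wrapper.

```python
from typing import Dict, List

# Stand-in for Redis (assumption): maps a document ID to the original text.
ORIGINAL_TEXT: Dict[str, str] = {
    "enwiki_1": "A search engine is an information retrieval system designed "
                "to help find information stored on a computer system.",
}


def search_blast(query: str) -> List[dict]:
    """Hypothetical stand-in for a search request against Blast.

    Returns hits that carry only the document ID and the preprocessed text.
    """
    hits = [{"id": "enwiki_1", "text_en": "search engin inform retriev system"}]
    return [hit for hit in hits if query in hit["text_en"].split()]


def search_with_original_text(query, search=search_blast, store=ORIGINAL_TEXT):
    """Search the preprocessed index, then attach the original text by ID."""
    results = []
    for hit in search(query):
        # The lookup by key is where the real wrapper would query Redis.
        results.append({"id": hit["id"], "text_en": store.get(hit["id"])})
    return results


results = search_with_original_text("engin")
```

The caller only ever sees the original text; the stemmed text stays internal to the search step.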

Now that I think of it, maybe it could have been solved with the index-mapping-file?

mosuka commented 5 years ago

Hi @klausondrag ,

index = true creates index data and makes the field searchable (as well as sortable and facetable). index = false does not create index data, so the field cannot be searched.

store = true means that the original text is stored in the index, so you can retrieve the field at search time. store = false does not store the original text in the index, so you cannot retrieve the field at search time.

...
        "_text_en": {
          "enabled": true,
          "dynamic": true,
          "fields": [
            {
              "type": "text",
              "analyzer": "en",
              "store": true,
              "index": false,
              "include_term_vectors": false,
              "include_in_all": false
            }
          ],
          "default_analyzer": "en"
        },
        "text_en": {
          "enabled": true,
          "dynamic": true,
          "fields": [
            {
              "type": "text",
              "analyzer": "en",
              "store": false,
              "index": true,
              "include_term_vectors": true,
              "include_in_all": false
            }
          ],
          "default_analyzer": "en"
        },
...

It seems you can achieve something similar with the settings above. What do you think of this suggestion?
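
The effect of the two flags can be illustrated with a toy model. This is not Bleve or Blast code; `ToyIndex` and its methods are made up for the example, and it only mimics the behavior described above: `index` decides whether a field is searchable, `store` decides whether its original value comes back with a hit.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of per-field `index` and `store` flags (not Bleve itself)."""

    def __init__(self, mapping):
        self.mapping = mapping
        self.postings = defaultdict(set)  # (field, term) -> set of doc IDs
        self.stored = defaultdict(dict)   # doc ID -> {field: original value}

    def add(self, doc_id, doc):
        for field, value in doc.items():
            flags = self.mapping.get(field, {})
            if flags.get("index"):
                # Indexed fields contribute terms to the inverted index.
                for term in value.lower().split():
                    self.postings[(field, term)].add(doc_id)
            if flags.get("store"):
                # Stored fields keep their original value for retrieval.
                self.stored[doc_id][field] = value

    def search(self, field, term):
        doc_ids = sorted(self.postings.get((field, term.lower()), set()))
        return [{"id": d, "fields": dict(self.stored.get(d, {}))} for d in doc_ids]


mapping = {
    "_text_en": {"index": False, "store": True},
    "text_en": {"index": True, "store": False},
}
idx = ToyIndex(mapping)
idx.add("enwiki_1", {
    "_text_en": "A search engine is an information retrieval system",
    "text_en": "search engin inform retriev system",
})
hits = idx.search("text_en", "engin")
```

Searching `text_en` finds the document and returns only the stored `_text_en` value, while searching `_text_en` directly finds nothing because that field was never indexed.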

klausondrag commented 5 years ago

Hi @mosuka, this looks really great and does even more than what I wanted. I will re-evaluate Redis at some point. Thank you very much!