quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Terms #1069

Closed ahcm closed 3 years ago

ahcm commented 3 years ago

I opened a bug report in tantivy-cli because the benchmark command printed 4 header columns but only 3 values. The missing column was num_terms, which led me to add code to print the Terms (which, I was told on Gitter, differs from the original meaning). I was surprised to see 4 Terms for my test query "foobar", all printing an identical "foobar". I expected either only 1 term or four different ones like "Foobar", "FooBar".

      for _ in 0..num_repeat {
          for query_txt in &queries {
              let query = query_parser.parse_query(&query_txt).unwrap();
              let mut timing = TimerTree::default();
              let (_top_docs, count) = {
                  let _search = timing.open("search");
                  searcher
                      .search(&query, &(TopDocs::with_limit(10), Count))
                      .map_err(|e| {
                          format!("Failed while searching query {:?}.\n\n{:?}", query_txt, e)
                      })?
              };
              // The lines below are the ones I added to print the terms.
              use std::collections::BTreeSet;
              let mut terms_set: BTreeSet<_> = BTreeSet::new();
              query.query_terms(&mut terms_set);
              use tantivy::Term;
              let terms: Vec<&Term> = terms_set.iter().collect();
              for term in terms {
                  println!("{:?}", String::from_utf8_lossy(term.value_bytes()));
              }
              println!("{}\t{}\t{}\t{}", query_txt, terms_set.len(), count, timing.total_time());
          }
      }

fulmicoton commented 3 years ago

A Term is the value of a token targeting a specific field. By default, the query parser targets all indexed fields (this is configurable).

You probably have 4 fields that are indexed.

You can get the field via term.field(). The name of the field can then be fetched via schema.get_field_name(field)
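
To illustrate the point above, here is a minimal sketch of how one unqualified query token can turn into one term per targeted field. It is a toy model using only the standard library: ToyTerm, ToySchema, expand_query and describe_terms are hypothetical names made up for this example, not tantivy's types or API. The four field names are the ones from the schema shared in this thread.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for a search term: a token bound to one field.
/// Hypothetical type for illustration, not tantivy's `Term`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct ToyTerm {
    field: usize, // position of the field in the toy schema
    text: String,
}

/// Toy schema: only the field names, in declaration order.
struct ToySchema {
    field_names: Vec<&'static str>,
}

impl ToySchema {
    /// Look up a field's name from its position.
    fn field_name(&self, field: usize) -> &str {
        self.field_names[field]
    }
}

/// Expand an unqualified query token into one term per default field.
fn expand_query(token: &str, default_fields: &[usize]) -> BTreeSet<ToyTerm> {
    default_fields
        .iter()
        .map(|&field| ToyTerm { field, text: token.to_string() })
        .collect()
}

/// Render each term as "field_name:text".
fn describe_terms(schema: &ToySchema, terms: &BTreeSet<ToyTerm>) -> Vec<String> {
    terms
        .iter()
        .map(|t| format!("{}:{}", schema.field_name(t.field), t.text))
        .collect()
}

fn main() {
    let schema = ToySchema { field_names: vec!["uri", "title", "body", "date"] };
    // All four fields are default targets, so "foobar" yields four terms.
    let terms = expand_query("foobar", &[0, 1, 2, 3]);
    for line in describe_terms(&schema, &terms) {
        println!("{}", line);
    }
}
```

In this model the term text is identical across fields, so printing only the text shows the same value four times. The field is what tells the terms apart.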

ahcm commented 3 years ago

Thanks!

That indeed gives me one term per field: uri:foobar title:foobar body:foobar date:foobar

Querying uri or date directly with uri:foobar or date:foobar gives no matches, though. Why do they still turn up in the Terms list for these fields?

fulmicoton commented 3 years ago

They should show up, yes. Can you share your schema?
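
One possible reading of the question above is that the term list is built from the query text and its target fields alone, while matching depends on what the index actually holds. The sketch below shows that distinction in a toy model using only the standard library. ToyIndex, query_terms and doc_count are hypothetical names for this example, and the index contents are made up. It is not a description of tantivy's internals.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy inverted index: (field name, token) -> ids of documents containing it.
/// Hypothetical structure for illustration, not tantivy's index layout.
type ToyIndex = BTreeMap<(String, String), BTreeSet<u32>>;

/// Terms a query asks about, built from the query token and target fields only.
/// The index is never consulted here.
fn query_terms(token: &str, fields: &[&str]) -> Vec<(String, String)> {
    fields
        .iter()
        .map(|f| (f.to_string(), token.to_string()))
        .collect()
}

/// Number of documents the index holds for one term (0 if the term is absent).
fn doc_count(index: &ToyIndex, term: &(String, String)) -> usize {
    index.get(term).map_or(0, |docs| docs.len())
}

fn main() {
    let mut index = ToyIndex::new();
    // Made-up data: only `title` and `body` contain the token "foobar".
    index
        .entry(("title".into(), "foobar".into()))
        .or_default()
        .insert(1);
    index
        .entry(("body".into(), "foobar".into()))
        .or_default()
        .extend([1, 2]);

    // Every targeted field contributes a term, even when nothing matches it.
    for term in query_terms("foobar", &["uri", "title", "body", "date"]) {
        println!("{}:{} -> {} docs", term.0, term.1, doc_count(&index, &term));
    }
}
```

In this model a term can appear in the query's term list with zero matching documents, because the list reflects what the query asks for rather than what the index contains.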

ahcm commented 3 years ago

So if I want to get all Terms that are present in a field, do I have to iterate over the Terms checking for a match in that field?

  "schema": [
    {
      "name": "uri",
      "type": "text",
      "options": {
        "indexing": {
          "record": "basic",
          "tokenizer": "raw"
        },
        "stored": true
      }
    },
    {
      "name": "title",
      "type": "text",
      "options": {
        "indexing": {
          "record": "position",
          "tokenizer": "en_stem"
        },
        "stored": true
      }
    },
    {
      "name": "body",
      "type": "text",
      "options": {
        "indexing": {
          "record": "position",
          "tokenizer": "en_stem"
        },
        "stored": true
      }
    },
    {
      "name": "date",
      "type": "text",
      "options": {
        "indexing": {
          "record": "basic",
          "tokenizer": "raw"
        },
        "stored": true
      }
    }
  ],