shibukawa / oktavia

Full text search engine for JS environments
MIT License

Position shifts of the underline of hit words #23

Open tuchida opened 10 years ago

tuchida commented 10 years ago

I built a search index as shown below and searched for 'いう'. However, 'うえ' was underlined instead.

I think the offset equals the number of times addEndOfBlock is called.

// simulation of HTMLParser
var oktavia = new Oktavia();
var section = oktavia.addSection('section');
oktavia.addBlock('tag');
oktavia.addEndOfBlock();
oktavia.addWord('あいうえお');
section.setTail('/hoge.html');
oktavia.build();

// simulation of OktaviaSearchRuntime#search
var queryParser = new QueryStringParser();
var queries = queryParser.parse('いう');
var results = oktavia.search(queries);
console.log(results.result.units[0].positions); // { '2': { word: 'いう', position: 2, stemmed: true } }
console.log(results.result.units[0].startPosition); // 0

The content does not include Oktavia.eob, so position 2 in the index points at 'う' in the content:

var content = oktavia.getPrimaryMetadata().getContent(results.result.units[0].id);
console.log(content, content.length); // あいうえお 5
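
The shift can be reproduced without Oktavia. The following is only a minimal sketch, assuming the index stores one marker character per addEndOfBlock call in front of the text while the stored content omits it. `EOB`, `indexText`, and `highlight` are hypothetical names for illustration and are not part of Oktavia's API.

```javascript
// Hypothetical stand-in for Oktavia.eob; the actual character value is an assumption.
const EOB = '\u0001';

// Text as the index is assumed to see it: one marker per addEndOfBlock call, then the word.
function indexText(eobCount, word) {
  return EOB.repeat(eobCount) + word;
}

// Return the part of `content` that would be underlined for a hit at `position`.
function highlight(content, position, length) {
  return content.slice(position, position + length);
}

const word = 'あいうえお';
const query = 'いう';
const eobCount = 1;

const indexed = indexText(eobCount, word);
const position = indexed.indexOf(query); // counted with the marker included
const content = word;                    // the content has no marker

// Using the index position directly underlines the wrong characters.
const shifted = highlight(content, position, query.length);
// Subtracting the number of markers restores the intended span.
const corrected = highlight(content, position - eobCount, query.length);

console.log(position, shifted, corrected);
```

Under these assumptions the uncorrected span is 'うえ' and the corrected one is 'いう', which matches the behavior reported above.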

HTMLParser makes the calls in this order:

  1. Call addEndOfBlock on start tag.
  2. Call addWord on text node.

Which one is wrong, the position or the content?

tuchida commented 10 years ago

For example, if Oktavia.eof is replaced with a space, the position is not shifted. https://github.com/tuchida/oktavia/compare/replace_eob_to_space

tuchida commented 10 years ago

The web runtime already replaces Oktavia.eof with a space. https://github.com/shibukawa/oktavia/blob/8ed26702c90994b621dc426dd16f84debe728799/src/oktavia-web-runtime.jsx#L269

So I think this approach is better. https://github.com/tuchida/oktavia/compare/content_with_eob

tuchida commented 10 years ago

The regular expression did not have the global flag, so I added it.
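
To illustrate why the global flag matters here, the following is a small sketch in plain JavaScript. It assumes the content keeps one marker character per block and that the marker is later turned into a space for display. `EOB` is a hypothetical stand-in for the real constant, and the sample string is made up.

```javascript
// Hypothetical stand-in for the block marker; the actual character value is an assumption.
const EOB = '\u0001';

// Made-up content that keeps its markers, as after two addEndOfBlock calls.
const content = EOB + 'あい' + EOB + 'うえお';

// Without the global flag, String.prototype.replace changes only the first match.
const firstOnly = content.replace(new RegExp(EOB), ' ');

// With the global flag, every marker becomes a space.
const all = content.replace(new RegExp(EOB, 'g'), ' ');

console.log(firstOnly.includes(EOB)); // a marker is still present
console.log(all.includes(EOB));       // no marker remains
console.log(all.length === content.length);
```

Because each marker is swapped for exactly one space, the string length is unchanged, so positions computed against the marked text still line up with the displayed text.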