matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

fulltext bug fixes, performance improvement and support json_value parser #20269

Open cpegeric opened 6 hours ago

cpegeric commented 6 hours ago

What type of PR is this?

Which issue(s) this PR fixes:

issue #20217 #20213 #20175 #20149

What this PR does / why we need it:

bug fixes for #20217 #20213 #20175

  1. Limit the batch size to 8192 in both the fulltext_index_scan() and fulltext_tokenize() functions.
  2. In the fulltext_index_scan() function, create a new thread that evaluates scores in batches of 8192 documents instead of waiting for all results from the SQL query. This speeds up the function and avoids OOM. However, scores are now calculated from each mini-batch rather than from the complete result set. I think this doesn't matter as long as we have the correct answer.
  3. Support the json_value parser.
  4. Pre-allocate memory in the fulltext_tokenize() function to avoid malloc.
  5. Add the monpl tokenizer repo to matrixone.
  6. Fix a bug in the json tokenizer so that it truncates values, and increase the limit to 127 bytes.
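
The batching in items 1 and 2 can be pictured with a minimal Go sketch. This is not the code in this PR: `doc`, `scoreBatch`, and `streamScores` are hypothetical names, the scorer is a placeholder, and only the 8192 batch size comes from the description above. It shows a producer reading rows and handing fixed-size batches to a separate goroutine that scores them as they arrive, so the full result set is never held in memory at once.

```go
package main

import "fmt"

// batchSize is the per-batch limit stated in the PR description.
const batchSize = 8192

// doc is a hypothetical stand-in for one row returned by the index SQL.
type doc struct {
	ID    int
	Count int // hypothetical per-document match count
}

// scoreBatch is a placeholder scorer (not the real ranking function).
// It normalizes each count by the batch total, so a score depends only
// on the mini-batch the document lands in, not on the full result set.
func scoreBatch(batch []doc) map[int]float64 {
	total := 0
	for _, d := range batch {
		total += d.Count
	}
	scores := make(map[int]float64, len(batch))
	for _, d := range batch {
		if total > 0 {
			scores[d.ID] = float64(d.Count) / float64(total)
		}
	}
	return scores
}

// streamScores groups incoming rows into batches of at most batchSize and
// scores each batch in a separate goroutine while the caller keeps reading.
// It returns how many documents were scored and how many batches were used.
func streamScores(rows <-chan doc) (int, int) {
	batchCh := make(chan []doc, 1)
	type result struct{ scored, batches int }
	resCh := make(chan result, 1)

	// Consumer: scores batches as soon as they are available.
	go func() {
		var r result
		for b := range batchCh {
			r.scored += len(scoreBatch(b))
			r.batches++
		}
		resCh <- r
	}()

	// Producer: the buffer is pre-allocated once per batch to avoid regrowth.
	buf := make([]doc, 0, batchSize)
	for d := range rows {
		buf = append(buf, d)
		if len(buf) == batchSize {
			batchCh <- buf
			buf = make([]doc, 0, batchSize)
		}
	}
	if len(buf) > 0 {
		batchCh <- buf // flush the final partial batch
	}
	close(batchCh)

	r := <-resCh
	return r.scored, r.batches
}

func main() {
	rows := make(chan doc, 64)
	go func() {
		for i := 0; i < 20000; i++ {
			rows <- doc{ID: i, Count: 1}
		}
		close(rows)
	}()
	scored, batches := streamScores(rows)
	fmt.Printf("scored %d documents in %d batches\n", scored, batches)
}
```

With 20000 input rows the sketch produces two full batches and one partial batch. Because the placeholder scorer only sees one batch at a time, it also illustrates the trade-off noted in item 2: per-batch scores can differ from scores computed over the complete result set.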