Indexing slows down with large numbers, e.g. unixtime numbers.

oramasearch / orama

🌌 A complete search engine and RAG pipeline in your browser, server or edge network with support for full-text, vector, and hybrid search in less than 2kb.

https://docs.orama.com

Other

8.61k stars 290 forks source link

Indexing slows down with large numbers, e.g. unixtime numbers. #428

Closed hastebrot closed 9 months ago

hastebrot commented 1 year ago

I have 2000 documents with text that each are not larger than 65536 bytes. When I add just a single field with a number, e.g. using createdAt: new Date(doc.createdAt).getTime(), it slows down insert() considerably. interestingly, low numbers are fine.

indexing of these documents with number: 14 seconds, without numbers: 3 seconds.

hastebrot commented 1 year ago

As a work-around I can use strings instead with createdAt: new Date(doc.createdAt).toISOString(),.

when I do a search and sort the ISO dates it will return the same result as with unixtime:

  const result3 = await search(docs, {
    sortBy: {
      property: "createdAt",
      order: "DESC",
    },
    limit: 100,
  });

micheleriva commented 1 year ago

I believe I am the one to blame for this. I opted for implementing AVL trees to improve performance at search time, but slowing down index time too much.

I will take some time to test better alternatives (maybe red-black trees?) to see if we can also improve indexing time.

hastebrot commented 1 year ago

A practical, performant, and easy-to-implement alternative to AVL and RB trees are Zip Trees [1] by Robert Tarjan. Maybe they are useful.

[1] Tarjan, R. E., Levy, C., & Timmel, S. (2021). Zip trees, https://arxiv.org/abs/1806.06726

micheleriva commented 1 year ago

Thanks a lot, @hastebrot! I'll look into this next week.

micheleriva commented 1 year ago

Started working on it: https://github.com/oramasearch/orama/pull/430

rocwang commented 1 year ago

With v1.1.1, I see the same issue if one indexed document field is typed as number.

I have about 80k documents. Each document is pretty small and has a popularity field indicating a popularity score.

If I leave it as number, the indexing time via insertMultiple would be about 25s.
If I convert it to string padded with 0 on the left, the indexing time would just be about 3 to 4 seconds.

As per my experiment, it looks like the issue is not related to how big the number is though.

As I intended to do indexing in browser, this issue was almost a deal-breaker for me until I found the workaround in this page. I really wish #430 would fix it.

robertpiosik commented 10 months ago

I found out that what slows down indexing is the distance between numbers, in my tests 20k objects with timestamps spanning 10 years indexes properly whereas 2 years makes orama unusable. Another example is indexing autoincremented ids.

robertpiosik commented 9 months ago

Hi @micheleriva, any progress on this? 🙏

allevo commented 9 months ago

We discovered an issue that slows the insertion when the numbers are ascendingly distributed. We fixed it, and the last release includes that fix. Could you try with the latest version?

robertpiosik commented 9 months ago

@allevo works like a charm, thanks!

allevo commented 9 months ago

Perfect! Happy to hear!

I'm closing this issue; thanks for your patience.