olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Slow Indexing Performance - to be expected? #305

Closed Knacktus closed 6 years ago

Knacktus commented 6 years ago

First-time user here. Indexing performance (Builder.build()) is quite slow. I'm not sure if it's to be expected, but my gut feeling tells me that I might be doing something wrong.

First the numbers, then the data, and then my code:

- 1000 rows take 300 ms -> pretty fast
- 10000 rows take 10 seconds -> hmm, clearly worse than linear, but still OK
- 30000 rows take 60 seconds -> too much, especially as I expect production data as large as 100000 rows

Each row looks like this:

{
  "id": 105695,
  "prodId": "ABC",
  "lastImportTimestamp": "2017-08-31-13.19.26.668000",
  "werk": "77",
  "positionId": 6253886,
  "umfang": "GDFC",
  "takt": "DSDGSDFSDF",
  "pfeil": "10",
  "lfdnrZenta": 542016652,
  "benennung": "Name D,INNEN",
  "strukturZusatzbenennung": null,
  "produktstruktur": "CD",
  "teileGueltigkeit": "+KFISDGJSD",
  "copTeil": null,
  "seTeam": "35543 ",
  "positionBemerkung": null,
  "updateTimestamp": "2017-05-30-11.35.35.271000",
  "plattformkennzeichen": null,
  "lfdnrSort": "1166",
  "abweichungBemerkung": "laut MAL vom 12.05.2017",
  "gueltigkeitstyp": "GLOBAL",
  "gueltBereichPosId": "6253886",
  "gueltigVon": null,
  "gueltigBis": null,
  "pfzgId": null,
  "daisyNr": null,
  "daisyStatus": null,
  "endtermin": "2017-12-31 00:00:00",
  "abweichgungBenennung": "AAAA Qualifizierung",
  "aendKategorieBezeichnung": "Qualifizierung",
  "tnrVornummer": "123",
  "tnrMittelnummer": "123",
  "tnrEndgruppe": "ABC",
  "tnrIndex": null,
  "teilenummer": ".123.123.ABC.",
  "zsbKennzeichen": "-",
  "farbabhaengigkeit": "N",
  "zeichnungsDatum": "2016-08-19 00:00:00",
  "stand": null,
  "genstandNr": null,
  "genstandKennzeichen": null,
  "genstandKey": null,
  "jtFileName": null,
  "kstand": null,
  "isVisible": false
}

My code is pretty standard (I hope). I'm only indexing a few fields:

    // Builder here is lunr.Builder from the lunr package
    // (e.g. const { Builder } = require("lunr");)
    const builder = new Builder();
    builder.field("id");
    builder.field("benennung");
    builder.field("teilenummer");
    builder.field("tnrVornummer");
    builder.field("tnrMittelnummer");
    builder.field("tnrEndgruppe");
    builder.field("tnrIndex");
    builder.field("seTeam");

    builder.ref("id");

    // for (let i = 0; i < this.data.length; i++) {
    for (let i = 0; i < 10000; i++) {
      const data = this.data[i];
      builder.add(data);
    }

    // This is what takes that long
    this.index = builder.build();

I've also tried the convenience function (lunr) and converting the id to a string beforehand, but the results are the same.
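
For reference, here is a minimal, self-contained sketch of that convenience-function variant with a timer around it. This is illustrative only: rows and buildIndex are placeholder names, the field list just mirrors the code above, and console.time is one way to measure, not necessarily how the numbers above were produced.

    const lunr = require("lunr");

    // rows is assumed to be the parsed array of row objects shown above.
    function buildIndex(rows, count) {
      // Note: lunr() sets up the default text-processing pipeline (trimmer,
      // stop word filter, stemmer), and both adding and building happen inside
      // the call, so the timer covers both steps.
      console.time("lunr build");
      const index = lunr(function () {
        this.ref("id");
        this.field("id");
        this.field("benennung");
        this.field("teilenummer");
        this.field("tnrVornummer");
        this.field("tnrMittelnummer");
        this.field("tnrEndgruppe");
        this.field("tnrIndex");
        this.field("seTeam");

        for (let i = 0; i < count && i < rows.length; i++) {
          this.add(rows[i]);
        }
      });
      console.timeEnd("lunr build");
      return index;
    }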

Any info on whether this is to be expected, or hints about where I'm going wrong, would be greatly appreciated.

EDIT: I've attached a Chrome profile file. I don't have much experience with profiling, but it might help.

Profile-20171013T061452.json.gz

Cheers,

Jan

olivernn commented 6 years ago

Thanks for the profile, super useful.

I'm actually surprised by the results. It looks like the bulk of the time is spent creating the vectors used to represent the documents, specifically in the function that calculates the inverse document frequency for terms, lunr.idf. As you can see, it doesn't do very much, just some maths.

This is both good and bad: the code is simple, so there is less to understand when trying to hunt down bottlenecks, but it also means there isn't much more than basic code in there to optimise. I'd imagine that a healthy dose of caching might yield a significant improvement, but that is just speculation at the moment.
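
To illustrate the kind of caching being speculated about (an illustrative sketch, not lunr's internals): the inverse document frequency depends only on how many documents contain a term and on the total document count, so repeated calls with the same inputs can reuse a stored result instead of recomputing the logarithm each time.

    // Illustrative memoised idf-style helper. The formula is a standard
    // BM25-style idf and may not match lunr's exact implementation.
    const idfCache = new Map();

    function cachedIdf(documentsWithTerm, documentCount) {
      const key = documentsWithTerm + "/" + documentCount;
      if (idfCache.has(key)) return idfCache.get(key);

      const x = (documentCount - documentsWithTerm + 0.5) / (documentsWithTerm + 0.5);
      const idf = Math.log(1 + Math.abs(x));

      idfCache.set(key, idf);
      return idf;
    }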

I wouldn't expect indexing to be slow; I'd hope it would be roughly linear in the number of documents/terms being added.

olivernn commented 6 years ago

I've just published a new version of Lunr (2.1.4) which includes a small change to cache the calculation made by lunr.idf during build time. In my local testing this showed a decent speed-up; let me know if it has a positive impact on your data set.
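
If anyone wants to confirm which version they are running before re-timing, lunr exposes a version string (buildIndex in the comment below refers to the illustrative sketch earlier in this thread, not to anything in lunr):

    const lunr = require("lunr");

    console.log("lunr version:", lunr.version); // expect "2.1.4" or later

    // Then repeat the earlier measurement, e.g. the 30000-row case:
    // buildIndex(rows, 30000);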

emirotin commented 6 years ago

I have a similar problem, but this time it's more about a memory allocation failure (though also about speed). Let me know if you'd like a separate issue for that.

I have a reduced test case. You can find it here: https://github.com/emirotin/lunr-failure. You need to untar the JSON dump file before running the code.

The error looks like this:

<--- Last few GCs --->

[60857:0x103801000]   212952 ms: Mark-sweep 1401.1 (1498.6) -> 1401.1 (1498.6) MB, 2263.3 / 0.0 ms  allocation failure GC in old space requested
[60857:0x103801000]   215324 ms: Mark-sweep 1401.1 (1498.6) -> 1401.1 (1467.6) MB, 2372.4 / 0.0 ms  last resort GC in old space requested
[60857:0x103801000]   217551 ms: Mark-sweep 1401.1 (1467.6) -> 1401.1 (1467.6) MB, 2226.3 / 0.0 ms  last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x1c7bd5ea5e71 <JSObject>
    2: add [/Users/eugene.mirotin/work/chgk-db-search/spider/node_modules/lunr/lunr.js:2162] [bytecode=0x1c7bb7f6ee01 offset=427](this=0x1c7b10958231 <JSObject>,doc=0x1c7b9b3e4559 <Object map = 0x1c7b58d14bb1>)
    3: /* anonymous */ [/Users/eugene.mirotin/work/chgk-db-search/spider/build-index.js:~31] [pc=0x2a1a2fb9007](this=0x1c7b1094a5b9 <Stream map = 0x1c7b76e56161>,doc=0x1c7b9b3e4559 <Objec...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/opt/local/bin/node]
 2: node::FatalException(v8::Isolate*, v8::Local<v8::Value>, v8::Local<v8::Message>) [/opt/local/bin/node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [/opt/local/bin/node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/opt/local/bin/node]
 5: v8::internal::Factory::NewUninitializedFixedArray(int) [/opt/local/bin/node]
 6: v8::internal::(anonymous namespace)::ElementsAccessorBase<v8::internal::(anonymous namespace)::FastHoleyObjectElementsAccessor, v8::internal::(anonymous namespace)::ElementsKindTraits<(v8::internal::ElementsKind)3> >::ConvertElementsWithCapacity(v8::internal::Handle<v8::internal::JSObject>, v8::internal::Handle<v8::internal::FixedArrayBase>, v8::internal::ElementsKind, unsigned int, unsigned int, unsigned int, int) [/opt/local/bin/node]
 7: v8::internal::(anonymous namespace)::ElementsAccessorBase<v8::internal::(anonymous namespace)::FastHoleyObjectElementsAccessor, v8::internal::(anonymous namespace)::ElementsKindTraits<(v8::internal::ElementsKind)3> >::GrowCapacityAndConvertImpl(v8::internal::Handle<v8::internal::JSObject>, unsigned int) [/opt/local/bin/node]
 8: v8::internal::(anonymous namespace)::FastElementsAccessor<v8::internal::(anonymous namespace)::FastHoleyObjectElementsAccessor, v8::internal::(anonymous namespace)::ElementsKindTraits<(v8::internal::ElementsKind)3> >::AddImpl(v8::internal::Handle<v8::internal::JSObject>, unsigned int, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, unsigned int) [/opt/local/bin/node]
 9: v8::internal::JSObject::AddDataElement(v8::internal::Handle<v8::internal::JSObject>, unsigned int, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow) [/opt/local/bin/node]
10: v8::internal::Runtime::SetObjectProperty(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::LanguageMode) [/opt/local/bin/node]
11: v8::internal::Runtime_SetProperty(int, v8::internal::Object**, v8::internal::Isolate*) [/opt/local/bin/node]
12: 0x2a1a2e0463d
13: 0x2a1a2eede06
[1]    60857 abort      node build-index.js

Interestingly enough:

So in both cases, having multi-language support enabled makes it more robust?

olivernn commented 6 years ago

@emirotin interesting, also thanks for the reproduction. How much memory was the node process consuming before it died? From what I understand, this error is raised when trying to allocate a large object. Looking at your stack trace, this appears to happen when adding to the inverted index, so perhaps you've reached a limit on how much can be stored in the inverted index without increasing the amount of memory you give to node.
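
For anyone hitting the same wall, a quick way to check the current V8 heap limit from inside the Node process is sketched below; this is general Node advice rather than anything specific to lunr, and the 4096 MB figure in the comment is just an example.

    // Print the current old-space heap limit in MB. If building the index needs
    // more, the process can be started with a larger limit, e.g.:
    //   node --max-old-space-size=4096 build-index.js
    const v8 = require("v8");
    const limitMb = v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
    console.log("heap limit: " + Math.round(limitMb) + " MB");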

I think it's probably worth opening a new issue for this; any resolution is unlikely to be directly related to the original issue.

emirotin commented 6 years ago

Thanks for the prompt response @olivernn, moved!

olivernn commented 6 years ago

@Knacktus how much of an impact did the latest version of Lunr have on your index building speed? Any improvements?

olivernn commented 6 years ago

I think this has gone stale; closing for now, but feel free to comment and I'll re-open.