linkeddata / rdflib.js

Linked Data API for JavaScript
http://linkeddata.github.io/rdflib.js/doc/

The parsing of larger TTL files seems to take a big performance hit from v1.2.x on #419

Open · HugaertsDries opened this issue 4 years ago

HugaertsDries commented 4 years ago

When trying to upgrade from v1.1.x to v1.2.x, I noticed a big performance hit when parsing files larger than 100 kB (122.9 kB, to be exact). Did something change in how files should be parsed?

The code used is a variant of the following:

import { graph as rdflibGraph, parse as rdflibParse } from 'rdflib';

const SOURCE_GRAPH = 'http://data.lblod.info/graphs/submission';

export function parse(sourceTtl) {
    let store = rdflibGraph();
    rdflibParse(sourceTtl, store, SOURCE_GRAPH, 'text/turtle');
    return store;
}
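
For reference, a minimal harness to time the parse could look like this (the fixture path is a placeholder; Turtle parsing runs synchronously here, so timing around the call is enough):

import { readFileSync } from 'fs';
import { graph as rdflibGraph, parse as rdflibParse } from 'rdflib';

const SOURCE_GRAPH = 'http://data.lblod.info/graphs/submission';

// Load a >100 kB Turtle fixture (placeholder path).
const sourceTtl = readFileSync('./fixtures/submission.ttl', 'utf-8');

const store = rdflibGraph();
console.time('parse');
rdflibParse(sourceTtl, store, SOURCE_GRAPH, 'text/turtle');
console.timeEnd('parse');
console.log(`${store.statements.length} statements`);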

Thx in advance!

megoth commented 4 years ago

Hmm, I suspect this might be tied to my changes in https://github.com/linkeddata/rdflib.js/commit/6d6284f2a18a98b8fad38a3ad812650f074507d2 =\ Are you able to test that?

@timbl Maybe you have capacity?

timbl commented 4 years ago

The additional call to canon() you suggested sounds like it could be it.

HugaertsDries commented 4 years ago

@megoth any suggestions on how I could test it?

ericprud commented 4 years ago

You could comment out the calls to canon and the update to this.index in src/store.ts in this function:

  add (
    subj: Quad_Subject | Quad | Quad[] | Statement | Statement[],
    pred?: Quad_Predicate,
    obj?: Term | string,
    why?: Quad_Graph
  ): Quad | null | IndexedFormula ...
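
Alternatively, if you want to gauge the cost of canon() without editing the sources, you could wrap it on a store instance before parsing. This is a rough probe, reusing sourceTtl and SOURCE_GRAPH from the first example, and assuming only that canon is an instance method on the store:

import { graph, parse } from 'rdflib';

const store = graph();

// Wrap canon() to count calls and accumulate the time spent in it during a parse.
// (Date.now() is coarse; the call count alone is often the more telling number.)
const originalCanon = store.canon.bind(store);
let canonCalls = 0;
let canonMs = 0;
store.canon = (term) => {
  const t0 = Date.now();
  const result = originalCanon(term);
  canonMs += Date.now() - t0;
  canonCalls += 1;
  return result;
};

parse(sourceTtl, store, SOURCE_GRAPH, 'text/turtle');
console.log(`canon(): ${canonCalls} calls, ~${canonMs} ms total`);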

If you want to play around directly with the JS (avoiding the Babel step), you could look in lib/store.js for

    key: "add",
    value: function add(subj, pred, obj, why) ...

If you're in a browser, you may want to disable minification by adding this to webpack.config.js:

optimization: {minimize: false},
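
That is, inside the exported config object:

// webpack.config.js -- keep the bundle readable for debugging
module.exports = {
  // ...existing entry/output/module settings...
  optimization: { minimize: false },
};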

TommasoBianchi commented 4 years ago

Hi everyone, any update on this? I'm also experiencing significant performance hits (up to 10x slower) when parsing large RDF/XML files (tens of MB).

For instance, the NAL thesaurus (https://agclass.nal.usda.gov/downloads/NAL_Thesaurus_2020_SKOS.zip?agree3=on&image.x=45&image.y=15) takes more than 2 minutes to parse on my laptop, while it used to take 20–30 seconds on previous versions (I was on 1.0.6 before upgrading to 1.2.2).
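
A sketch of the kind of call being timed (the file path and base URI are placeholders; the callback form of parse is used to be safe across serializations):

import { readFileSync } from 'fs';
import { graph, parse } from 'rdflib';

const BASE = 'http://example.org/nalt'; // placeholder base URI

const xml = readFileSync('./NAL_Thesaurus_2020_SKOS.xml', 'utf-8');
const store = graph();

const start = Date.now();
parse(xml, store, BASE, 'application/rdf+xml', () => {
  console.log(`Parsed ${store.statements.length} statements in ${Date.now() - start} ms`);
});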

TommasoBianchi commented 4 years ago

Up.