inveniosoftware / dojson

Simple pythonic JSON to JSON converter.
https://dojson.readthedocs.io
Other
10 stars 29 forks source link

Extreme CPU usage after updating to versions later than 1.0.1 #145

Closed Panos512 closed 8 years ago

Panos512 commented 8 years ago

We have experienced extreme CPU usage on records insertion after updating the verion of dojson from 1.0.1 to 1.2.1.

We pinpointed the problem and it occurs on all versions after 1.0.1.

On 1.0.1 for 100 records it takes 2.5 seconds : Prun output: https://gist.github.com/kaplun/42d593e74f8b0821c6a3a754cdcfa6ae

On 1.2.1 for 100 records it takes 879 seconds: Prun output: https://gist.github.com/kaplun/3f4dd3065e041af862b3bc16b69e371a

After git bisect we believe that the problem occurs on one of the following commits:

jirikuncar commented 8 years ago

Can you provide test data and you code for bd1xx.py:79(authors)?

You can try to get 1.0.1 behavior by setting repeated=False in https://github.com/inveniosoftware/dojson/blob/master/dojson/overdo.py#L141 and https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/to_marc21/model.py#L68.

tiborsimko commented 8 years ago

Changing labels because sensu stricto this is not a bug (=wrong behaviour), rather an enhancement request (=speed).

jirikuncar commented 8 years ago

@Panos512 please try to create JSON from your MARCXML first and then run the conversion.

$ cat data.xml | dojson -l marcxml -d json | dojson do marc21
Panos512 commented 8 years ago

Here is the authors code of our bd1xx.py: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/hep/fields/bd1xx.py#L33

And here is our test data: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/demosite/data/demo-records.xml.gz

I will inform you about inserting from converted json files soon.

Panos512 commented 8 years ago

@jirikuncar Setting repeated=False did the trick. Is this helpful somehow to identify the problem?

jirikuncar commented 8 years ago

The behavior with repeated=True is intended. You can write your own marcxml reader that returns dict instead of GroupableOrderedDict.

kaplun commented 8 years ago

@jirikuncar to be sure we are fully understanding: do you mean that by using dict instead of GroupableOrderedDict in https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/utils.py#L85 (i.e. at the subfield level) is enough to fix our performance issues?

jirikuncar commented 8 years ago

@kaplun It's more on top level (see my comment above https://github.com/inveniosoftware/dojson/issues/145#issuecomment-220347109).

kaplun commented 8 years ago

@jirikuncar can you clarify a bit more? I am not sure I am understanding.

Just to be clear in our usecase:

So, what do you exactly suggest us we should do in order not to have performance issues when using dojson>1.0.1?

jirikuncar commented 8 years ago

No you can create your own marcxml loader that returns dict.