Extreme CPU usage after updating to versions later than 1.0.1

inveniosoftware / dojson

Simple pythonic JSON to JSON converter.

https://dojson.readthedocs.io

Other

10 stars 29 forks source link

Closed Panos512 closed 8 years ago

Panos512 commented 8 years ago

We have experienced extreme CPU usage on records insertion after updating the verion of dojson from 1.0.1 to 1.2.1.

We pinpointed the problem and it occurs on all versions after 1.0.1.

After git bisect we believe that the problem occurs on one of the following commits:

jirikuncar commented 8 years ago

Can you provide test data and you code for bd1xx.py:79(authors)?

tiborsimko commented 8 years ago

Changing labels because sensu stricto this is not a bug (=wrong behaviour), rather an enhancement request (=speed).

jirikuncar commented 8 years ago

@Panos512 please try to create JSON from your MARCXML first and then run the conversion.

$ cat data.xml | dojson -l marcxml -d json | dojson do marc21

Panos512 commented 8 years ago

I will inform you about inserting from converted json files soon.

Panos512 commented 8 years ago

@jirikuncar Setting repeated=False did the trick. Is this helpful somehow to identify the problem?

jirikuncar commented 8 years ago

The behavior with repeated=True is intended. You can write your own marcxml reader that returns dict instead of GroupableOrderedDict.

kaplun commented 8 years ago

@jirikuncar to be sure we are fully understanding: do you mean that by using dict instead of GroupableOrderedDict in https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/utils.py#L85 (i.e. at the subfield level) is enough to fix our performance issues?

jirikuncar commented 8 years ago

kaplun commented 8 years ago

@jirikuncar can you clarify a bit more? I am not sure I am understanding.

Just to be clear in our usecase:

our MARCXML use-case needs to preserve older of fields, but does not need to preserve order of subfields.
we don't need the marc21 contrib, as we are not based on marc21

So, what do you exactly suggest us we should do in order not to have performance issues when using dojson>1.0.1?

jirikuncar commented 8 years ago

No you can create your own marcxml loader that returns dict.