Closed. Panos512 closed this issue 8 years ago.
Can you provide test data and your code for `bd1xx.py:79` (`authors`)?
You can try to get the 1.0.1 behavior by setting `repeated=False` in https://github.com/inveniosoftware/dojson/blob/master/dojson/overdo.py#L141 and https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/to_marc21/model.py#L68.
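To illustrate what that flag changes, here is a minimal, self-contained sketch (not dojson's actual implementation; `build_record` and the sample fields are hypothetical): with `repeated=True` every occurrence of a field is collected into a list, so downstream rules iterate per occurrence, while `repeated=False` keeps a single value per key, which is the cheaper 1.0.1-like behavior.

```python
def build_record(fields, repeated=True):
    """Fold (tag, value) pairs into a record dict.

    repeated=True  -> accumulate every occurrence of a tag in a list.
    repeated=False -> keep only one value per tag (last match wins).
    """
    record = {}
    for tag, value in fields:
        if repeated:
            # Every matched field is appended, so rules must walk a list.
            record.setdefault(tag, []).append(value)
        else:
            # Single value per tag: no per-occurrence list handling.
            record[tag] = value
    return record

# Hypothetical MARC-like input: one 100 field, two 700 fields.
fields = [
    ("100", {"a": "Doe, J."}),
    ("700", {"a": "Roe, R."}),
    ("700", {"a": "Poe, E."}),
]
```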
Changing labels because, strictly speaking, this is not a bug (= wrong behavior) but rather an enhancement request (= speed).
@Panos512 please try to create JSON from your MARCXML first and then run the conversion:

```
$ cat data.xml | dojson -l marcxml -d json | dojson do marc21
```
Here is the `authors` code of our bd1xx.py:
https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/hep/fields/bd1xx.py#L33
And here is our test data:
https://github.com/inspirehep/inspire-next/blob/master/inspirehep/demosite/data/demo-records.xml.gz
I will let you know soon how inserting from the converted JSON files goes.
@jirikuncar Setting `repeated=False` did the trick. Does this somehow help to identify the problem?
The behavior with `repeated=True` is intended. You can write your own `marcxml` reader that returns `dict` instead of `GroupableOrderedDict`.
@jirikuncar just to be sure we fully understand: do you mean that using `dict` instead of `GroupableOrderedDict` in https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/utils.py#L85 (i.e. at the subfield level) is enough to fix our performance issues?
@kaplun It's more on top level (see my comment above https://github.com/inveniosoftware/dojson/issues/145#issuecomment-220347109).
@jirikuncar can you clarify a bit more? I am not sure I understand.
Just to be clear about our use case: what exactly do you suggest we do in order to avoid the performance issues when using `dojson>1.0.1`?
No. You can create your own `marcxml` loader that returns `dict`.
We have experienced extreme CPU usage on record insertion after updating dojson from 1.0.1 to 1.2.1. We pinpointed the problem: it occurs on all versions after 1.0.1.

On 1.0.1, 100 records take 2.5 seconds. `prun` output: https://gist.github.com/kaplun/42d593e74f8b0821c6a3a754cdcfa6ae

On 1.2.1, 100 records take 879 seconds. `prun` output: https://gist.github.com/kaplun/3f4dd3065e041af862b3bc16b69e371a

After `git bisect` we believe that the problem was introduced by one of the following commits: