mloesch / sickle

Sickle: OAI-PMH for Humans

add huge_tree=True to the XMLParser used for responses. #55

Open jiemakel opened 2 years ago

jiemakel commented 2 years ago

Without huge_tree=True, lxml parsing apparently fails on some responses that are only moderately large (apparently anything of more than 9.5MB).

Because the parser is also created with recover=True, this failure is silent from Sickle's point of view. I only noticed it because the failed parse also loses the resumption token, which ends the crawl early; that made me wonder why I had far fewer records than I should have had.
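For illustration, here is a minimal sketch of the kind of parser setup this is about. It uses lxml directly and is not Sickle's actual code: `make_parser` and `resumption_token` are hypothetical helper names, and the sample response is made up.

```python
from lxml import etree

# Namespace used by OAI-PMH responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def make_parser(huge_tree=True):
    """Build a lenient parser (hypothetical helper, not part of Sickle).

    recover=True makes lxml try to carry on after errors instead of raising;
    huge_tree=True lifts libxml2's built-in size limits.
    """
    return etree.XMLParser(
        remove_blank_text=True,
        recover=True,
        huge_tree=huge_tree,
    )


def resumption_token(content, parser):
    """Return the resumptionToken text from a response body, or None."""
    root = etree.XML(content, parser=parser)
    if root is None:
        return None
    element = root.find(".//{%s}resumptionToken" % OAI_NS)
    return element.text if element is not None else None


# A tiny, made-up ListRecords response for demonstration.
sample = (
    b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    b"<ListRecords><resumptionToken>page-2</resumptionToken></ListRecords>"
    b"</OAI-PMH>"
)

token = resumption_token(sample, make_parser())
```

If the token comes back as None on a page that should have had one, the harvest stops there, which is how the problem showed up for me.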

Alternatively, if one wanted to get fancy, one could make the XMLParser an optional parameter that is passed to Sickle and from there on down to the OAIResponse. This would allow people to customize for themselves what kind of XML parsing behaviour they want. For this PR, however, I opted for the simplest fix.
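A rough sketch of what that alternative could look like, assuming nothing about Sickle's real internals: `Harvester` and `Response` below are hypothetical stand-ins for the client and response classes, and the `parser` keyword is an assumed name, not an existing Sickle argument.

```python
from lxml import etree

# Default used when the caller does not supply a parser (an assumption
# for this sketch, mirroring the simple fix in this PR).
DEFAULT_PARSER = etree.XMLParser(recover=True, huge_tree=True)


class Response:
    """Hypothetical response wrapper that parses with a configurable parser."""

    def __init__(self, content, parser=None):
        self.content = content
        self.parser = parser if parser is not None else DEFAULT_PARSER

    @property
    def xml(self):
        # Parse the raw bytes with whichever parser was handed down.
        return etree.XML(self.content, parser=self.parser)


class Harvester:
    """Hypothetical client that forwards its parser to every response."""

    def __init__(self, endpoint, parser=None):
        self.endpoint = endpoint
        self.parser = parser

    def wrap(self, content):
        # In a real client this would follow an HTTP request; here the
        # content is passed in directly to keep the sketch self-contained.
        return Response(content, parser=self.parser)


# Usage: a caller who prefers loud failures can pass a strict parser.
strict = etree.XMLParser(recover=False, huge_tree=True)
harvester = Harvester("https://example.org/oai", parser=strict)
root = harvester.wrap(b"<a><b>x</b></a>").xml
```

With something like this, a strict parser would raise on malformed input instead of recovering quietly, at the cost of a slightly larger API surface.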