indexdata / yaz

Z39.50 toolkit for C
http://www.indexdata.com/yaz
Other
43 stars 19 forks source link

Using xmlReadMemory instead of xmlParseMemory #91

Closed gjactat closed 1 year ago

gjactat commented 1 year ago

We encountered errors while querying a Solr server through Pazpar2. The problem was linked to empty nodes not being handled in the "yaz_solr_decode_facet_field" method.

Instead of applying the same recipe as found in other methods (testing node->type == XML_ELEMENT_NODE), I've tried to load the document with the XML_PARSE_NOBLANKS option).

This option removes the empty text nodes that happens when the Solr XML response is indented (line feeds between elements).

Of course, it's up to you to decide what is safer (testing nodes types or loading without blanks). But anyway, we had errors that crashed Pazpar2 with the following Solr server : https://api.archives-ouvertes.fr/search/

An example of Solr response that caused issues can be found here.

adamdickmeiss commented 1 year ago

Thx for this. We are using xmlParseMemory in most placesm so I think it's better to fix the decoder methods. All SRU/SRU/SOlr decoder methods should do be able to,

gjactat commented 1 year ago

In fact, xmlParseMemory is deprecated in favor of xmlReadMemory. I think I'll keep this change on my Yaz cloned repo. I'm not confident I could spot each and every line where the decoding could fail because of blanks nodes.

adamdickmeiss commented 1 year ago

In fact, xmlParseMemory is deprecated in favor of xmlReadMemory. I think I'll keep this change on my Yaz cloned repo. I'm not confident I could spot each and every line where the decoding could fail because of blanks nodes.

But you are using the method with options (which is not enabled by default) ... In particular the original "document" from Solr would have whitespace removed. In other words, to fix a problem with decoding of facets you ALSO remove whitespace from the document.

adamdickmeiss commented 1 year ago

Closing as decode methods are fixed in #93

gjactat commented 1 year ago

In fact, xmlParseMemory is deprecated in favor of xmlReadMemory. I think I'll keep this change on my Yaz cloned repo. I'm not confident I could spot each and every line where the decoding could fail because of blanks nodes.

But you are using the method with options (which is not enabled by default) ... In particular the original "document" from Solr would have whitespace removed. In other words, to fix a problem with decoding of facets you ALSO remove whitespace from the document.

To be clear, "XML_PARSE_NOBLANKS" does not remove all spaces from the document (which would be quite useless). It just ignores the empty text nodes induced by line feeds between xml element nodes. The C# XDocument parser ignores them by default.

Thank you for your changes #93. I have tested your changes and they fix the issue. Guillaume

adamdickmeiss commented 1 year ago

In fact, xmlParseMemory is deprecated in favor of xmlReadMemory. I think I'll keep this change on my Yaz cloned repo. I'm not confident I could spot each and every line where the decoding could fail because of blanks nodes.

But you are using the method with options (which is not enabled by default) ... In particular the original "document" from Solr would have whitespace removed. In other words, to fix a problem with decoding of facets you ALSO remove whitespace from the document.

To be clear, "XML_PARSE_NOBLANKS" does not remove all spaces from the document (which would be quite useless). It just ignores the empty text nodes induced by line feeds between xml element nodes. The C# XDocument parser ignores them by default.

Yep. But it's desirable to keep the whitespaces for readability.. the Solr did exactly that.

Thank you for your changes #93. I have tested your changes and they fix the issue. Guillaume

Welcome.

FWIW, even with XML_PARSE_NOBLANKS in use one could easily construct a "hostile" response , for example by populating foreign elements not known by the Solr decoder.

Eg

               <lst name="date">
                 <int name="1978">5</int>
               </lst>\n"
               <foreign>x</foreign>

This is not happening for a well-behaved Solr server of course. But it's foreign input and so must be considered.