michaelrsweet / mxml

Tiny XML library.
https://www.msweet.org/mxml
Apache License 2.0
426 stars 157 forks source link

How to read values if possible CDATA #307

Closed ejurgensen closed 6 months ago

ejurgensen commented 6 months ago

First of all thanks for making and maintaining mxml. It's been used by OwnTone (previously forked-daapd) for many years, and has been really solid.

I take care of OwnTone, where this issue came recently. It involves mxml parsing RSS XML with a tag that has CDATA. This creates a bug, since OwnTone just reads values with mxmlGetOpaque, which doesn't return anything when it's CDATA. So now I am seeking advice on what might be the best way to read a value, given that it is not known whether the input is CDATA or not.

Right now I'm thinking I should add a convience function like the below getStringValue(). However, it seems a bit crude to me. Is there a better way to do this?

#define CDATAXML "<?xml version=\"1.0\" ?><foo>FOO</foo><bar><![CDATA[BAR]]></bar>"

const char *
getStringValue(mxml_node_t *xml, const char *key)
{
  mxml_node_t *parent = mxmlFindElement(xml, xml, key, NULL, NULL, MXML_DESCEND);
  mxml_node_t *child  = mxmlGetFirstChild(parent);
  mxml_type_t type    = mxmlGetType(child);

  if (type == MXML_ELEMENT)
    return mxmlGetCDATA(child);

  return mxmlGetOpaque(child);
}

int
main(int argc, char *argv[])
{
  mxml_node_t *xml = mxmlLoadString(NULL, CDATAXML, MXML_OPAQUE_CALLBACK);

  printf("foo is '%s'\n", getStringValue(xml, "foo"));
  printf("bar is '%s'\n", getStringValue(xml, "bar"));
  return 0;
}
michaelrsweet commented 6 months ago

🤷‍♂️ That's probably as clean as you'll be able to make it.

ejurgensen commented 5 months ago

Fwiw I ended up with these wrappers:

mxml_node_t *
xml_get_node(mxml_node_t *top, const char *path)
{
  mxml_node_t *node;
  mxml_type_t type;

  // This example shows why we can't just return the result of mxmlFindPath:
  // <?xml version="1.0""?><rss>
  //    <channel>
  //        <title><![CDATA[Tissages]]></title>
  // mxmlFindPath(top, "rss/channel") will return an OPAQUE node where the
  // opaque value is just the whitespace. What we want is the ELEMENT parent,
  // because that's the one we can use to search for children nodes ("title").
  node = mxmlFindPath(top, path);
  type = mxmlGetType(node);
  if (type == MXML_ELEMENT)
    return node;

  return mxmlGetParent(node);
}

// Walks through the children of the "path" node until it finds one that is
// not just whitespace and returns a trimmed value (except for CDATA). Means
// that these variations will all give the same result:
//
// <foo>FOO FOO</foo><bar>\nBAR BAR \n</bar>
// <foo>FOO FOO</foo><bar><![CDATA[BAR BAR]]></bar>
// <foo>\nFOO FOO\n</foo><bar>\n<![CDATA[BAR BAR]]></bar>
const char *
xml_get_val(mxml_node_t *top, const char *path)
{
  mxml_node_t *parent;
  mxml_node_t *node;
  mxml_type_t type;
  const char *s = "";

  parent = xml_get_node(top, path);
  if (!parent)
    return NULL;

  for (node = mxmlGetFirstChild(parent); node; node = mxmlGetNextSibling(node))
    {
      type = mxmlGetType(node);
      if (type == MXML_OPAQUE)
    s = trim(mxmlGetOpaque(node));
      else if (type == MXML_ELEMENT)
        s = mxmlGetCDATA(node);

      if (s && *s != '\0')
    break;
    }

  return s;
}