`p5.XML.listChildren()` inserts `#text` unexpectedly

processing / p5.js

p5.js is a client-side JS platform that empowers artists, designers, students, and anyone to learn to code and express themselves creatively on the web. It is based on the core principles of Processing. http://twitter.com/p5xjs —

http://p5js.org/

GNU Lesser General Public License v2.1

21.1k stars 3.22k forks source link

`p5.XML.listChildren()` inserts `#text` unexpectedly #7047

Open nickmcintyre opened 1 month ago

nickmcintyre commented 1 month ago

Most appropriate sub-area of p5.js?

[ ] Accessibility
[ ] Color
[ ] Core/Environment/Rendering
[ ] Data
[ ] DOM
[ ] Events
[ ] Image
[X] IO
[ ] Math
[ ] Typography
[ ] Utilities
[ ] WebGL
[ ] Build process
[ ] Unit testing
[ ] Internationalization
[ ] Friendly errors
[ ] Other (specify if possible)

p5.js version

1.9.3

Web browser and version

Chrome 124.0.6367.208

Operating system

macOS 14.4.1

Steps to reproduce this

Steps:

Call myXML.listChildren().
View the returned array.

Snippet:

/*
<?xml version="1.0"?>
<animals>
  <mammal id="0" species="Capra hircus">Goat</mammal>
  <mammal id="1" species="Panthera pardus">Leopard</mammal>
  <mammal id="2" species="Equus zebra">Zebra</mammal>
  <reptile id="3" species="Caretta caretta">Turtle</reptile>
</animals>
*/

let myXML;

// Load the XML and create a p5.XML object.
function preload() {
  myXML = loadXML('assets/animals.xml');
}

function setup() {
  noCanvas();

  // Get the names of the element's children as an array.
  let children = myXML.listChildren();

  // Print the array to the console.
  // Desired: ["mammal", "mammal", "mammal", "reptile"]
  // Actual: ["#text", "mammal", "#text", "mammal", "#text", "mammal", "#text", "reptile", "#text"]
  print(children);
}

Here's the sketch for reference.

kris08052000 commented 1 month ago

Hi,

I found the bug,

If you run this correct code in here it works, I just did changes without setting up the code in my machine.

Go here add this place for testing https://editor.p5js.org/mcintyre/sketches/jnRZA8KHw
Add this code

let myXML;

// Load the XML and create a p5.XML object. function preload() { myXML = loadXML('assets/animals.xml'); }

function setup() { noCanvas();

// Get the names of the element's children as an array. let children = listChildren(myXML);

// Print the array to the console. // Desired: ["mammal", "mammal", "mammal", "reptile"] // Actual: ["#text", "mammal", "#text", "mammal", "#text", "mammal", "#text", "reptile", "#text"] print(children); }

Run it, it works.

kris08052000 commented 1 month ago

If you want to do full code changes in repository you need to make changes in these places.

[p5.js]/src/io/p5.XML.js : Here Inside listChildren() function, add the if statement to // Only include element nodes, filter out text nodes.

correct code :

listChildren(xml) { const arr = []; for (let i = 0; i < xml.DOM.childNodes.length; i++) { // Only include element nodes, filter out text nodes if (xml.DOM.childNodes[i].nodeType === 1) { // 1 is the nodeType for element nodes arr.push(xml.DOM.childNodes[i].nodeName); } } return arr; }

Inside the Setup() function : Give the "myxml" as arguments of the listChildren() function.

correct code :

function setup() { noCanvas();

// Get the names of the element's children as an array. let children = listChildren(myXML);

// Print the array to the console. // Desired: ["mammal", "mammal", "mammal", "reptile"] print(children); }

raquelcps commented 1 month ago

👋
I don't see that this bug has been assigned to anyone or that a PR is up so I'd like to volunteer...unless @kris08052000 you were planning to do it?

I suspect the suggested solution of checking for nodeType will work so I would follow through with that idea and also add a test to the existing suite.

I did read through the contributor docs, but I am new to this codebase and contributing community so apologies if I'm not following protocol in some way. ✌️

limzykenneth commented 1 month ago

Simply filtering out #text nodes will not work as #text nodes are legitimate child nodes. For example

<?xml version="1.0"?>
<animals>
  There are 4 animals here.
  <mammal id="0" species="Capra hircus">Goat</mammal>
  <mammal id="1" species="Panthera pardus">Leopard</mammal>
  <mammal id="2" species="Equus zebra">Zebra</mammal>
  <reptile id="3" species="Caretta caretta">Turtle</reptile>
</animals>

One would expect the #text node containing the string "There are 4 animals here." to be one of the child nodes of <animals>. The "extra" #text nodes that shows up in this case is because of how whitespace is treated in XML (and HTML as well), in that any whitespace either space characters or new line characters still count as text in the tags, more here.

A possible way is to only ignore #text nodes with only whitespace content but this implementation should first evaluate possible performance impact.

raquelcps commented 4 weeks ago

@limzykenneth thanks for the link and explanation about the string being considered a child.

So with listChildren(), my question is, what is the intended behavior as to what counts as a child - is it all nodes (which include elements, strings and comments) or strictly elements?

Based on the language used in the Reference docs, it seems like only elements should be included as children. Here'e the Reference description for listChildren() method:

Returns an array with the names of the element's child elements as Strings.

How I understand the description, I do not expect that calling the method on this

<?xml version="1.0"?>
<animals>
  There are 4 animals here.
  <mammal id="0" species="Capra hircus">Goat</mammal>
  <mammal id="1" species="Panthera pardus">Leopard</mammal>
  <mammal id="2" species="Equus zebra">Zebra</mammal>
  <reptile id="3" species="Caretta caretta">Turtle</reptile>
</animals>

will include "There are 4 animals here" in the returned array.

What's your take on the description and the intended behavior of the method?

limzykenneth commented 4 weeks ago

@raquelcps For me, elements is basically everything. What is being referred to as elements here is more usually referred to as tags as in HTML/XML tags, although the definition can drift.

Regardless, in terms of utility, I do think having #text included is usually more useful than not.

raquelcps commented 3 weeks ago

@limzykenneth so the expected behavior is to include all nodes as children. Cool, thanks for clarifying. Perhaps, separately from this bug fix, that could be made clearer in the Reference docs by adding an example that shows #text as part of the returned array?

And for this bug, adding test scenarios for both situations seems like a good idea.

I'll see if I can get started on this over the weekend.

limzykenneth commented 1 week ago

I think the blank #text node probably still make sense to exclude, probably something like a regex that see if a #text node's content is all whitespace or not, although I'm not sure if it will have significant performance impact with large enough XML file.