tuananh / camaro

camaro is a utility for transforming XML to JSON, using Node.js bindings to the native XML parser pugixml, one of the fastest XML parsers around.
MIT License

Issue with whitespace #6

Open freshyill opened 7 years ago

freshyill commented 7 years ago

I'm having an issue with whitespace, and I'm wondering whether camaro is handling it as designed or whether I should look to another package to help with this.

Given this chunk of XML (truncated, but you get the idea):

<body>
 … 
he conducted research in immunology and rheumatology.</p>
</sec>
</sec>
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President …

Using this to construct my template…

body: "article/body",

I get this result…

he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President 

I do want to take the entire text of the body as just text, without any tags preserved. Should I expect to see a space character between where tags were stripped, or should it be concatenated like this?

tuananh commented 7 years ago

Since the HTML data in your example is valid XML, it gets parsed as well. So when you query article/body, instead of getting a node with string content inside, you get a node with child nodes inside. Getting the string value of that node strips all the tags inside it and leaves only the text.
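To illustrate why the text ends up joined without spaces, here is a minimal sketch of how a string value is typically computed: the text of all descendant text nodes is concatenated in document order, with nothing inserted between them. The node objects and the `stringValue` helper are hypothetical and only stand in for the parsed tree; they are not camaro's or pugixml's actual data structures. The sketch also assumes the parser has discarded the whitespace-only text between tags, which would explain why the line breaks in the source XML do not survive.

```javascript
// Hypothetical, simplified node model (not camaro's or pugixml's real types).
// The string value of an element is the concatenation of all descendant text.
function stringValue(node) {
    if (node.type === 'text') return node.value
    // Join child values with nothing in between: this is where the
    // "Eye on 45Protests take shape" effect comes from.
    return node.children.map(stringValue).join('')
}

// Assumed shape of the tree once whitespace-only text between tags is dropped.
const sec = {
    type: 'element', name: 'sec', children: [
        { type: 'element', name: 'title', children: [{ type: 'text', value: 'Eye on 45' }] },
        { type: 'element', name: 'sec', children: [
            { type: 'element', name: 'title', children: [{ type: 'text', value: 'Protests take shape' }] },
            { type: 'element', name: 'p', children: [{ type: 'text', value: 'As U.S. President' }] }
        ] }
    ]
}

console.log(stringValue(sec))
// prints "Eye on 45Protests take shapeAs U.S. President"
```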

The proper way to put data like this in XML is to wrap it inside CDATA, like this:

const transform = require('camaro')

const xml = `
<xml>
    <html>
        <![CDATA[
        <body>
            <p>
                ...he conducted research in immunology and rheumatology
            </p>
            <sec disp-level="1" />
            <title>Eye on 45</title>
            <sec disp-level="2" />
            <title>Protests take shape</title>
        </body>
        ]]>
    </html>
</xml>
`
const result = transform(xml, {
    html: 'xml/html'
})

console.log(JSON.stringify(result, null, 2))

freshyill commented 7 years ago

The XML I'm working with is as proper as it's going to get. This example uses JATS, which is a highly structured and quite strict DTD used in scholarly publishing.

It's possible my example wasn't entirely clear. I do want to strip all tags. In this case, I'm only interested in the text.

he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President
                                                    ^         ^                  ^

I've marked where removed tags resulted in text being concatenated. Would you consider having a single space character be placed between removed tags instead of concatenating the text, maybe as an option?

tuananh commented 7 years ago

I see. You only want a space character placed where those tags were removed. For now, that's not possible because I don't check whether the path points to a leaf node or to a node that contains child nodes.
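Until something like that exists, one possible workaround is to handle the text outside camaro. The sketch below is only an assumption about how that could look: `stripTagsWithSpaces` is a hypothetical helper, not part of camaro, and it works on the raw XML string with regular expressions rather than a real parser. It replaces each tag with a space and then collapses runs of whitespace.

```javascript
// Hypothetical helper, not part of camaro: turns every tag into a single
// space, then collapses whitespace so removed tags never glue words together.
function stripTagsWithSpaces(xmlFragment) {
    return xmlFragment
        .replace(/<[^>]+>/g, ' ') // each opening, closing, or empty tag becomes a space
        .replace(/\s+/g, ' ')     // collapse newlines, indentation, repeated spaces
        .trim()
}

const fragment = `
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President</p>
</sec>
</sec>
`

console.log(stripTagsWithSpaces(fragment))
// prints "Eye on 45 Protests take shape As U.S. President"
```

This is a rough approach with known limits: it does not decode entities, it does not understand CDATA sections or comments, it breaks on attribute values that contain `>`, and it also puts spaces around inline markup, so something like `H<sub>2</sub>O` would come out as `H 2 O`.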

freshyill commented 7 years ago

OK, thank you for considering!