lestrrat-go / libxml2

Interface to libxml2, with DOM interface
MIT License

node.String() doesn't return matching HTML tag pairs when the HTML contains a script tag. #10

Closed wxf4150 closed 8 years ago

wxf4150 commented 8 years ago
// Lives in the package's own test files; it needs "testing", the package's
// xpath subpackage, and testify's assert package imported.
func TestNodeStringWithScriptTag(t *testing.T) {
	// A template embedded in a <script> element, as emitted by Confluence.
	scriptTag := `<script type="text/x-template" title="searchResultsGrid">
            <table class="aui">
                <thead>
                <tr class="header">
                    <th class="search-result-title">Page Title</th>
                    <th class="search-result-space">Space</th>
                    <th class="search-result-date">Updated</th>
                </tr>
                </thead>
            </table>
        </script>`

	doc, err := ParseHTMLString(scriptTag)
	if !assert.NoError(t, err, "ParseHTMLString should succeed") {
		return
	}

	// Select the <script> element that was just parsed.
	nodes := xpath.NodeList(doc.Find(`.//script`))
	if !assert.NotEmpty(t, nodes, "XPath Find should succeed") {
		return
	}

	// Serialize the node list back to markup.
	v := nodes.String()

	if !assert.NotEmpty(t, v, "String() should return some string") {
		return
	}
	// The serialized output is expected to match the input exactly.
	if !assert.Equal(t, scriptTag, v, "String() and scriptTag should be equal") {
		return
	}
	t.Logf("v = '%s'", v)
}

nodes.String() loses the following closing tags:

  </th>       </tr>     </thead>    </table>

I forked the repository and added a test file here: https://github.com/wxf4150/go-libxml2/commit/27db593c90965569a3f9ab1aa73e4b268545a72d

lestrrat commented 8 years ago

That piece of HTML doesn't look right. <script> tags should not contain HTML inside them.

For example, try running vanilla xmllint:

shoebill% xmllint -html hoge.txt 
hoge.txt:5: HTML parser error : Unexpected end tag : th
                    <th class="search-result-title">Page Title</th>
                                                                   ^
hoge.txt:6: HTML parser error : Unexpected end tag : th
                    <th class="search-result-space">Space</th>
                                                              ^
hoge.txt:7: HTML parser error : Unexpected end tag : th
                    <th class="search-result-date">Updated</th>
                                                               ^
hoge.txt:8: HTML parser error : Unexpected end tag : tr
                </tr>
                     ^
hoge.txt:9: HTML parser error : Unexpected end tag : thead
                </thead>
                        ^
hoge.txt:10: HTML parser error : Unexpected end tag : table
            </table>
                    ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script type="text/x-template" title="searchResultsGrid">
            <table class="aui">
                <thead>
                <tr class="header">
                    <th class="search-result-title">Page Title
                    <th class="search-result-space">Space
                    <th class="search-result-date">Updated

        </script></head></html>

libxml2 can't really handle it (well, it can, but it doesn't quite work the way you expect), so there's no way go-libxml2 will be able to handle it.

wxf4150 commented 8 years ago

The code comes from the page source of https://docs.strongloop.com/display/public/LB/Include+filter
("Powered by Atlassian Confluence"), so maybe the usage is right.

I found that moovweb/gokogiri can parse this code and returns the right thing.

lestrrat commented 8 years ago

> maybe the usage is right.

No, per spec, it's wrong. The HTML 4 DTD clearly specifies that <script> tags should contain only CDATA. Browsers can interpret it however they want to. But by default (and I stress the by default) libxml2 is a validating XML parser, and it should honor DTDs.

> I found that moovweb/gokogiri can parse this code and returns the right thing.

Well, then it would have been very nice if you had said so to begin with, or included code snippets!

So, now that I know what to compare with, and looking at gokogiri's code, it uses HTML_PARSE_RECOVER by default.

That means that all recoverable errors are automatically recovered on a best-effort basis.

I personally didn't see why you would want this behavior when you explicitly choose to use a validating parser like libxml2, and other libraries that I based my version upon don't have that turned on either, so it's not included by default. This is why gokogiri and my go-libxml2 differ.

You are welcome to submit PRs to change the default behavior, but please make sure to understand how libxml2 works when you turn it on by default, and provide tests. Also at first glance, there's still a tiny bit of difference in the output even if you enable HTMLParseRecover. This could be my bug, or some other special treatment done by gokogiri.

Given a good PR, it will probably be merged. You should just note that my personal goal was never, and will never be, to create a libxml2 binding that is suitable for parsing HTML (I work with XML). I welcome PRs, but please don't expect me to fix these things.

wxf4150 commented 8 years ago

Thank you for explaining this. It makes sense for the project to have its own specific goal (XML).

I want to mirror some useful open-source documentation sites for users in our country. These websites are all hosted abroad, and accessing them from our country is very slow. They also reference Google or Facebook APIs, which are all blocked by the national firewall. I will use another method; maybe regex is enough.