gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.18k stars 1.76k forks

[question] Is it possible to inspect or preserve HTML elements in an XML request? #451

Open ionutnechita opened 4 years ago

ionutnechita commented 4 years ago

Hi @CollyTeam,

I tried to scan an XML document, but I could not find a way to get at the HTML elements nested inside it. This is the method I used:

    c.OnXML("//testcase", func(e *colly.XMLElement) {
        temp := tc{}
        // ChildText returns only the concatenated text, so nested markup is lost.
        temp.Id = e.ChildText("/variables")
        testcase = append(testcase, temp)
    })

The "/variables" element also contains HTML elements, in this case a div and a table. How can I parse them?

The element would contain this:

    <variables>
        <div xmlns="http://www.w3.org/1999/xhtml">
            <table border="1" dir="ltr" style="width: 1058px; table-layout: fixed; -ms-word-wrap: break-word;">
                <tbody>
                    <tr>
                    ...
                    </tr>
                    <tr>
                    ...
                    </tr>
                </tbody>
            </table>
        </div>
    </variables>

If it is not possible to parse the div and table this way, how can I keep all the HTML elements so that I can parse them separately?

nonzerofloat commented 4 years ago

Use an HTML parser (goquery, net/html, ...).

colly does not support recursive decoding.

ionutnechita commented 4 years ago

Hi @gopherclass, could you help me with a small example? Either colly and goquery, or colly and net/html.

nonzerofloat commented 4 years ago

@ionutnechita https://github.com/PuerkitoBio/goquery#examples This will give you a starting point.

The net/html package I mentioned refers to https://golang.org/x/net/html. It provides low-level functionality for parsing and manipulating HTML. If you use goquery, you may never need to touch that package directly.
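For a dependency-free starting point, here is a minimal sketch that reads table rows out of such a fragment. It is a stand-in for goquery or net/html, not an example of either: it uses the standard library's encoding/xml tokenizer, which only works because the fragment in the question is XHTML, that is, well-formed XML. The `tableRows` helper and the sample cell values are invented for illustration, since the original rows are elided. Nested tables are not handled. Markup that is not well-formed XML (unclosed tags, entities such as `&nbsp;`) would make the decoder return an error, and that is where a real HTML parser is needed.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tableRows walks a well-formed XHTML fragment and returns the trimmed text
// of every <td>/<th> cell, grouped by the <tr> that contains it.
func tableRows(fragment string) ([][]string, error) {
	dec := xml.NewDecoder(strings.NewReader(fragment))
	var rows [][]string
	var cell strings.Builder
	inCell := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "tr":
				rows = append(rows, nil) // start a new, empty row
			case "td", "th":
				inCell = true
				cell.Reset()
			}
		case xml.CharData:
			if inCell {
				cell.Write(t) // collect text, including text of child elements
			}
		case xml.EndElement:
			if (t.Name.Local == "td" || t.Name.Local == "th") && inCell {
				if len(rows) > 0 {
					last := len(rows) - 1
					rows[last] = append(rows[last], strings.TrimSpace(cell.String()))
				}
				inCell = false
			}
		}
	}
}

func main() {
	// Made-up cell values; the rows in the original question are elided.
	fragment := `<div xmlns="http://www.w3.org/1999/xhtml"><table border="1"><tbody>
<tr><td>name</td><td>value</td></tr>
<tr><td>timeout</td><td>30</td></tr>
</tbody></table></div>`

	rows, err := tableRows(fragment)
	if err != nil {
		panic(err)
	}
	for i, row := range rows {
		fmt.Println(i, row)
	}
}
```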