gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

ChildAttr only returns one child, not all children #50

AwolDes closed this issue 6 years ago

AwolDes commented 6 years ago

[Not really an issue] Hey mate

I've been using Colly for a small scraping project and I've come across a weird bit of behaviour.

The e.ChildText() function returns the text in all of the children as one string. However, using e.ChildAttr() only returns the first match. I read through the code in colly.go and understand this is the intended behaviour, but I was wondering why you wouldn't want to return all child attributes?

Loving this package though, it's been a lot of fun to use. Thank you for keeping it up to date! Cheers

asciimoo commented 6 years ago

@AwolDes thanks for your feedback. This behavior comes from the goquery package, and I found it intuitive, but perhaps it isn't. Goquery returns the text of all the descendants of the matched element. Maybe we should add an e.ChildAttrs() function that returns a list of the attribute values of all matching elements. What do you think?

AwolDes commented 6 years ago

@asciimoo I think it would be good to add the e.ChildAttrs() function, so that an HTML snippet like the following:

<div class="block-elem">
  <div class="container">
    <span class="span-class">Text</span>
  </div>
  <div class="container">
    <span class="span-class">Text2</span>
  </div>
</div>

could be easily parsed with something like:

c.OnHTML("div.block-elem", func(e *colly.HTMLElement) {
  // Proposed API: collect the class attribute of every matching span.
  spanClasses := e.ChildAttrs("span", "class")
  fmt.Println(spanClasses)
})

to get all the span classes instead of just the first match.
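To make the difference between "first match" and "all matches" concrete, here is a minimal, stdlib-only sketch run against the snippet above. It is not colly's or goquery's code: firstAttr and allAttrs are hypothetical helpers that only imitate the semantics discussed here, and encoding/xml is used purely because this particular snippet happens to be well-formed (real-world HTML would need a proper HTML parser).

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// snippet is the HTML from the example above. It is well-formed,
// so encoding/xml can tokenize it (an assumption that does not hold
// for arbitrary HTML).
const snippet = `<div class="block-elem">
  <div class="container">
    <span class="span-class">Text</span>
  </div>
  <div class="container">
    <span class="span-class">Text2</span>
  </div>
</div>`

// allAttrs is a hypothetical helper: it returns the value of attr for
// every element named tag, in document order (the "all matches" case).
func allAttrs(doc, tag, attr string) ([]string, error) {
	var values []string
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return values, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != tag {
			continue
		}
		for _, a := range start.Attr {
			if a.Name.Local == attr {
				values = append(values, a.Value)
			}
		}
	}
}

// firstAttr is a hypothetical helper that imitates the "first match
// only" behaviour described for ChildAttr. It returns "" if nothing
// matches.
func firstAttr(doc, tag, attr string) (string, error) {
	values, err := allAttrs(doc, tag, attr)
	if err != nil || len(values) == 0 {
		return "", err
	}
	return values[0], nil
}

func main() {
	first, err := firstAttr(snippet, "span", "class")
	if err != nil {
		panic(err)
	}
	all, err := allAttrs(snippet, "span", "class")
	if err != nil {
		panic(err)
	}
	fmt.Println(first) // span-class
	fmt.Println(all)   // [span-class span-class]
}
```

Returning a slice keeps one entry per matching element, so the caller can tell how many spans matched even when, as here, they all carry the same class.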

asciimoo commented 6 years ago

Added in ac6587e