gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

How to use colly for following requirements? #590

Open JDRanpariya opened 3 years ago

JDRanpariya commented 3 years ago

Question: I have to scrape 10+ different blogs for articles, extracting fields like title, author, likes, and content. Each site has different CSS selectors for these fields, so how would I handle this in one colly instance? For example, I can crawl all the sites using c.Visit(site) and get results for all of them, but how do I write a separate parsing pipeline for each site? Also, does colly have a concept of a data pipeline like Python's Scrapy does?

WGH- commented 3 years ago

You could add an OnHTML("html", ...) handler that calls e.DOM.Find(...) with a concrete selector that depends on the site you're currently processing. Not ideal, but still better than instantiating different collectors, I think.

> Also does colly has concept of data pipeline like we have in python scrapy?

I don't think so.

JDRanpariya commented 3 years ago

It's worth trying. Will I have to add a try/catch, or will colly just ignore it if OnHTML()'s specific selector is not found?

JDRanpariya commented 3 years ago

Also, is there any feature update you are working on to add a data pipeline?

WGH- commented 3 years ago

> I feel like I may have to add try catch or colly will just ignore if onHTML()'s specific selector is not found?

"No matching elements" is not an error here.

> Also Is there any feature update which you guys are working on to get data pipeline?

I don't think so. Colly leans more toward being a crawler framework than a scraping library.
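Since there is no built-in pipeline, post-processing can be approximated in plain Go. The sketch below is loosely modelled on the idea of Scrapy's item pipelines and uses only the standard library. `Article`, `Stage`, `runPipeline`, `trimFields`, and `dropUntitled` are hypothetical names, not colly APIs.

```go
package main

import (
	"fmt"
	"strings"
)

// Article is a hypothetical scraped record.
type Article struct {
	Title, Author string
}

// Stage transforms an article; returning false drops it from the pipeline.
type Stage func(Article) (Article, bool)

// runPipeline applies the stages in order to each item and
// returns the items that every stage kept.
func runPipeline(items []Article, stages ...Stage) []Article {
	var out []Article
	for _, item := range items {
		keep := true
		for _, stage := range stages {
			item, keep = stage(item)
			if !keep {
				break
			}
		}
		if keep {
			out = append(out, item)
		}
	}
	return out
}

// trimFields strips surrounding whitespace from every field.
func trimFields(a Article) (Article, bool) {
	a.Title = strings.TrimSpace(a.Title)
	a.Author = strings.TrimSpace(a.Author)
	return a, true
}

// dropUntitled discards articles whose title is empty.
func dropUntitled(a Article) (Article, bool) {
	return a, a.Title != ""
}

func main() {
	scraped := []Article{
		{Title: "  Hello colly ", Author: "ann"},
		{Title: "   ", Author: "bob"},
	}
	for _, a := range runPipeline(scraped, trimFields, dropUntitled) {
		fmt.Printf("%q by %q\n", a.Title, a.Author)
	}
}
```

In a real crawl, the slice of articles would be filled from the collector's callbacks. Keeping the stages as plain functions means they can be tested without any network access.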