gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.4k stars 1.77k forks

Possible to crawl and scrape text files? #572

Open prologic opened 3 years ago

prologic commented 3 years ago

I don't see an API for this besides Collector.OnResponse. Is it possible to scrape plain-text (text/plain) files? My use case is scraping a bunch of hosted text files that may link to other text files.

prologic commented 3 years ago

Ping?

WGH- commented 3 years ago

What are your expectations? Do you want colly to find URLs in plain text using some heuristics?

prologic commented 3 years ago

> What are your expectations? Do you want colly to find URLs in plain text using some heuristics?

Yes. I want the same semantics that the colly API provides for HTML, but for plain text. Look for (perhaps using regexes?) things that might look like links to other documents.

In my case I want to scrape twtxt feeds, looking for links of the form <nick url>. I realize this isn't exactly an HTML link per se, but could the API provide a custom search so that I can define what counts as a "link" in a document?

WGH- commented 3 years ago

What exactly is wrong with doing such regex heuristics in OnResponse yourself?
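
For illustration, here is a minimal sketch of what such a regex heuristic could look like for twtxt-style <nick url> mentions. It uses only the Go standard library. The pattern, the Mention type, and the extractMentions helper are all assumptions made for this example, not part of Colly or of any twtxt specification.

```go
package main

import (
	"fmt"
	"regexp"
)

// mentionRe is an assumed pattern for "<nick url>" mentions: a nick without
// whitespace, then whitespace, then an http(s) URL, all inside angle brackets.
var mentionRe = regexp.MustCompile(`<(\S+)\s+(https?://[^\s>]+)>`)

// Mention is a hypothetical helper type holding one extracted link.
type Mention struct {
	Nick string
	URL  string
}

// extractMentions returns every <nick url> pair found in body, in order of
// first appearance, skipping URLs that were already seen.
func extractMentions(body string) []Mention {
	seen := make(map[string]bool)
	var out []Mention
	for _, m := range mentionRe.FindAllStringSubmatch(body, -1) {
		nick, u := m[1], m[2]
		if seen[u] {
			continue
		}
		seen[u] = true
		out = append(out, Mention{Nick: nick, URL: u})
	}
	return out
}

func main() {
	// Made-up feed content, used only to exercise the helper.
	feed := "2020-12-01T10:00:00Z\tHello @<alice https://example.com/alice.txt>\n" +
		"2020-12-02T11:00:00Z\tcc @<bob https://example.org/bob/twtxt.txt>\n"
	for _, m := range extractMentions(feed) {
		fmt.Println(m.Nick, m.URL)
	}
}
```

Inside a Colly crawler, a helper like this would presumably be called from a c.OnResponse callback with string(r.Body), and each extracted URL passed to r.Request.Visit to enqueue it.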

prologic commented 3 years ago

This is what I did.

WGH- commented 3 years ago

Strictly speaking, Colly doesn't really define what an "HTML link" is either: it's up to the user to enqueue more URLs in their own custom c.OnHTML("a[href]", ...) handler.

I personally don't see any merit in incorporating such site-specific heuristics into the Colly codebase, but let's see if there are other opinions about this.