gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Suggestion: Add URL pattern blacklisting #131

palvarezcordoba closed 6 years ago

palvarezcordoba commented 6 years ago

Hello,

There is a blacklist for domains, but no blacklist for URL regexps. I tried to implement one and I think I did it correctly, but it doesn't work. D: This is my code: https://github.com/OSPG/colly/commit/344b0fdafaef2fcf5ca2e9d817eadb43418db80c This is an example of its use: https://gist.github.com/palvarezcordoba/56d9a014a5262e258ba8b7bd83f98263

The above gist should not visit URLs starting with https://a, but it visits https://accounts.google.com/(...)

Does anybody want to help me implement it? :smiley:

asciimoo commented 6 years ago

The above example could be implemented with URLFilters: Use regexp ^https?://[^a].
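
Go's regexp package uses RE2 syntax, which has no lookahead, so a negated character class is one way to express "does not start with a". The sketch below only exercises the pattern with the standard library; the variable name notStartingWithA and the sample URLs are illustrative assumptions, and in colly the compiled pattern would be passed to colly.URLFilters.

```go
package main

import (
    "fmt"
    "regexp"
)

// notStartingWithA matches http(s) URLs whose host does not begin with "a".
// The name is hypothetical; the pattern is the one suggested above.
var notStartingWithA = regexp.MustCompile(`^https?://[^a]`)

func main() {
    urls := []string{
        "https://accounts.google.com/signin",
        "https://www.google.com/",
    }
    for _, u := range urls {
        // A URL passes the filter only when the pattern matches it.
        fmt.Println(u, notStartingWithA.MatchString(u))
    }
}
```

Note that the character class excludes every host beginning with "a", not just accounts.google.com, so this only fits cases as simple as the example.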

palvarezcordoba commented 6 years ago

And without URL blacklisting, is it easy to implement this?

Allowed:
https://(www\.)?google\.(.*?)/q\?=
https://(www\.)?youtube\.com/(feed/trending|user|channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ)
https://policies\.google\.com/privacy

Disallowed:
https://(www\.)?google.(.*?/)/imghp
https://accounts\.google\.com/(signin|accounts)
https://support\.google\.com
https://(www\.)?youtube\.com/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
https://(mail|news)\.google\.(.*?)
https://policies\.google\.com/technologies
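
A rough sketch of how an allow list and a deny list like the ones above could be evaluated together, using only the standard library. The helper names matchesAny and permitted are hypothetical, only a subset of the patterns is copied, and checking the deny list first is an assumption of this sketch rather than something stated in the thread.

```go
package main

import (
    "fmt"
    "regexp"
)

// A subset of the "Allowed" patterns listed above.
var allowed = []*regexp.Regexp{
    regexp.MustCompile(`https://(www\.)?youtube\.com/(feed/trending|user|channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ)`),
    regexp.MustCompile(`https://policies\.google\.com/privacy`),
}

// A subset of the "Disallowed" patterns listed above.
var disallowed = []*regexp.Regexp{
    regexp.MustCompile(`https://accounts\.google\.com/(signin|accounts)`),
    regexp.MustCompile(`https://support\.google\.com`),
    regexp.MustCompile(`https://policies\.google\.com/technologies`),
}

// matchesAny reports whether u matches at least one of the filters.
func matchesAny(filters []*regexp.Regexp, u string) bool {
    for _, f := range filters {
        if f.MatchString(u) {
            return true
        }
    }
    return false
}

// permitted (hypothetical helper) rejects any URL that hits the deny list,
// then requires a match on the allow list.
func permitted(u string) bool {
    if matchesAny(disallowed, u) {
        return false
    }
    return matchesAny(allowed, u)
}

func main() {
    for _, u := range []string{
        "https://policies.google.com/privacy",
        "https://policies.google.com/technologies",
        "https://www.youtube.com/feed/trending",
    } {
        fmt.Println(u, permitted(u))
    }
}
```

Because the patterns are unanchored, a URL can match both lists (for example an allowed page carrying a disallowed URL in its query string); in this sketch the deny list wins in that case.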

palvarezcordoba commented 6 years ago

LOL I was wrong. https://gist.github.com/palvarezcordoba/846b36bdae6d9700e44110f70e0390da

```go
package main

import (
    "log"
    "regexp"

    "github.com/OSPG/colly"
)

func main() {
    // Allow list: URLs matching the debian.org pattern.
    allowed := regexp.MustCompile(`https://.*debian\.org/?.*`)
    // Deny list: URLs ending in one of these file-type suffixes.
    disallowed := regexp.MustCompile(`(png|jpg|jpeg|gif|ico|pdf|iso)$`)

    c := colly.NewCollector(
        colly.URLFilters(allowed),
        colly.DisallowedURLFilters(disallowed),
    )

    // Follow every link found on a visited page.
    c.OnHTML("a", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        if link != "" {
            // log.Println("I'm going to try to visit", link)
            c.Visit(link)
        }
    })

    // Log each request that passes the filters.
    c.OnRequest(func(r *colly.Request) {
        log.Println("visiting", r.URL.String())
    })

    log.Println(c.DisallowedURLFilters)
    c.Visit("https://debian.org")
}
```

My code works fine. I will open a PR.
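
One thing worth checking in the example above is that both patterns are unanchored, so they may match more than intended. The sketch below compares them with stricter variants using only the standard library; strictAllow and strictDeny are hypothetical alternatives that are not part of the original code, and the sample URLs are made up for illustration.

```go
package main

import (
    "fmt"
    "regexp"
)

var (
    // Patterns taken from the example above.
    allow = regexp.MustCompile(`https://.*debian\.org/?.*`)
    deny  = regexp.MustCompile(`(png|jpg|jpeg|gif|ico|pdf|iso)$`)

    // Hypothetical stricter variants (assumptions, not from the thread):
    // the host must be debian.org or a subdomain of it, and the suffix
    // must follow a literal dot.
    strictAllow = regexp.MustCompile(`^https://([^/]+\.)?debian\.org(/|$)`)
    strictDeny  = regexp.MustCompile(`\.(png|jpg|jpeg|gif|ico|pdf|iso)$`)
)

func main() {
    urls := []string{
        "https://www.debian.org/intro/",
        "https://example.com/?next=debian.org",
        "https://www.debian.org/logo.png",
        "https://www.debian.org/mexico",
    }
    for _, u := range urls {
        fmt.Printf("%s\n  allow=%v strictAllow=%v deny=%v strictDeny=%v\n",
            u, allow.MatchString(u), strictAllow.MatchString(u),
            deny.MatchString(u), strictDeny.MatchString(u))
    }
}
```

The original allow pattern accepts any URL that merely contains debian.org somewhere after the scheme, and the original deny pattern rejects any URL whose last characters happen to spell one of the suffixes. Whether that matters depends on the sites being crawled.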

asciimoo commented 6 years ago

added in #132