gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

URL Filter to exclude #94

Closed: cameronbraid closed this issue 6 years ago

cameronbraid commented 6 years ago

Currently you can specify a URLFilter to include URLs. Is there any way to exclude URLs?

asciimoo commented 6 years ago

The trivial answer would be a negative lookahead regular expression, but Go's regexp package does not support lookahead. A workaround is mentioned in the first answer to this Stack Overflow question: https://stackoverflow.com/questions/26771592/negative-look-ahead-go-regular-expressions . As another option, you can register an OnRequest callback that cancels the request:

c.OnRequest(func(r *colly.Request) {
    if r.URL.Host == "unwantedhost.com" {
        r.Abort()
    }
})

Request.Abort() was introduced in commit 44e13404eb54f3abfd28007d360c4bb5ef6fa9c3.

cameronbraid commented 6 years ago

That looks like a good option. Thanks