Closed by palvarezcordoba 6 years ago
The above example could be implemented with URLFilters: use the regexp ^https?://[^a]
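For reference, here is a minimal sketch of what that pattern accepts and rejects, using only the standard regexp package. Go's regexp syntax (RE2) has no negative lookahead, which is why a negated character class is used here. The names skipA and allowedByFilter are made up for this illustration and are not part of colly.

```go
package main

import (
	"fmt"
	"regexp"
)

// skipA is the pattern suggested above: it only matches URLs whose host
// does not begin with the letter "a". (Hypothetical name.)
var skipA = regexp.MustCompile(`^https?://[^a]`)

// allowedByFilter reports whether a URL would pass an allow-list that
// contains only skipA. (Hypothetical helper, not a colly function.)
func allowedByFilter(url string) bool {
	return skipA.MatchString(url)
}

func main() {
	for _, u := range []string{
		"https://accounts.google.com/signin",
		"https://www.google.com/",
		"http://archive.org/",
	} {
		fmt.Printf("%-40s allowed=%v\n", u, allowedByFilter(u))
	}
}
```

Note that this trick only scales to very simple exclusions; excluding a whole list of URL patterns this way quickly becomes unreadable.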
And without URL blacklisting, is it easy to implement this?
Allowed:
https://(www\.)?google\.(.*?)/q\?=
https://(www\.)?youtube\.com/(feed/trending|user|channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ)
https://policies\.google\.com/privacy
Disallowed:
https://(www\.)?google\.(.*?)/imghp
https://accounts\.google\.com/(signin|accounts)
https://support\.google\.com
https://(www\.)?youtube\.com/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
https://(mail|news)\.google\.(.*?)
https://policies\.google\.com/technologies
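To make the intended semantics concrete, here is a rough sketch of how an allow list and a deny list could be combined, using a few of the patterns above. It assumes that deny patterns are checked first and that an empty allow list permits everything; that ordering is an assumption of this sketch, not something confirmed in this thread. The function permitted and the variables allow and deny are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// A subset of the patterns listed above. (Variable names are hypothetical.)
var allow = []*regexp.Regexp{
	regexp.MustCompile(`https://(www\.)?youtube\.com/(feed/trending|user|channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ)`),
	regexp.MustCompile(`https://policies\.google\.com/privacy`),
}

var deny = []*regexp.Regexp{
	regexp.MustCompile(`https://accounts\.google\.com/(signin|accounts)`),
	regexp.MustCompile(`https://(www\.)?youtube\.com/channel/UCEgdi0XIXXZ-qJOFPf4JSKw`),
	regexp.MustCompile(`https://policies\.google\.com/technologies`),
}

// permitted returns false if any deny pattern matches. Otherwise it returns
// true when the allow list is empty or at least one allow pattern matches.
// (Assumed evaluation order for this sketch.)
func permitted(url string, allow, deny []*regexp.Regexp) bool {
	for _, re := range deny {
		if re.MatchString(url) {
			return false
		}
	}
	if len(allow) == 0 {
		return true
	}
	for _, re := range allow {
		if re.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	for _, u := range []string{
		"https://policies.google.com/privacy",
		"https://policies.google.com/technologies",
		"https://accounts.google.com/signin",
		"https://www.youtube.com/feed/trending",
		"https://support.google.com",
	} {
		fmt.Printf("%-45s permitted=%v\n", u, permitted(u, allow, deny))
	}
}
```

With two separate lists, each rule stays a plain regexp, and nothing has to be folded into one large negated expression.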
LOL I was wrong. https://gist.github.com/palvarezcordoba/846b36bdae6d9700e44110f70e0390da
package main

import (
    "github.com/OSPG/colly"
    "log"
    "regexp"
)

func main() {
    // Only follow URLs on debian.org...
    r := regexp.MustCompile(`https://.*debian\.org/?.*`)
    // ...and skip anything ending in one of these file extensions.
    r2 := regexp.MustCompile(`(png|jpg|jpeg|gif|ico|pdf|iso)$`)

    c := colly.NewCollector(
        colly.URLFilters(r),
        colly.DisallowedURLFilters(r2),
    )

    // Queue every link found on a visited page.
    c.OnHTML("a", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        if len(link) > 0 {
            // log.Println("I'm going to try to visit", link)
            c.Visit(link)
        }
    })

    // Log each request that passes the filters.
    c.OnRequest(func(r *colly.Request) {
        log.Println("visiting", r.URL.String())
    })

    log.Println(c.DisallowedURLFilters)
    c.Visit("https://debian.org")
}
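A side note on the two patterns in that example: neither is anchored, so they may match more than intended. The sketch below compares them with one possible tightening, using only the standard regexp package. The variable names and the stricter patterns are suggestions made up for this illustration, not something taken from colly or from the gist.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Patterns as written in the example above.
	looseHost = regexp.MustCompile(`https://.*debian\.org/?.*`)
	looseExt  = regexp.MustCompile(`(png|jpg|jpeg|gif|ico|pdf|iso)$`)

	// Hypothetical stricter variants: anchor the host at the start of the
	// URL and require a literal dot before the file extension.
	strictHost = regexp.MustCompile(`^https://([^/]*\.)?debian\.org(/|$)`)
	strictExt  = regexp.MustCompile(`\.(png|jpg|jpeg|gif|ico|pdf|iso)$`)
)

func main() {
	urls := []string{
		"https://www.debian.org/distrib/",
		"https://example.com/?next=https://debian.org/",
		"https://www.debian.org/intro/miso",
		"https://cdimage.debian.org/debian.iso",
	}
	for _, u := range urls {
		fmt.Printf("%s\n  host: loose=%v strict=%v\n  ext:  loose=%v strict=%v\n",
			u,
			looseHost.MatchString(u), strictHost.MatchString(u),
			looseExt.MatchString(u), strictExt.MatchString(u))
	}
}
```

Whether the stricter forms are what you want depends on the crawl; for example, strictExt will not catch a file URL that carries a query string.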
My code works fine. I will open a PR.
added in #132
Hello,
There is a blacklist of domains, but no blacklist of URL regexps. I tried to implement one and I think I did it correctly, but it doesn't work. D: This is my code: https://github.com/OSPG/colly/commit/344b0fdafaef2fcf5ca2e9d817eadb43418db80c This is an example of use: https://gist.github.com/palvarezcordoba/56d9a014a5262e258ba8b7bd83f98263
The above gist should not visit URLs starting with https://a, but it visits https://accounts.google.com/(...)
Does anybody want to help me implement it? :smiley: