Open · Martin2877 opened this issue 4 years ago
Thanks for the tool, but I'd like to ignore duplicates during crawling, not afterwards.
@j3ssie any way to avoid duplicate URLs? On some domains the crawl never ends and keeps repeating the same URL:

[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
It also needs a switch to filter / skip URLs returning a particular status code.
Hi, I am having the same problem. The issue is go-colly itself, since it does not really take care of duplicates. When you scrape multiple domains it is actually pretty common to end up in infinite loops.
Colly has an optional Redis backend that can take care of this: https://github.com/gocolly/redisstorage. I updated it to support go-redis v8 (https://github.com/gocolly/redisstorage/issues/4#issuecomment-953871322); I sadly forgot to add ctx in one call, but if you open it in VSCode it'll tell you.
I think the idea of this project was to be portable, so it kind of makes sense not to force a database onto people. You could actually do this in memory as well.
That whole Redis queue setup can be added in crawler.go right below "c := colly.NewCollector(" (just search for it).
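For reference, the wiring could look roughly like this, based on the gocolly/redisstorage README (a sketch only; the address, DB, and prefix are placeholders, and newCollector is just an illustrative wrapper, not gospider's actual code):

import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/redisstorage"
)

// newCollector is a hypothetical helper showing where the storage hook goes.
func newCollector() (*colly.Collector, error) {
    c := colly.NewCollector()

    // Back the collector's visited-URL set (and cookies) with Redis,
    // so duplicates are skipped across runs sharing the same prefix.
    storage := &redisstorage.Storage{
        Address:  "127.0.0.1:6379", // placeholder Redis address
        Password: "",
        DB:       0,
        Prefix:   "gospider", // placeholder key prefix
    }
    if err := c.SetStorage(storage); err != nil {
        return nil, err
    }
    return c, nil
}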
I'll share my code when I have fully implemented this. I actually have it running without colly in a much simpler scraper that just uses HTTP and regex on the HTML.
Here is how I solved the issue in my project (I have two queues, a toScrape and a hasScraped):
currentUrl := toScrapeQueue.Pop(nameOfQueue)
if currentUrl == "" {
    // Nothing left to scrape right now.
    c.Status = statusIdle
    continue
}
c.Status = statusPreparing
if !notScrapeable(currentUrl) {
    log("Starting to crawl "+currentUrl, errorNotice)
    req := NewRequest(currentUrl)
    c.Status = statusResponse
    resp, err := req.Do()
    if err != nil {
        logError(err)
        continue
    }
    c.Status = statusParsing
    if resp.IsDiscarded {
        log("Request to "+currentUrl+" discarded", errorNotice)
        continue
    }
    log("Crawled "+currentUrl, errorNotice)

    // Redis: queue every extracted URL that has not been scraped yet.
    allUrlsExtracted := extractURLs(string(resp.Body))
    for urlToTest := range allUrlsExtracted {
        if !hasScrapedQueue.IsMember(urlToTest, redisHasScrapedQueue) {
            toScrapeQueue.Push(urlToTest, redisToScrapeQueue)
        }
    }
    // Mark the current URL as scraped so it is never queued again.
    hasScrapedQueue.UniqueAdd(currentUrl, redisHasScrapedQueue)
}
And my Redis Wrapper:
// Assumed surrounding declarations (not shown in the original snippet):
// a package-level ctx and a Queue struct wrapping a go-redis v8 client.
var ctx = context.Background()

type Queue struct {
    red *redis.Client // github.com/go-redis/redis/v8
}

// UniqueAdd adds value to the Redis set stored at key (SADD); duplicates are ignored.
func (q *Queue) UniqueAdd(value string, key string) {
    newLength, err := q.red.SAdd(ctx, key, value).Result()
    if err != nil {
        log("Could not push item nr. ("+fmt.Sprint(newLength)+") -> "+err.Error(), errorError)
    }
}

// IsMember reports whether value is in the set stored at key (SISMEMBER).
func (q *Queue) IsMember(value string, key string) bool {
    isMember, _ := q.red.SIsMember(ctx, key, value).Result()
    return isMember
}

// AllMembers returns every member of the set stored at key (SMEMBERS).
func (q *Queue) AllMembers(key string) []string {
    allMembers, _ := q.red.SMembers(ctx, key).Result()
    return allMembers
}

// Size returns the length of the list stored at key (LLEN).
func (q *Queue) Size(key string) int64 {
    queueLen, _ := q.red.LLen(ctx, key).Result()
    return queueLen
}

// SetSize returns the cardinality of the set stored at key (SCARD).
func (q *Queue) SetSize(key string) int64 {
    queueLen, _ := q.red.SCard(ctx, key).Result()
    return queueLen
}

// Pop removes and returns the first element of the list stored at key (LPOP);
// it returns "" when the list is empty.
func (q *Queue) Pop(key string) string {
    poppedElement, err := q.red.LPop(ctx, key).Result()
    if err != nil {
        log("Could not pop ("+poppedElement+") -> "+err.Error(), errorError)
    }
    return poppedElement
}

// Push prepends value to the list stored at key (LPUSH).
func (q *Queue) Push(value string, key string) {
    newLength, err := q.red.LPush(ctx, key, value).Result()
    if err != nil {
        log("Could not push item nr. ("+fmt.Sprint(newLength)+") -> "+err.Error(), errorError)
    }
}
Redis SADD is actually not slowing me down as much as I thought (it's O(1) per added member); you can read about what I mean here: https://redis.io/commands/sadd
EDIT: Colly v2 doesn't support the queue anymore... lol
I actually found a simpler fix: you can use HasVisited.
The colly documentation sucks... I was searching through the code to find how they implement that in-memory check.
// HasVisited checks if the provided URL has been visited
func (c *Collector) HasVisited(URL string) (bool, error) {
    return c.checkHasVisited(URL, nil)
}

// HasPosted checks if the provided URL and requestData has been visited
// This method is useful more likely to prevent re-visit same URL and POST body
func (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error) {
    return c.checkHasVisited(URL, requestData)
}
which then calls
func (c *Collector) checkHasVisited(URL string, requestData map[string]string) (bool, error) {
    h := fnv.New64a()
    h.Write([]byte(URL))
    if requestData != nil {
        h.Write(streamToByte(createFormReader(requestData)))
    }
    return c.store.IsVisited(h.Sum64())
}
and returns a bool (plus an error), so in crawler.go you could do something like:
hasVisited, _ := crawler.C.HasVisited(urlString)
if !hasVisited {
    _ = e.Request.Visit(urlString)
}
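As a rough, self-contained illustration of the same idea against the plain colly v2 API (a sketch; the domain and selector are placeholders, this is not gospider's actual crawler.go):

package main

import (
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(colly.AllowedDomains("example.com"))

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // Resolve the link against the current page URL.
        u := e.Request.AbsoluteURL(e.Attr("href"))
        if u == "" {
            return
        }
        // Skip URLs the collector has already visited.
        if visited, _ := c.HasVisited(u); visited {
            return
        }
        _ = e.Request.Visit(u)
    })

    c.OnRequest(func(r *colly.Request) {
        log.Println("visiting", r.URL.String())
    })

    if err := c.Visit("https://example.com/"); err != nil {
        log.Fatal(err)
    }
}

(Colly's Visit already returns ErrAlreadyVisited for duplicates when URL revisits are disallowed, so the explicit HasVisited call mainly makes the skip explicit instead of surfacing as an error.)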
@j3ssie any plans to add this to avoid duplicate URLs? In my case gospider keeps crawling the same URL and gets stuck / never ends, so I have to kill it manually.
Hi guys. I want to thank you for the great tool, and I have a suggestion. As the pic above shows, there are many similar URLs on one site. Is there some method to ignore them and just fetch one of them?
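One way to approximate this (a hedged sketch, not an existing gospider option) is to normalize URLs before the visited check, e.g. dropping query-string values and collapsing numeric path segments so that "similar" URLs map to the same key; the normalizeURL helper below is purely illustrative:

package main

import (
    "fmt"
    "net/url"
    "regexp"
    "strings"
)

// numericSegment matches path segments that are purely digits, e.g. /post/123.
var numericSegment = regexp.MustCompile(`^\d+$`)

// normalizeURL maps similar URLs (same path shape, different IDs or
// query values) to one canonical key, so only the first variant is crawled.
func normalizeURL(raw string) (string, error) {
    u, err := url.Parse(raw)
    if err != nil {
        return "", err
    }
    // Keep query parameter names but drop their values.
    q := u.Query()
    for name := range q {
        q[name] = []string{""}
    }
    u.RawQuery = q.Encode()

    // Collapse purely numeric path segments into a placeholder.
    parts := strings.Split(u.Path, "/")
    for i, p := range parts {
        if numericSegment.MatchString(p) {
            parts[i] = "{id}"
        }
    }
    u.Path = strings.Join(parts, "/")
    u.Fragment = ""
    return u.String(), nil
}

func main() {
    seen := map[string]bool{}
    for _, raw := range []string{
        "https://example.com/post/1?ref=a",
        "https://example.com/post/2?ref=b",
        "https://example.com/about",
    } {
        key, err := normalizeURL(raw)
        if err != nil {
            continue
        }
        if seen[key] {
            fmt.Println("skip ", raw)
            continue
        }
        seen[key] = true
        fmt.Println("fetch", raw)
    }
}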