jaeles-project / gospider

Gospider - Fast web spider written in Go

Don't crawl similar URL #21

Open Martin2877 opened 3 years ago

Martin2877 commented 3 years ago

[screenshot: gospider output listing many similar URLs from the same site]

Hi guys, I want to thank you for the great tool. I also have a suggestion: as the screenshot above shows, a single site can produce many similar URLs. Is there some way to ignore them and fetch only one of them?

tibug commented 3 years ago

https://github.com/tomnomnom/unfurl

Martin2877 commented 3 years ago

https://github.com/tomnomnom/unfurl

Thanks for the tool, but I'd like to ignore them while crawling, not afterwards.

jaikishantulswani commented 3 years ago

@j3ssie is there any way to avoid duplicate URLs? On some domains the crawl never ends and just keeps going with duplicate URLs:

[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/

It also needs a switch to filter / skip URLs that return a particular status code.
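
gospider doesn't have such a switch today; as a hedged sketch of the underlying idea, a status-code filter on top of colly could look roughly like this (the skipCodes set and the output format are made up for illustration, not gospider's actual code):

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Hypothetical set of status codes to suppress in the output.
    skipCodes := map[int]bool{404: true, 429: true, 503: true}

    c := colly.NewCollector()

    c.OnResponse(func(r *colly.Response) {
        if skipCodes[r.StatusCode] {
            // Skip responses with filtered status codes.
            return
        }
        fmt.Printf("[url] - [code-%d] - %s\n", r.StatusCode, r.Request.URL)
    })

    _ = c.Visit("https://example.com/")
}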

StasonJatham commented 2 years ago

Hi, so I am having the same problem. The issue is go-colly itself, since it doesn't really take care of duplicates. When you scrape multiple domains it is actually pretty common to end up in infinite loops.

Colly has an optional Redis backend that can take care of this: https://github.com/gocolly/redisstorage. I updated it to support go-redis v8 (https://github.com/gocolly/redisstorage/issues/4#issuecomment-953871322) ... I sadly forgot to add ctx in one call, but if you open it in VSCode it'll tell you.

I think the idea of this project was to be portable, so it makes sense not to force a database onto people. You could actually do this in memory as well.
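
For the in-memory variant, all you really need is a seen-set guarded by a mutex; a generic sketch (not gospider's code, the names are mine):

package dedupe

import "sync"

// SeenSet is a thread-safe set of URLs that have already been queued.
type SeenSet struct {
    mu   sync.Mutex
    seen map[string]struct{}
}

func NewSeenSet() *SeenSet {
    return &SeenSet{seen: make(map[string]struct{})}
}

// Add records the URL and reports whether it was new, so the caller
// only enqueues URLs it has never seen before.
func (s *SeenSet) Add(url string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.seen[url]; ok {
        return false
    }
    s.seen[url] = struct{}{}
    return true
}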

That whole redisstorage thing can be added in crawler.go right below "c := colly.NewCollector(" (just search for it).
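
Something along these lines, following the gocolly/redisstorage README (a hedged sketch, not gospider's actual crawler.go; the address and Prefix are placeholders):

package main

import (
    "log"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/redisstorage"
)

func main() {
    c := colly.NewCollector()

    // Back the collector's visited-URL set (and cookies) with Redis so
    // duplicates are skipped across the whole crawl.
    storage := &redisstorage.Storage{
        Address:  "127.0.0.1:6379",
        Password: "",
        DB:       0,
        Prefix:   "gospider",
    }
    if err := c.SetStorage(storage); err != nil {
        log.Fatal(err)
    }
    // The storage exposes its underlying Redis client; close it when done.
    defer storage.Client.Close()

    _ = c.Visit("https://example.com/")
}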

I'll share my code when I have fully implemented this. I actually have it running without colly in a much simpler scraper that just uses net/http and a regex on the HTML.

Here is how I solved the issue in my project (I have two queues, a toScrape and a hasScraped):

            // Worker loop body: take the next URL off the toScrape queue.
            currentUrl := toScrapeQueue.Pop(nameOfQueue)
            if currentUrl == "" {
                c.Status = statusIdle
                continue
            }

            c.Status = statusPreparing

            if !notScrapepable(currentUrl) {
                log("Starting to crawl "+currentUrl, errorNotice)

                req := NewRequest(currentUrl)
                c.Status = statusResponse

                resp, err := req.Do()
                if err != nil {
                    logError(err)
                    continue
                }
                c.Status = statusParsing

                if resp.IsDiscarded {
                    log("Request to "+currentUrl+" discarded", errorNotice)
                    continue
                }
                log("Crawled "+currentUrl, errorNotice)

                // Redis: only enqueue URLs that have not been scraped yet.
                allUrlsExtracted := extractURLs(string(resp.Body))
                for _, urlToTest := range allUrlsExtracted { // assuming extractURLs returns a []string
                    if !hasScrapedQueue.IsMember(urlToTest, redisHasScrapedQueue) {
                        toScrapeQueue.Push(urlToTest, redisToScrapeQueue)
                    }
                }
                // Mark the current URL as scraped so it is never queued again.
                hasScrapedQueue.UniqueAdd(currentUrl, redisHasScrapedQueue)
            }
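
The extractURLs helper isn't shown above; a rough idea of the regex-on-HTML approach (a hypothetical helper, not the exact code) would be:

package main

import "regexp"

// urlPattern matches absolute http(s) URLs in raw HTML. Crude but dependency-free.
var urlPattern = regexp.MustCompile(`https?://[^\s"'<>)]+`)

// extractURLs returns every absolute URL found in the HTML body.
func extractURLs(html string) []string {
    return urlPattern.FindAllString(html, -1)
}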

And my Redis Wrapper:

// UniqueAdd adds a value to the Redis set at key (SADD); duplicates are ignored.
func (q *Queue) UniqueAdd(value string, key string) {
    newLength, err := q.red.SAdd(ctx, key, value).Result()
    if err != nil {
        log("Could not push item nr. ("+fmt.Sprint(newLength)+") ->"+err.Error(), errorError)
    }
}

// IsMember reports whether value is already in the set at key (SISMEMBER).
func (q *Queue) IsMember(value string, key string) bool {
    isMember, _ := q.red.SIsMember(ctx, key, value).Result()
    return isMember
}

// AllMembers returns every member of the set at key (SMEMBERS).
func (q *Queue) AllMembers(key string) []string {
    allMembers, _ := q.red.SMembers(ctx, key).Result()
    return allMembers
}

// Size returns the length of the list at key (LLEN).
func (q *Queue) Size(key string) int64 {
    queueLen, _ := q.red.LLen(ctx, key).Result()
    return queueLen
}

// SetSize returns the cardinality of the set at key (SCARD).
func (q *Queue) SetSize(key string) int64 {
    queueLen, _ := q.red.SCard(ctx, key).Result()
    return queueLen
}

// Pop removes and returns the first element of the list at key (LPOP).
// Note: LPop returns redis.Nil when the list is empty, which also lands
// in this error branch.
func (q *Queue) Pop(key string) string {
    poppedElement, err := q.red.LPop(ctx, key).Result()
    if err != nil {
        log("Could not pop ("+poppedElement+") ->"+err.Error(), errorError)
    }
    return poppedElement
}

// Push prepends a value to the list at key (LPUSH).
func (q *Queue) Push(value string, key string) {
    newLength, err := q.red.LPush(ctx, key, value).Result()
    if err != nil {
        log("Could not push item nr. ("+fmt.Sprint(newLength)+") ->"+err.Error(), errorError)
    }
}
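
The wrapper assumes a Queue holding a go-redis v8 client plus a package-level ctx; those aren't shown above, but the surrounding definitions would look roughly like this (my guess, the names are placeholders):

package main

import (
    "context"

    "github.com/go-redis/redis/v8"
)

// ctx is the package-level context used by the wrapper methods above.
var ctx = context.Background()

// Queue wraps a go-redis client; each method takes the Redis key it operates on.
type Queue struct {
    red *redis.Client
}

func NewQueue(addr string) *Queue {
    return &Queue{red: redis.NewClient(&redis.Options{Addr: addr})}
}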

Redis SADD is actually not slowing me down as much as I thought (it's O(1) per element added); you can read about what I mean here: https://redis.io/commands/sadd

EDIT: Colly v2 doesn't support the queue anymore... lol

StasonJatham commented 2 years ago

I actually found a simpler fix: you can use HasVisited

The colly documentation sucks... I was searching through the code to find how they implement that in-memory check.

// HasVisited checks if the provided URL has been visited
func (c *Collector) HasVisited(URL string) (bool, error) {
    return c.checkHasVisited(URL, nil)
}

// HasPosted checks if the provided URL and requestData has been visited
// This method is useful more likely to prevent re-visit same URL and POST body
func (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error) {
    return c.checkHasVisited(URL, requestData)
}

which then calls

func (c *Collector) checkHasVisited(URL string, requestData map[string]string) (bool, error) {
    h := fnv.New64a()
    h.Write([]byte(URL))

    if requestData != nil {
        h.Write(streamToByte(createFormReader(requestData)))
    }

    return c.store.IsVisited(h.Sum64())
}

and returns a bool

so in crawler.go you could do something like

hasVisited, _ := crawler.C.HasVisited(urlString)
if !hasVisited {
    _ = e.Request.Visit(urlString)
}
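
For context, the same guard in a self-contained colly example (a generic sketch, not gospider's actual crawler.go):

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Extract links and only visit URLs the collector has not seen before.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        urlString := e.Request.AbsoluteURL(e.Attr("href"))
        if urlString == "" {
            return
        }
        hasVisited, _ := c.HasVisited(urlString)
        if !hasVisited {
            _ = e.Request.Visit(urlString)
        }
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("visiting", r.URL)
    })

    _ = c.Visit("https://example.com/")
}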
ocervell commented 1 year ago

@j3ssie any plans to add this to avoid duplicate URLs? In my case gospider keeps crawling the same URL and gets stuck / never ends, so I have to kill it manually.

jaikishantulswani commented 6 months ago

@j3ssie any plans to add this to avoid duplicate URLs? In my case gospider keeps crawling the same URL and gets stuck / never ends, so I have to kill it manually.