gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.13k stars 1.76k forks

How should multiple collectors and redis be used together, any examples? #692

Open 1-bytes opened 2 years ago

1-bytes commented 2 years ago

After I clone a Collector, I'm not sure whether the clone should share the original's storage and queue or get its own.

I referenced http://go-colly.org/docs/examples/redis_backend/ and http://go-colly.org/docs/examples/coursera_courses/

regards

1-bytes commented 2 years ago

Do I need to create a storage and queue for each collector?

I'm not sure if I should do this :(

example:

package main

import (
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/queue"
    "github.com/gocolly/redisstorage"
)

func main() {
    urls := []string{
        "http://httpbin.org/",
        "http://httpbin.org/ip",
        "http://httpbin.org/cookies/set?a=b&c=d",
        "http://httpbin.org/cookies",
    }
    urls2 := []string{
        "https://cmd5.org/",
        "https://cmd5.org/login.aspx",
    }

    c := colly.NewCollector()
    c2 := c.Clone()
    // create the redis storage
    storage := &redisstorage.Storage{
        Address:  "192.168.100.101:6379",
        Password: "",
        DB:       0,
        Prefix:   "httpbin_test",
    }
    storage2 := &redisstorage.Storage{
        Address:  "192.168.100.101:6379",
        Password: "",
        DB:       0,
        Prefix:   "cmd5.org",
    }

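    // add storage to the collectors; SetStorage initializes the redis client and can fail
    if err := c.SetStorage(storage); err != nil {
        log.Fatal(err)
    }
    if err := c2.SetStorage(storage2); err != nil {
        log.Fatal(err)
    }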

    // close redis client
    defer storage.Client.Close()
    defer storage2.Client.Close()

    // create a new request queue with redis storage backend for each collector
    q, err := queue.New(3, storage)
    if err != nil {
        log.Fatal(err)
    }
    q2, err := queue.New(4, storage2)
    if err != nil {
        log.Fatal(err)
    }

    c.OnResponse(func(r *colly.Response) {
        log.Println("[c]Cookies:", c.Cookies(r.Request.URL.String()))
    })

    c2.OnResponse(func(r *colly.Response) {
        // read cookies from c2's own storage, not c's
        log.Println("[c2]Cookies:", c2.Cookies(r.Request.URL.String()))
    })
    // add URLs to the queue
    for _, u := range urls {
        q.AddURL(u)
    }
    for _, u := range urls2 {
        q2.AddURL(u)
    }
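    // consume requests; Run blocks until its queue is empty, so q2 starts only after q finishes
    if err := q.Run(c); err != nil {
        log.Fatal(err)
    }
    if err := q2.Run(c2); err != nil {
        log.Fatal(err)
    }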
}

1-bytes commented 2 years ago

Is there a more elegant way to achieve this? The approach above results in a lot of duplicated code ...
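
One possible way to cut the repetition, as a rough sketch only: group the URLs by host and loop over the groups, so each site gets one storage, one queue, and one cloned collector from the same few lines. The `crawlJob` type and `groupByHost` helper below are hypothetical names, not part of colly, and the sketch assumes every site can use the same thread count. It uses only the standard library; the colly and redisstorage wiring is left as a comment because it would be the same calls as in the example above.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"sort"
)

// crawlJob is a hypothetical helper type (not part of colly). It holds
// everything that differs between the duplicated blocks: the redis key
// prefix, the queue's thread count, and the start URLs.
type crawlJob struct {
	Prefix  string
	Threads int
	URLs    []string
}

// groupByHost buckets the URLs by host name and returns one job per host,
// sorted by prefix so the order is deterministic.
func groupByHost(urls []string, threads int) ([]crawlJob, error) {
	byHost := make(map[string]*crawlJob)
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", raw, err)
		}
		host := u.Hostname()
		if host == "" {
			return nil, fmt.Errorf("no host in %q", raw)
		}
		job, ok := byHost[host]
		if !ok {
			job = &crawlJob{Prefix: host, Threads: threads}
			byHost[host] = job
		}
		job.URLs = append(job.URLs, raw)
	}
	jobs := make([]crawlJob, 0, len(byHost))
	for _, j := range byHost {
		jobs = append(jobs, *j)
	}
	sort.Slice(jobs, func(i, k int) bool { return jobs[i].Prefix < jobs[k].Prefix })
	return jobs, nil
}

func main() {
	jobs, err := groupByHost([]string{
		"http://httpbin.org/",
		"http://httpbin.org/ip",
		"https://cmd5.org/",
		"https://cmd5.org/login.aspx",
	}, 3)
	if err != nil {
		log.Fatal(err)
	}
	for _, j := range jobs {
		// Here each job would get its own c.Clone(), a redisstorage.Storage
		// with Prefix set to j.Prefix, and queue.New(j.Threads, storage),
		// exactly as written out by hand in the example above.
		fmt.Printf("%s: %d URLs, %d threads\n", j.Prefix, len(j.URLs), j.Threads)
	}
}
```

Whether the collectors should share one storage instead is still the open question; this only removes the copy-paste if separate storages turn out to be the right choice.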