gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Queue is not working for loop #258

Open 0xTanvir opened 5 years ago

0xTanvir commented 5 years ago
package main

import (
    "fmt"
    "io/ioutil"
    "strings"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/queue"
)

func main() {
    links := ReadInput()
    q := AddRawUrl(links)

    c := colly.NewCollector()

    c.OnHTML(".listing-details", func(e *colly.HTMLElement) {
        result := []string{}
        result = append(result, e.Request.URL.String())
        result = append(result, e.ChildText(".wpbdp-field-company_name span"))
        result = append(result, e.ChildText(".wpbdp-field-type_of_business span"))
        result = append(result, e.ChildText(".wpbdp-field-company_address span"))
        result = append(result, e.ChildText(".wpbdp-field-country span"))
        result = append(result, e.ChildText(".wpbdp-field-phone_number span"))
        result = append(result, e.ChildText(".wpbdp-field-website span"))
        result = append(result, e.ChildText(".wpbdp-field-email span"))
        fmt.Println(result)
    })

    // Set error handler
    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("Timeout at request: ", r.Request.URL.String(), "\n Now Retrying")
        r.Request.Retry()
    })

    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Scrapped: record ", r.Request.URL.String())
    })

    q.Run(c)

    fmt.Println("Scraping finished....")
}

func ReadInput() []string {
    // Read from file
    b, err := ioutil.ReadFile("input.txt") // just pass the file name
    if err != nil {
        fmt.Print(err)
    }
    str := string(b) // convert content to a 'string'

    // split each row
    rows := strings.Split(str, "\n")
    return rows
}

func AddUrl(rows []string) *queue.Queue {
    Q, _ := queue.New(
        1, // Number of consumer threads
        &queue.InMemoryQueueStorage{MaxSize: 10000},
    )
    for _, url := range rows {
        Q.AddURL(url)
    }
    return Q
}

func AddRawUrl(rows []string) *queue.Queue {
    Q, _ := queue.New(
        1, // Number of consumer threads
        &queue.InMemoryQueueStorage{MaxSize: 10000},
    )
    Q.AddURL("http://best-toy-importers.com/global-toy-importer-directory/feuchtmann-gmbh-spielwarenfabrik-3/")
    Q.AddURL("http://best-toy-importers.com/global-toy-importer-directory/limox-gmbh/")
    return Q
}

I don't know why AddUrl(rows []string) is not working while AddRawUrl(rows []string) works fine. What's wrong with AddUrl(rows []string)?

input.txt contains:

http://best-toy-importers.com/global-toy-importer-directory/feuchtmann-gmbh-spielwarenfabrik-3/
http://best-toy-importers.com/global-toy-importer-directory/limox-gmbh/
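
To narrow down the difference between the two paths, one option (a diagnostic sketch, not taken from the original report; it assumes the same imports as the snippet above) is to log exactly what AddUrl enqueues: %q makes hidden characters such as a trailing "\r" or an empty line visible, and the error returned by Q.AddURL is no longer silently discarded.

// AddUrlDebug is a hypothetical diagnostic variant of AddUrl: same queue
// setup, but each candidate URL is printed with %q so invisible characters
// show up, and the error from Q.AddURL is reported instead of ignored.
func AddUrlDebug(rows []string) *queue.Queue {
    Q, _ := queue.New(
        1, // Number of consumer threads
        &queue.InMemoryQueueStorage{MaxSize: 10000},
    )
    for i, url := range rows {
        fmt.Printf("row %d: %q\n", i, url)
        if err := Q.AddURL(url); err != nil {
            fmt.Printf("  AddURL failed: %v\n", err)
        }
    }
    return Q
}
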
higorae commented 5 years ago

I have run the code you gave as a sample using both AddRawUrl and AddUrl, and it works okay. I don't know if I am missing something; my outputs for both cases are the same.

Running with AddUrl:

it@higor:~/testes/colly $ go run main.go
[http://best-toy-importers.com/global-toy-importer-directory/feuchtmann-gmbh-spielwarenfabrik-3/ FEUCHTMANN GMBH SPIELWARENFABRIK playthings, toys, gifts, hobby products manufacturer, importer and distributor in GERMANY - GERMANY +49 9843- www.feuchtmann-gmbh.de servi@feuchtmann-spielzeug.de]
Scrapped: record http://best-toy-importers.com/global-toy-importer-directory/feuchtmann-gmbh-spielwarenfabrik-3/
[http://best-toy-importers.com/global-toy-importer-directory/limox-gmbh/ LIMOX GMBH playthings, toys, gifts, hobby products importer and distributor in GERMANY Lilienthalstrasse 13, 34123 Kassel GERMANY +49 561-507 www.limox.de offi@limox.de]
Scrapped: record http://best-toy-importers.com/global-toy-importer-directory/limox-gmbh/
Scraping finished....

Running with AddRawUrl:

it@higor:~/testes/colly $ go run main.go
[http://best-toy-importers.com/global-toy-importer-directory/feuchtmann-gmbh-spielwarenfabrik-3/ FEUCHTMANN GMBH SPIELWARENFABRIK playthings, toys, gifts, hobby products manufacturer, importer and distributor in GERMANY - GERMANY +49 9843- www.feuchtmann-gmbh.de servi@feuchtmann-spielzeug.de]
Scrapped: record http://best-toy-importers.com/global-toy-importer-directory/feuchtmann-gmbh-spielwarenfabrik-3/
[http://best-toy-importers.com/global-toy-importer-directory/limox-gmbh/ LIMOX GMBH playthings, toys, gifts, hobby products importer and distributor in GERMANY Lilienthalstrasse 13, 34123 Kassel GERMANY +49 561-507 www.limox.de offi@limox.de]
Scrapped: record http://best-toy-importers.com/global-toy-importer-directory/limox-gmbh/
Scraping finished....

Sorry about the badly formatted output :(

0xTanvir commented 5 years ago

@higorae after checking your comment I tested this code on my Ubuntu machine, and holy moly, it works like a charm. But on my Windows machine AddUrl is still not working; it only collects the last result. Strange...

higorae commented 5 years ago

Hmm... It's important to say that I ran it on Ubuntu as well. Maybe it's a bug in the Windows environment. What is your Windows version?

0xTanvir commented 5 years ago

wv

dmakevic commented 5 years ago

Maybe the problem is that Windows is case-insensitive. On the Windows side, the functions AddUrl and AddURL would be treated as the same function. The AddRawUrl function works as expected because it has a different name and calls AddURL directly.
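
One more thing worth ruling out on Windows (an assumption, not confirmed in this thread): if input.txt is saved with CRLF line endings, splitting on "\n" alone leaves a trailing "\r" on every URL except possibly the last one, which would match the "only the last result is collected" symptom. A minimal sketch of a more defensive reader, assuming the same imports as the code above:

// ReadInputClean is a hypothetical, more defensive variant of ReadInput:
// it trims surrounding whitespace from every row (including a trailing "\r"
// left by CRLF line endings) and drops empty lines, so the queue only ever
// sees clean URLs regardless of how input.txt was saved.
func ReadInputClean() []string {
    b, err := ioutil.ReadFile("input.txt")
    if err != nil {
        fmt.Print(err)
    }

    var rows []string
    for _, row := range strings.Split(string(b), "\n") {
        row = strings.TrimSpace(row) // strips "\r", spaces, and tabs
        if row != "" {
            rows = append(rows, row)
        }
    }
    return rows
}

If the symptom disappears with this reader, line endings were the culprit; if it persists, the name-collision theory above can be tested by simply renaming AddUrl.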