The script goes to a page, finds all the product links, and loops through them with .Each():
bow.Find("#page #main #content #productsContainer #productLister ul li .product .productInner").Each(func(i int, s *goquery.Selection) {
title := strings.TrimSpace(s.Find(".productInfoWrapper .productInfo h3 a").Text())
click_err := bow.Click("a:contains(\"" + title + "\")")
if click_err != nil {
fmt.Println(click_err.Error())
} else {
tdesc := bow.Find("title").Text()
// commonly fails here due to request header being too large
if tdesc == "400 Bad Request" {
fmt.Printf("%v: NOT FOUND!! -- %v\n", i, tdesc)
}
}
// get some Text() from this page
bow.Back()
// I've had to add in the below to reset the request header
// is there a better way of doing this so that the .Click() doesn't slow down
// due to remote server resetting cookies?
if i != 0 {
if math.Mod(float64(i), 2) != 0 {
c := bow.SiteCookies()
cookieJar, _ := cookiejar.New(nil)
casurl, _ := url.Parse(url_link)
cookieJar.SetCookies(casurl, c)
bow.SetCookieJar(cookieJar)
// bow.DelRequestHeader("Cookie")
}
}
})
Basically the remote server returns a 400 Bad Request saying the request header is too large. Something seems to add a new set of cookies to the header with every bow.Click().
bow.DelRequestHeader("Cookie") does get around the issue, but it really slows the script down because the cookies then need to be reset for every link!
If I could just keep to one set of cookies, I think the problem would be solved. Any ideas?
https://github.com/TheInsideMan/SainsburysScraper/tree/header (run instructions in README.md)
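For illustration, here is a minimal sketch of the one-set-of-cookies idea, using only the calls already in the snippet above (SiteCookies(), SetCookieJar()): snapshot the session cookies once after the first page load, then restore that fixed set before each Click() so the Cookie header never grows. The restoreCookies helper, the pageURL placeholder, and the import paths are my own assumptions, not part of the script:

package main

import (
	"net/http"
	"net/http/cookiejar"
	"net/url"

	"github.com/headzoo/surf"
	"github.com/headzoo/surf/browser"
)

// restoreCookies is a hypothetical helper: it swaps in a fresh jar seeded
// only with a fixed snapshot of cookies, discarding whatever the remote
// server appended during the previous Click().
func restoreCookies(bow *browser.Browser, pageURL string, snapshot []*http.Cookie) error {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return err
	}
	u, err := url.Parse(pageURL)
	if err != nil {
		return err
	}
	jar.SetCookies(u, snapshot)
	bow.SetCookieJar(jar)
	return nil
}

func main() {
	bow := surf.NewBrowser()
	pageURL := "https://www.example.com/products" // stands in for url_link
	if err := bow.Open(pageURL); err != nil {
		panic(err)
	}

	// Snapshot the session cookies once, right after the first page load.
	snapshot := bow.SiteCookies()

	// Then, inside the .Each() loop, restore the snapshot before every
	// bow.Click(...) instead of rebuilding the jar on every other pass:
	if err := restoreCookies(bow, pageURL, snapshot); err != nil {
		panic(err)
	}
}

Since the jar is always rebuilt from the same snapshot, the Cookie header should stay a constant size, with no need to call bow.DelRequestHeader("Cookie") or reset the cookies for every link.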