gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.34k stars · 1.77k forks

How to scrape web-page data from a predefined HTML source inside the main program source? #718

Closed rilysh closed 1 year ago

rilysh commented 2 years ago

I'm trying to scrape https://dnsdumpster.com, and to scrape the exact values I need to pass a few headers with the request, but colly doesn't seem to support custom headers yet; the only way I can see is the Visit() function, where I need to pass a URL. Is there any way to scrape a page from within the main source code file?

For instance

package main

import "github.com/gocolly/colly"

func main() {
    page := "<title>hello</title>"
    c := colly.NewCollector()
    _ = page // how can the collector parse this string directly?
    _ = c
}

Here, instead of using the c.Visit() function (after downloading the HTML file locally), how can I get the title text from the page variable?

The only way I've seen is the one I mentioned above: download the whole page locally and scrape it via a file://[path] URL. But in my opinion that's a bad idea, since the program may not have the write permission required to save the HTML file to disk. Alternatively, I could upload the downloaded HTML file somewhere else and then request that URL, but downloading and uploading again and again would slow everything down. Is there a way to resolve this?

jeffthomasweb commented 2 years ago

Hi, you may find this example helpful: https://github.com/gocolly/colly/blob/master/_examples/local_files/local_files.go. The example scrapes data from a local file, and you could modify one of the c.OnHTML callbacks to get the text of a title element.

rilysh commented 2 years ago

Sorry, but I asked how to do the exact same thing without downloading the file locally in my previous message.

MTariq99 commented 1 year ago

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
)

func main() {
    page := `<html>
        <head>
            <title>hello</title>
        </head>
        <body>
            <h1>Hello, World!</h1>
        </body>
    </html>`

    c := colly.NewCollector()

    c.OnHTML("title", func(e *colly.HTMLElement) {
        title := strings.TrimSpace(e.Text)
        fmt.Println("Title:", title)
    })

    err := c.Visit("http://example.com")
    if err != nil {
        fmt.Println("Error visiting:", err)
    }
    _ = page // note: page itself is never actually parsed here
}

This way, you can scrape the page directly without having to download it or upload it somewhere else. It's important to note that this approach is suitable for small pages or pages whose static content fits in a variable. If you're dealing with larger pages or dynamic content, a headless browser with a Go wrapper such as chromedp might be more appropriate.

rilysh commented 1 year ago


@MTariq99 I think at the time I didn't clearly state that I want to parse the page on the fly (directly from the webpage URL) and get the value from the title element. Using something like a headless browser wrapping Chromium sounds pretty "heavy" just for a command-line application.

Again, you've done the same thing the example shows. Also, I posted this question more than a year ago; I likely forgot to close the issue.