Closed: rilysh closed this issue 1 year ago.
Hi, you may find this example helpful: https://github.com/gocolly/colly/blob/master/_examples/local_files/local_files.go. The example scrapes data from a local file, and you could modify one of the c.OnHTML callbacks to get the text of the title element.
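For concreteness, here is a minimal sketch of what that modification could look like, roughly following the file:// transport pattern from the linked example; the ./index.html file name is a placeholder, not something from this thread:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Serve local files through the file:// protocol, as the local_files example does.
	t := &http.Transport{}
	t.RegisterProtocol("file", http.NewFileTransport(http.Dir(".")))

	c := colly.NewCollector()
	c.WithTransport(t)

	// Print the text of the <title> element.
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Title:", strings.TrimSpace(e.Text))
	})

	// "./index.html" is a placeholder for whatever file was saved locally.
	if err := c.Visit("file://./index.html"); err != nil {
		fmt.Println("Error visiting:", err)
	}
}
```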
Sorry, but I wrote "how to do the exact same thing without downloading the file locally" in my previous message.
```go
package main

import (
	"fmt"
	"strings"

	"github.com/gocolly/colly/v2"
)

func main() {
	page := `<html>
<head>
	<title>hello</title>
</head>
<body>
	<h1>Hello, World!</h1>
</body>
</html>`
	// Note: page is only declared here; the collector below never parses it
	// and visits http://example.com instead.
	_ = page

	c := colly.NewCollector()

	c.OnHTML("title", func(e *colly.HTMLElement) {
		title := strings.TrimSpace(e.Text)
		fmt.Println("Title:", title)
	})

	err := c.Visit("http://example.com")
	if err != nil {
		fmt.Println("Error visiting:", err)
	}
}
```

This way, you can scrape the page directly without having to download it or upload it somewhere else. Note that this approach is only suitable for small pages or static content held in a variable. If you're dealing with larger pages or dynamic content, a headless browser driven through a Go wrapper such as chromedp may be more appropriate.
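Since chromedp comes up here, this is a rough sketch of what the headless-browser route could look like; it is not from the thread, and the URL is a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless Chrome session.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Navigate to the (placeholder) page and read the document title.
	var title string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"),
		chromedp.Title(&title),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("Title:", title)
}
```

As the reply below notes, this pulls in a full Chrome/Chromium dependency, which is a lot of weight for a small command line tool.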
@MTariq99 I think at the time I didn't clearly state that I want to parse the page on the fly (directly from the webpage URL) and get the value of the title element. Using something like a headless browser wrapping Chromium sounds pretty "heavy" for just a command line application.
Again, you've done the same thing the example shows. Also, I posted this question more than a year ago; I likely just forgot to close the issue.
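For completeness, one workaround that is not suggested anywhere in this thread: hand colly an in-memory HTML string through a custom http.RoundTripper, so nothing is written to disk and no browser is needed. This is only a sketch; the page contents and the dummy URL are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"

	"github.com/gocolly/colly/v2"
)

// stringTransport answers every request with the same in-memory HTML.
type stringTransport struct {
	html string
}

func (t *stringTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Proto:      "HTTP/1.1",
		ProtoMajor: 1,
		ProtoMinor: 1,
		Header:     http.Header{"Content-Type": []string{"text/html; charset=utf-8"}},
		Body:       io.NopCloser(strings.NewReader(t.html)),
		Request:    req,
	}, nil
}

func main() {
	page := `<html><head><title>hello</title></head><body><h1>Hello, World!</h1></body></html>`

	c := colly.NewCollector()
	// All visits now resolve to the in-memory page instead of the network.
	c.WithTransport(&stringTransport{html: page})

	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Title:", strings.TrimSpace(e.Text))
	})

	// The URL is only a dummy key; the custom transport ignores it.
	if err := c.Visit("http://in-memory.invalid/"); err != nil {
		fmt.Println("Error visiting:", err)
	}
}
```

The same trick is commonly used to test scrapers against canned HTML, since colly still runs its normal parsing and callbacks.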
I'm trying to scrape https://dnsdumpster.com, and to scrape the exact values I need to pass a few headers with the request, but colly doesn't seem to support custom headers yet; the only way I can see is the Visit() function, where I just pass the URL. Is there any way to scrape a page from within the main source file itself? For instance, instead of using the c.Visit() function (after downloading the HTML file locally), how can I get the title text from the page variable? The only way I found is to download the whole page locally, as I mentioned above, and scrape it via file://[path], but that looks like a bad idea to me, since the program may not have the write permission required to write the HTML file to disk. Alternatively, I could upload the downloaded HTML file somewhere else and request that URL, but that would slow everything down, downloading and uploading again and again. Is there a way to resolve this?
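On the custom-header point in the question above: colly does allow setting request headers from the OnRequest callback before a Visit. A minimal sketch, with placeholder header names and values rather than whatever dnsdumpster.com actually expects:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Attach headers to every outgoing request. These names and values
	// are placeholders, not the real headers dnsdumpster.com requires.
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("Referer", "https://dnsdumpster.com/")
		r.Headers.Set("X-Example-Token", "placeholder-value")
	})

	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Title:", strings.TrimSpace(e.Text))
	})

	if err := c.Visit("https://dnsdumpster.com/"); err != nil {
		fmt.Println("Error visiting:", err)
	}
}
```

If per-request headers are needed instead of a collector-wide hook, Collector.Request also takes an http.Header argument directly.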