headzoo / surf

Stateful programmatic web browsing in Go.
MIT License

New method: webpage "as user would see" - like htmlunit page.asNormalizedText() #128

Closed. BigB84 closed this issue 3 years ago.

BigB84 commented 3 years ago

Hi, I'm writing an app that needs to read a webpage as a user would see it, then save it to a file (without interaction). Here's a webpage I need to process.

I've read the docs and tried to do it with bow.Body(), but I get the HTML source, with tags like <pre> and <p>, so reading it with bufio produces a mess. Of course I could post-process it by removing everything that starts with < and so on, but it's a lot of code to cover all scenarios.

I've done it once in Java with HtmlUnit's page.asNormalizedText(), and in Python with Selenium. (I know there's Selenium for Go, but I'd rather avoid the additional WebDriver config etc.; that's also why I use your library :))

Do you think it'd be good to add such a feature? Or, if you don't think it's a good idea, could you help me find another solution? Thanks in advance.

headzoo commented 3 years ago

You could try using a specific CSS selector instead of bow.Body(). For instance:

bow.Dom().Find("body p pre").Each(func(_ int, s *goquery.Selection) {
    fmt.Println(s.Text())
})

That should give you the text inside of the inner <pre> tag.
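If the whole page is wanted as text rather than one element, the tag-stripping post-processing mentioned in the question can be roughly sketched with the standard library alone. This is not a surf or goquery feature: `visibleText`, `skipped`, and `lineBreak` are made-up names, the two tag lists are assumptions and far from exhaustive, and `encoding/xml` in its lenient mode only copes with fairly tidy markup, so treat it as a starting point rather than an equivalent of `asNormalizedText()`.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// skipped lists elements whose text is not rendered on the page
// (assumed list, not exhaustive).
var skipped = map[string]bool{"head": true, "script": true, "style": true}

// lineBreak lists elements after which a newline is emitted
// (assumed list, not exhaustive).
var lineBreak = map[string]bool{
	"p": true, "pre": true, "div": true, "br": true,
	"li": true, "tr": true, "h1": true, "h2": true, "h3": true,
}

// visibleText is a hypothetical helper: it walks the markup token by
// token and keeps only character data outside the skipped elements.
func visibleText(page string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(page))
	// Lenient settings that the encoding/xml docs suggest for typical HTML.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var sb strings.Builder
	depth := 0 // number of skipped elements currently open
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return strings.TrimSpace(sb.String()), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if skipped[strings.ToLower(t.Name.Local)] {
				depth++
			}
		case xml.EndElement:
			name := strings.ToLower(t.Name.Local)
			if skipped[name] {
				if depth > 0 {
					depth--
				}
			} else if depth == 0 && lineBreak[name] {
				sb.WriteByte('\n')
			}
		case xml.CharData:
			if depth == 0 {
				sb.Write(t) // entities such as &amp; are already decoded
			}
		}
	}
}

func main() {
	const page = `<html><head><title>Demo</title><style>p{color:red}</style></head>` +
		`<body><p>Hello &amp; welcome</p><pre>line one
line two</pre><script>var x = 1;</script></body></html>`

	text, err := visibleText(page)
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

With surf, the string returned by bow.Body() could be passed to such a helper; pages with untidy markup or scripts containing `<` will likely need a real HTML parser such as the one behind bow.Dom().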

BigB84 commented 3 years ago

Thanks! :)