gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Extract JS Code (Not execute) #643

Open pdavis156879 opened 2 years ago

pdavis156879 commented 2 years ago

I'm trying to locate JavaScript code within an HTML page. Colly is not a headless browser, so JS execution is not a feature, but I don't need to execute the code: I only need to locate a subset of it (or even a set of strings) based on names and similar features.

Any chance anyone stumbled upon this?

Mxrk commented 2 years ago

Hey, maybe this part here helps you. I am using this in a tiny project to find an array:

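localCollector.OnHTML("body", func(e *colly.HTMLElement) {
    // Concatenated text of every <script> element under <body>.
    s := e.DOM.Find("script").Text()

    // Match the assignment, then strip the "something.array = " prefix
    // so that only the array literal remains.
    r := regexp.MustCompile(`something\.array\s*=\s*(.+\}])\s*`)
    res := r.FindString(s)
    res = strings.ReplaceAll(res, "something.array = ", "")
...
})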

With that you can freely search in the script context. In my case I can parse it into a struct and use the given array. Not sure if that is exactly what you want.
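A variation on the snippet above, as a minimal sketch using only the standard library: it takes the array literal from a capture group instead of stripping the prefix afterwards, so the result does not depend on the exact spacing around `=`. The names `arrayAssignment`, `extractArray` and `decodeArray` are hypothetical, and the pattern assumes the assignment ends with `];` and that no `];` appears inside a string in the array.

```go
package main

import (
	"encoding/json"
	"regexp"
)

// arrayAssignment matches `something.array = [ ... ];` and captures the
// array literal. (?s) lets `.` span newlines; the lazy `.*?` stops at the
// first `]` that is followed by optional whitespace and `;`.
var arrayAssignment = regexp.MustCompile(`(?s)something\.array\s*=\s*(\[.*?\])\s*;`)

// extractArray returns the raw array literal assigned to something.array,
// or false if the script text contains no such assignment.
func extractArray(script string) (string, bool) {
	m := arrayAssignment.FindStringSubmatch(script)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// decodeArray parses the literal as JSON. This only works when the JS
// literal happens to be valid JSON (quoted keys, no trailing commas).
func decodeArray(literal string) ([]map[string]interface{}, error) {
	var out []map[string]interface{}
	err := json.Unmarshal([]byte(literal), &out)
	return out, err
}
```

Inside an `OnHTML` callback, the string passed to `extractArray` would be the script text (for example the `s` variable from the snippet above). If the literal is real JS rather than JSON, `decodeArray` returns an error and the raw string is all you get.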

kulak commented 2 years ago

My example to find the script element with id __NEXT_DATA__:

c.OnHTML("script#__NEXT_DATA__", func(h *colly.HTMLElement) {
    var js map[string]interface{}
    err := json.Unmarshal([]byte(h.Text), &js)
    if err != nil {
        panic(errors.New("can't parse script#__NEXT_DATA__"))
    }
})

It is tested to work on a script element inside body.

Since there is a functional solution, I think the issue should be closed.

RensTillmann commented 1 year ago

JS != JSON 🚨

kulak commented 1 year ago

I was working with a specific site whose __NEXT_DATA__ was formatted as a JSON object, so I could get at the structured data I was interested in extracting.

Yes, the generic case is JS, but this specific case was about JSON data in JS.
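To make the JS-versus-JSON distinction explicit in code, here is a rough sketch (standard library only) that checks whether a script's text is valid JSON before decoding it, and returns an error instead of panicking when it is not. `parseScriptJSON` and `errNotJSON` are hypothetical names, not part of Colly.

```go
package main

import (
	"encoding/json"
	"errors"
	"strings"
)

// errNotJSON is returned when the script body is JavaScript (or anything
// else) rather than a JSON document.
var errNotJSON = errors.New("script content is not valid JSON")

// parseScriptJSON decodes the text of a <script> element into a map when
// the text is a JSON object, and reports an error otherwise.
func parseScriptJSON(text string) (map[string]interface{}, error) {
	trimmed := strings.TrimSpace(text)
	if !json.Valid([]byte(trimmed)) {
		return nil, errNotJSON
	}
	var data map[string]interface{}
	if err := json.Unmarshal([]byte(trimmed), &data); err != nil {
		// Valid JSON, but not an object (for example a top-level array).
		return nil, err
	}
	return data, nil
}
```

In the `script#__NEXT_DATA__` callback above, this could be called with `h.Text`. For ordinary inline JS it returns `errNotJSON`, at which point a string or regexp search over the text is the fallback.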

RensTillmann commented 1 year ago

I figured it was something like that, but isn't it a bit strange to have a pure JSON object inside a <script> tag? 😛 Anyway, thanks for the info.