gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.39k stars 1.77k forks source link

Issue with "Context" when using "queue" #458

Open BenKnigge opened 4 years ago

BenKnigge commented 4 years ago

The sample program provided will result in a panic

panic: interface conversion: interface {} is map[string]interface {}, not *main.node

The code only results in a panic when using a queue that has a "*node" "put" into the context of the request provided to the queue.

I'm not sure how a '*node' is being converted into a 'map[string]interface'

It's quite possible that I could be doing something wrong in which case I would love someone to let me know what the issue is.

package main

import (
    "bytes"
    "fmt"
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/queue"
    "log"
    "reflect"
)

type node struct {
    url   string
    links []*node
}

type siteTree struct {
    name     string
    rootNode *node
}

func getContextNode(c *colly.Context) *node {
    i := c.GetAny("ParentNode")
    if i != nil {
                // The following line is causing a panic
        // Why is the type of i changing from '*node' to 'map[string]interface'
        fmt.Printf("context 'ParentNode' type : %s\n", reflect.TypeOf(i))
        return i.(*node)

    }
    return nil
}

func collect() *siteTree{
    q, err := queue.New(
        8,
        &queue.InMemoryQueueStorage{MaxSize: 10000},
    )
    if err != nil {
        log.Fatal(err.Error())
    }

    t := siteTree{name: "git hub site tree"}
    site := "https://github.com"
    domain := "github.com"

    c := colly.NewCollector(colly.Async(true))
    c.AllowURLRevisit = false
    c.AllowedDomains = []string{domain}

    c.OnRequest(func(r *colly.Request) {

        pnode := getContextNode(r.Ctx)
        newNode := &node{url: r.URL.String(), links: []*node{}}
        if pnode == nil {
            pnode = newNode
        }
        if t.rootNode == nil {
            //first node
            t.rootNode = newNode
        } else {
            pnode.links = append(pnode.links, newNode)
        }
        r.Ctx.Put("ParentNode", newNode)

        fmt.Println("Visiting", r.URL)

    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {

        // Couldn't get context to work correctly when using a queue
        var b []byte
        newRequest, err := e.Request.New("GET", e.Attr("href"), bytes.NewReader(b))
        pNode := getContextNode(e.Response.Ctx)
        if pNode != nil {
            newRequest.Ctx = colly.NewContext()
            newRequest.Ctx.Put("ParentNode", pNode)
        }
        err = q.AddRequest(newRequest)

        if err != nil {
            log.Printf(err.Error())
        }
    })

    c.OnScraped(func(r *colly.Response) {
        q.Run(c)
    })

    c.Visit(site)
    c.Wait()

    return &t
}

func main() {
    tree := collect()
    fmt.Println(tree)
}
BenKnigge commented 4 years ago

I'm exploring the idea that this may be an issue with the marshaling and marshaling of requests containing a context with a pointer value.

Could be the Marshal() method in request.go

Will likely explore this a bit more tomorrow.

BenKnigge commented 4 years ago

@asciimoo There really isn't any reason that I can think of to have an in memory queue require serialization of the request context.

Would you be open to a PR that would remove serialization from in memory storage?

I view this as a bug.

This would likely require changes to queue and storage.

At the very least I think that the documentation around queue, context, and requests should mention the limitations regarding putting pointers in a queue context.

asciimoo commented 4 years ago

The code only results in a panic when using a queue that has a "*node" "put" into the context of the request provided to the queue. I view this as a bug.

I'd say this is definitely a bug.

Would you be open to a PR that would remove serialization from in memory storage?

As I understand it would require to change the queue interface. What would be the advantage of the change?

BenKnigge commented 4 years ago

The advantage of the change would be that you could use a pointer in a context for an in memory queue. There would also be an improvement in performance for in memory storage as there would be no need to marshal to json and then back to a struct.

I think that you may want to add a couple of structs for SerializableRequest and SerializableContext for use when working with queues with non-memory storage.

Or perhaps create a QueueRequest and QueContext for use with queues.

With GetAny for QueContext taking the type and key of the interface stored in the contextmap and attempting to cast the interface in the map to the type passed in and recovering if the cast results in a panic.

If there is an approach that you would like to follow I'm open to implementing it.

asciimoo commented 4 years ago

The advantage of the change would be that you could use a pointer in a context for an in memory queue.

Correct me if I'm wrong, but I think it isn't possible to change only the inmemory queue, because it is accessed through the Storage interface from the collector.

If there is an approach that you would like to follow I'm open to implementing it.

I'm still thinking about it. The root cause is the wrong unmarshal of your custom node type, maybe there is a less drastic solution than changing the API.

asciimoo commented 4 years ago

As a workaround, you can marshal the node type before adding to the context and unmarshal after getting it or you can convert the returned map[string]interface{} to node type manually.

UPDATE: to be able to convert back map[string]interface{} to node you have to use public members in node to make it serializable.

Another alternative is to change the approach a bit and do not store the whole sitemap tree in the context, store only the parent url and build the tree independently from the collector. E.g. by adding an AddNode(parent, link string) member to siteTree which handles the tree building.

A little note: your current implementation isn't thread safe, there can be simultaneous write to the node.links member as I see.

BenKnigge commented 4 years ago

I can come up with several workarounds but I still think it's a bug.

The fact that the context is being marshaled and un-marshaled when using a queue with in memory storage is not expected behavior.

A little note: your current implementation isn't thread safe, there can be simultaneous write to the node.links member as I see.

There is a lock in each Collector, so that as long as I'm only using one collector and reading or writing within the collectors methods or after the collector has finished I shouldn't have an issue.

Am I mistaken about that?

BenKnigge commented 4 years ago

I think that If two methods were added to InMemoryQueueStorage for adding and requesting a pointer to a request, InMemoryQueueStorage could be fairly simply updated to use a pointer to a request.

This wouldn't break any existing code, no marshaling or unmarshaling would be necessary and would allow for a pointer type to be used in the contextmap for the request for a queue using InMemoryQueueStorage.

Alternatively the two methods could be added to the Storage interface but that would require that they be implemented in any other Storage method that currently exists.

asciimoo commented 4 years ago

There is a lock in each Collector, so that as long as I'm only using one collector and reading or writing within the collectors methods or after the collector has finished I shouldn't have an issue.

Queue runs paralelly in 8 threads, I think there can be parallel execution, but I have to check the code to verify my statement.

I think that If two methods were added to InMemoryQueueStorage for adding and requesting a pointer to a request, InMemoryQueueStorage could be fairly simply updated to use a pointer to a request.

You are right, I like this idea, could you work on this?

This wouldn't break any existing code, no marshaling or unmarshaling would be necessary and would allow for a pointer type to be used in the contextmap for the request for a queue using InMemoryQueueStorage.

Sounds good.

BenKnigge commented 4 years ago

Queue runs paralelly in 8 threads, I think there can be parallel execution, but I have to check the code to verify my statement.

Took a look at the code they are not threads but Goroutines using a single collector. So the lock will work as I thought.

You are right, I like this idea, could you work on this?

Yes, I'm thinking that I should likely put these two methods with the request pointer receiver in the Storage interface and deprecate the existing methods. Ideally the Storage API will remain consistent.

What do you think? What would you like me to name the two methods?

asciimoo commented 4 years ago

Yes, I'm thinking that I should likely put these two methods with the request pointer receiver in the Storage interface and deprecate the existing methods. Ideally the Storage API will remain consistent.

I'd add it only to the in memory storage, because as I see, serialization is unavoidable with other storage backends which persist the data. What do you think?

What do you think? What would you like me to name the two methods?

It's up to you, I'm not the best in naming things.. =) But if you want I can come up with some suggestions.

BenKnigge commented 4 years ago

I'll put something together and we can take a look at the code.

BenKnigge commented 4 years ago

PR has been created. Let me know your thoughts https://github.com/gocolly/colly/pull/460

BenKnigge commented 4 years ago

I've canceled the pr and am making a few more changes.