Open BenKnigge opened 4 years ago
I'm exploring the idea that this may be an issue with the marshaling and marshaling of requests containing a context with a pointer value.
Could be the Marshal() method in request.go
Will likely explore this a bit more tomorrow.
@asciimoo There really isn't any reason that I can think of to have an in memory queue require serialization of the request context.
Would you be open to a PR that would remove serialization from in memory storage?
I view this as a bug.
This would likely require changes to queue and storage.
At the very least I think that the documentation around queue, context, and requests should mention the limitations regarding putting pointers in a queue context.
The code only results in a panic when using a queue that has a "*node" "put" into the context of the request provided to the queue. I view this as a bug.
I'd say this is definitely a bug.
Would you be open to a PR that would remove serialization from in memory storage?
As I understand it would require to change the queue interface. What would be the advantage of the change?
The advantage of the change would be that you could use a pointer in a context for an in memory queue. There would also be an improvement in performance for in memory storage as there would be no need to marshal to json and then back to a struct.
I think that you may want to add a couple of structs for SerializableRequest and SerializableContext for use when working with queues with non-memory storage.
Or perhaps create a QueueRequest and QueContext for use with queues.
With GetAny for QueContext taking the type and key of the interface stored in the contextmap and attempting to cast the interface in the map to the type passed in and recovering if the cast results in a panic.
If there is an approach that you would like to follow I'm open to implementing it.
The advantage of the change would be that you could use a pointer in a context for an in memory queue.
Correct me if I'm wrong, but I think it isn't possible to change only the inmemory queue, because it is accessed through the Storage interface from the collector.
If there is an approach that you would like to follow I'm open to implementing it.
I'm still thinking about it. The root cause is the wrong unmarshal of your custom node
type, maybe there is a less drastic solution than changing the API.
As a workaround, you can marshal the node
type before adding to the context and unmarshal after getting it or you can convert the returned map[string]interface{}
to node
type manually.
UPDATE: to be able to convert back map[string]interface{}
to node
you have to use public members in node
to make it serializable.
Another alternative is to change the approach a bit and do not store the whole sitemap tree in the context, store only the parent url and build the tree independently from the collector. E.g. by adding an AddNode(parent, link string)
member to siteTree
which handles the tree building.
A little note: your current implementation isn't thread safe, there can be simultaneous write to the node.links
member as I see.
I can come up with several workarounds but I still think it's a bug.
The fact that the context is being marshaled and un-marshaled when using a queue with in memory storage is not expected behavior.
A little note: your current implementation isn't thread safe, there can be simultaneous write to the node.links member as I see.
There is a lock in each Collector, so that as long as I'm only using one collector and reading or writing within the collectors methods or after the collector has finished I shouldn't have an issue.
Am I mistaken about that?
I think that If two methods were added to InMemoryQueueStorage for adding and requesting a pointer to a request, InMemoryQueueStorage could be fairly simply updated to use a pointer to a request.
This wouldn't break any existing code, no marshaling or unmarshaling would be necessary and would allow for a pointer type to be used in the contextmap for the request for a queue using InMemoryQueueStorage.
Alternatively the two methods could be added to the Storage interface but that would require that they be implemented in any other Storage method that currently exists.
There is a lock in each Collector, so that as long as I'm only using one collector and reading or writing within the collectors methods or after the collector has finished I shouldn't have an issue.
Queue runs paralelly in 8 threads, I think there can be parallel execution, but I have to check the code to verify my statement.
I think that If two methods were added to InMemoryQueueStorage for adding and requesting a pointer to a request, InMemoryQueueStorage could be fairly simply updated to use a pointer to a request.
You are right, I like this idea, could you work on this?
This wouldn't break any existing code, no marshaling or unmarshaling would be necessary and would allow for a pointer type to be used in the contextmap for the request for a queue using InMemoryQueueStorage.
Sounds good.
Queue runs paralelly in 8 threads, I think there can be parallel execution, but I have to check the code to verify my statement.
Took a look at the code they are not threads but Goroutines using a single collector. So the lock will work as I thought.
You are right, I like this idea, could you work on this?
Yes, I'm thinking that I should likely put these two methods with the request pointer receiver in the Storage interface and deprecate the existing methods. Ideally the Storage API will remain consistent.
What do you think? What would you like me to name the two methods?
Yes, I'm thinking that I should likely put these two methods with the request pointer receiver in the Storage interface and deprecate the existing methods. Ideally the Storage API will remain consistent.
I'd add it only to the in memory storage, because as I see, serialization is unavoidable with other storage backends which persist the data. What do you think?
What do you think? What would you like me to name the two methods?
It's up to you, I'm not the best in naming things.. =) But if you want I can come up with some suggestions.
I'll put something together and we can take a look at the code.
PR has been created. Let me know your thoughts https://github.com/gocolly/colly/pull/460
I've canceled the pr and am making a few more changes.
The sample program provided will result in a panic
The code only results in a panic when using a queue that has a "*node" "put" into the context of the request provided to the queue.
I'm not sure how a '*node' is being converted into a 'map[string]interface'
It's quite possible that I could be doing something wrong in which case I would love someone to let me know what the issue is.