shriphani / pegasus

:racehorse:✈️ Pegasus is a scalable, modular, polite web-crawler for Clojure
http://getpegasus.io
Eclipse Public License 1.0

Nondeterministic hanging on enqueue-url #22

Closed by dhruvbhatia 8 years ago

dhruvbhatia commented 8 years ago

Hi @shriphani,

On the latest master, it looks like the examples from README.md sometimes hang in the queue/enqueue-url fn.

I think this may have something to do with the XML parser choking on your RSS feed (I see you've added (catch Exception e nil) blocks in the examples) or the latest writer fixes you had mentioned. Below is my console output using either of the example code blocks when the hang occurs:

16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:52] - [[:frontier java.lang.String 5] [:extractor {:url java.lang.String, :body java.lang.String, :time Int} 5] [:update-state {:url java.lang.String, :body java.lang.String, :time Int, :extracted [java.lang.String]} 5] [:filter {:url java.lang.String, :body java.lang.String, :time Int, :extracted [java.lang.String]} 5] [:writer {:url java.lang.String, :body java.lang.String, :time Int, :extracted [java.lang.String]} 5] [:enqueue {:url java.lang.String, :body java.lang.String, :time Int, :extracted [java.lang.String]} 5] [:update-stats {:url java.lang.String, :body java.lang.String, :time Int, :extracted [java.lang.String]} 5] [:test-and-halt Any 5]]
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :frontier
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :extractor
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :update-state
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :filter
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :writer
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :enqueue
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :update-stats
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.process:59] - :current-component :test-and-halt
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.core:77] - :starting-crawl
16-05-31 06:58:33 MacBook-Pro INFO [pegasus.queue:91] - :enqueue http://blog.shriphani.com/feeds/all.rss.xml
(it hangs here)

Note that this only happens some of the time, so it may be related to network issues on my end. I'll keep exploring!

Edit: This also leads me to ask - what are your views on embedding examples under an ./examples directory within the pegasus project?

shriphani commented 8 years ago

Ah, so if a crawl already ran once and its corpus and data structures are still on disk, it won't proceed any further (since it can't find a new URL). If you point it at a different destination or nuke the file path, you should be OK.
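
For anyone who wants to script the "nuke the file path" option, here is a minimal sketch. It assumes the previous crawl's state lives under a single directory; the helper name `delete-recursively!` and the path `/tmp/pegasus-example-corpus` are hypothetical and not part of pegasus. Only `clojure.java.io` and core functions are used.

```clojure
(require '[clojure.java.io :as io])

(defn delete-recursively!
  "Deletes the file or directory at `path` and everything beneath it.
  Returns true if something was deleted, false if `path` did not exist."
  [path]
  (let [root (io/file path)]
    (if (.exists root)
      (do
        ;; file-seq lists a directory before its contents, so reverse it
        ;; to delete children before their parent directories.
        (doseq [f (reverse (file-seq root))]
          (io/delete-file f))
        true)
      false)))

;; Hypothetical location of an earlier crawl's on-disk state.
(delete-recursively! "/tmp/pegasus-example-corpus")
```

Run this before starting the crawl so that the seed URL is no longer treated as already seen. Pointing the crawl at a fresh destination instead avoids deleting anything.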

shriphani commented 8 years ago

This probably needs a flag + log line.

dhruvbhatia commented 8 years ago

@shriphani Great, thanks for clarifying - yep, it was the caching at play!

One question: what if someone wanted to cache-bust and monitor, say, a dynamic Product Listing page (which changes unpredictably)? Would it make sense to add a config parameter that forces pegasus to always scrape and timestamp corpora, or is this better handled by giving the destination URL some kind of unique query parameter so that pegasus treats it as a new page and scrapes it?
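
The query-parameter idea could look roughly like the sketch below. The function name `cache-bust-url` and the parameter name `_ts` are made up for illustration, and the approach only helps if the crawler treats URLs with different query strings as distinct pages, which is an assumption here. URLs containing a `#fragment` are not handled.

```clojure
(defn cache-bust-url
  "Appends a timestamp query parameter to `url` so that each call with a
  different `ts` yields a distinct URL string."
  [url ts]
  ;; Use "&" when the URL already has a query string, "?" otherwise.
  (let [sep (if (.contains ^String url "?") "&" "?")]
    (str url sep "_ts=" ts)))

;; Typical use: stamp the seed with the current time in milliseconds.
(cache-bust-url "http://example.com/products" (System/currentTimeMillis))
```

The downside of this route is that every run stores the page under a new URL, so the corpus grows with each crawl and the server must ignore the extra parameter.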

shriphani commented 8 years ago

Yes, I think the config should accept a parameter and flush all the caches.

I'm going to merge this with #25