medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0
328 stars · 59 forks

[install] [docker] crawler exception when creating project/corpus #143

Closed pierrejdlf closed 9 years ago

pierrejdlf commented 9 years ago

Bravo for the packaging work done through Docker! I just succeeded in installing it on my OS X 10.8.5 (very latest master version from GitHub as of today).

The frontend seems to work, but when I create a new project, the scrapyd service crashes.

I tried to create another project, but I keep getting the same exceptions.KeyError: 'hyphe.newproject', exceptions.KeyError: 'hyphe.otherproject', etc. from the crawler service.

Here is the traceback:

backend_1  | 2015-07-10 11:05:10+0000 [INFO - newproject] New corpus created
mongo_1    | 2015-07-10T11:05:10.407+0000 I INDEX    [conn2] build index on: hyphe.newproject.pages properties: { v: 1, key: { timestamp: 1 }, name: "timestamp_1", ns: "hyphe.newproject.pages", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.407+0000 I INDEX    [conn2] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.410+0000 I INDEX    [conn3] build index on: hyphe.newproject.pages properties: { v: 1, key: { _job: 1 }, name: "_job_1", ns: "hyphe.newproject.pages", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.410+0000 I INDEX    [conn3] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.413+0000 I INDEX    [conn4] build index on: hyphe.newproject.pages properties: { v: 1, key: { _job: 1, forgotten: 1 }, name: "_job_1_forgotten_1", ns: "hyphe.newproject.pages", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.414+0000 I INDEX    [conn4] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.416+0000 I INDEX    [conn5] build index on: hyphe.newproject.pages properties: { v: 1, key: { url: 1 }, name: "url_1", ns: "hyphe.newproject.pages", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.416+0000 I INDEX    [conn5] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.419+0000 I INDEX    [conn6] build index on: hyphe.newproject.queue properties: { v: 1, key: { timestamp: 1 }, name: "timestamp_1", ns: "hyphe.newproject.queue", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.419+0000 I INDEX    [conn6] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.421+0000 I INDEX    [conn7] build index on: hyphe.newproject.queue properties: { v: 1, key: { _job: 1, timestamp: -1 }, name: "_job_1_timestamp_-1", ns: "hyphe.newproject.queue", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.422+0000 I INDEX    [conn7] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.424+0000 I INDEX    [conn8] build index on: hyphe.newproject.logs properties: { v: 1, key: { timestamp: 1 }, name: "timestamp_1", ns: "hyphe.newproject.logs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.424+0000 I INDEX    [conn8] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.426+0000 I INDEX    [conn10] build index on: hyphe.newproject.jobs properties: { v: 1, key: { crawling_status: 1 }, name: "crawling_status_1", ns: "hyphe.newproject.jobs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.426+0000 I INDEX    [conn10] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.427+0000 I INDEX    [conn9] build index on: hyphe.newproject.jobs properties: { v: 1, key: { indexing_status: 1 }, name: "indexing_status_1", ns: "hyphe.newproject.jobs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.427+0000 I INDEX    [conn9] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.429+0000 I INDEX    [conn11] build index on: hyphe.newproject.jobs properties: { v: 1, key: { webentity_id: 1 }, name: "webentity_id_1", ns: "hyphe.newproject.jobs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.429+0000 I INDEX    [conn11] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.431+0000 I INDEX    [conn2] build index on: hyphe.newproject.jobs properties: { v: 1, key: { webentity_id: 1, created_at: 1 }, name: "webentity_id_1_created_at_1", ns: "hyphe.newproject.jobs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.431+0000 I INDEX    [conn2] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.433+0000 I INDEX    [conn3] build index on: hyphe.newproject.jobs properties: { v: 1, key: { webentity_id: 1, created_at: -1 }, name: "webentity_id_1_created_at_-1", ns: "hyphe.newproject.jobs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.433+0000 I INDEX    [conn3] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.435+0000 I INDEX    [conn4] build index on: hyphe.newproject.jobs properties: { v: 1, key: { webentity_id: 1, crawling_status: 1, indexing_status: 1, created_at: 1 }, name: "webentity_id_1_crawling_status_1_indexing_status_1_created_at_1", ns: "hyphe.newproject.jobs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.435+0000 I INDEX    [conn4] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.436+0000 I INDEX    [conn5] build index on: hyphe.newproject.jobs properties: { v: 1, key: { crawling_status: 1, indexing_status: 1, created_at: 1 }, name: "crawling_status_1_indexing_status_1_created_at_1", ns: "hyphe.newproject.jobs", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.436+0000 I INDEX    [conn5] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.438+0000 I INDEX    [conn6] build index on: hyphe.newproject.stats properties: { v: 1, key: { timestamp: 1 }, name: "timestamp_1", ns: "hyphe.newproject.stats", safe: true, background: true }
mongo_1    | 2015-07-10T11:05:10.438+0000 I INDEX    [conn6] build index done.  scanned 0 total records. 0 secs
mongo_1    | 2015-07-10T11:05:10.590+0000 I NETWORK  [initandlisten] connection accepted from 172.17.0.11:58374 #24 (21 connections now open)
mongo_1    | 2015-07-10T11:05:10.594+0000 I NETWORK  [conn24] end connection 172.17.0.11:58374 (20 connections now open)
mongo_1    | 2015-07-10T11:05:10.595+0000 I NETWORK  [initandlisten] connection accepted from 172.17.0.11:58375 #25 (21 connections now open)
mongo_1    | 2015-07-10T11:05:10.601+0000 I NETWORK  [conn25] end connection 172.17.0.11:58375 (20 connections now open)
backend_1  | 2015-07-10 11:05:11+0000 [-] "192.168.59.3" - - [10/Jul/2015:11:05:11 +0000] "POST /hyphe-api/ HTTP/1.1" 200 247 "http://192.168.59.103:8000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0"
backend_1  | 2015-07-10 11:05:11+0000 [-] "192.168.59.3" - - [10/Jul/2015:11:05:11 +0000] "POST /hyphe-api/ HTTP/1.1" 200 540 "http://192.168.59.103:8000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0"
crawler_1  | 2015-07-10 11:05:11+0000 [-] "172.17.0.11" - - [10/Jul/2015:11:05:10 +0000] "GET /listprojects.json HTTP/1.0" 200 62 "-" "Twisted PageGetter"
backend_1  | 2015-07-10 11:05:11+0000 [ERROR - newproject] Couldn't deploy crawler
backend_1  | 2015-07-10 11:05:11+0000 [INFO - newproject] Starting corpus...
backend_1  | 2015-07-10 11:05:11+0000 [INFO - newproject] Starting MemoryStructure on port 13509 with 256Mo ram for at least 1800s (2304Mo ram and 9 ports left)
backend_1  | 2015-07-10 11:05:12+0000 [INFO - newproject] MemoryStructure ready
backend_1  | 2015-07-10 11:05:12+0000 [INFO - newproject] Saves default WE creation rule
backend_1  | 2015-07-10 11:05:13+0000 [-] "192.168.59.3" - - [10/Jul/2015:11:05:12 +0000] "POST /hyphe-api/ HTTP/1.1" 200 246 "http://192.168.59.103:8000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0"
backend_1  | 2015-07-10 11:05:13+0000 [-] "192.168.59.3" - - [10/Jul/2015:11:05:12 +0000] "POST /hyphe-api/ HTTP/1.1" 200 501 "http://192.168.59.103:8000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0"
backend_1  | 2015-07-10 11:05:15+0000 [-] "192.168.59.3" - - [10/Jul/2015:11:05:14 +0000] "POST /hyphe-api/ HTTP/1.1" 200 246 "http://192.168.59.103:8000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0"
backend_1  | 2015-07-10 11:05:15+0000 [-] "192.168.59.3" - - [10/Jul/2015:11:05:14 +0000] "POST /hyphe-api/ HTTP/1.1" 200 501 "http://192.168.59.103:8000/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0"
crawler_1  | 2015-07-10 11:05:17+0000 [HTTPChannel,200,172.17.0.11] Unhandled Error
crawler_1  |    Traceback (most recent call last):
crawler_1  |      File "/usr/local/lib/python2.7/dist-packages/twisted/web/http.py", line 1755, in allContentReceived
crawler_1  |        req.requestReceived(command, path, version)
crawler_1  |      File "/usr/local/lib/python2.7/dist-packages/twisted/web/http.py", line 823, in requestReceived
crawler_1  |        self.process()
crawler_1  |      File "/usr/local/lib/python2.7/dist-packages/twisted/web/server.py", line 189, in process
crawler_1  |        self.render(resrc)
crawler_1  |      File "/usr/local/lib/python2.7/dist-packages/twisted/web/server.py", line 238, in render
crawler_1  |        body = resrc.render(self)
crawler_1  |    --- <exception caught here> ---
crawler_1  |      File "/usr/lib/pymodules/python2.7/scrapyd/webservice.py", line 17, in render
crawler_1  |        return JsonResource.render(self, txrequest)
crawler_1  |      File "/usr/lib/pymodules/python2.7/scrapyd/utils.py", line 19, in render
crawler_1  |        r = resource.Resource.render(self, txrequest)
crawler_1  |      File "/usr/local/lib/python2.7/dist-packages/twisted/web/resource.py", line 250, in render
crawler_1  |        return m(request)
crawler_1  |      File "/usr/lib/pymodules/python2.7/scrapyd/webservice.py", line 101, in render_GET
crawler_1  |        queue = self.root.poller.queues[project]
crawler_1  |    exceptions.KeyError: 'hyphe.newproject'
crawler_1  |    
crawler_1  | 2015-07-10 11:05:17+0000 [-] "172.17.0.11" - - [10/Jul/2015:11:05:16 +0000] "GET /listjobs.json?project=hyphe.newproject HTTP/1.0" 200 82 "-" "Twisted PageGetter"
backend_1  | 2015-07-10 11:05:17+0000 [WARNING - newproject] Problem dialoguing with scrapyd server: {u'message': u"'hyphe.newproject'", 'code': 'fail', u'node_name': u'9f95a6814eb2'}
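The failing lookup in the traceback (queue = self.root.poller.queues[project]) points back to the earlier "Couldn't deploy crawler" error: scrapyd keeps one poller queue per successfully deployed project, so if the egg deployment failed, the 'hyphe.newproject' key never exists. A minimal sketch of that failure mode (the dict and function here are illustrative stand-ins, not scrapyd's actual internals):

```python
# Stand-in for scrapyd's poller: one queue per deployed project.
queues = {}  # empty, because "Couldn't deploy crawler" aborted the deploy

def list_jobs(project):
    # scrapyd's listjobs handler performs the equivalent of this lookup
    return queues[project]

try:
    list_jobs("hyphe.newproject")
except KeyError as err:
    print("KeyError:", err)  # mirrors the final line of the traceback
```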
boogheta commented 9 years ago

It looks a lot like #139. Can you try changing "scrapy", "deploy" into "scrapyd-deploy" at line 93 of hyphe_backend/crawler/deploy.py, then clean up your prebuilt Docker image and rebuild it?
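For reference, the edit amounts to a one-string substitution. This sketch demonstrates it on a stand-in file, since the real path only exists inside a Hyphe checkout (there, the target would be hyphe_backend/crawler/deploy.py):

```python
# Demonstrate the suggested substitution on a stand-in file; in a
# real checkout the target would be hyphe_backend/crawler/deploy.py.
from pathlib import Path

demo = Path("deploy_demo.py")
demo.write_text('cmd = ["scrapy", "deploy"]\n')  # old two-token invocation

# Replace it with the standalone scrapyd-deploy command.
patched = demo.read_text().replace('"scrapy", "deploy"', '"scrapyd-deploy"')
demo.write_text(patched)
print(demo.read_text())  # the patched line, using scrapyd-deploy
```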

PS @oncletom would you know how to make sure Docker installs scrapyd-0.17 & scrapy-0.24 and not later versions, until I make Hyphe compatible with multiple Scrapy versions?

pierrejdlf commented 9 years ago

Thanks, but using scrapyd-deploy doesn't remove the exception. Logging into the hyphemaster_crawler docker image gives:

# scrapy --version
Scrapy 0.18.0 - no active project
# scrapyd --version
twistd (the Twisted daemon) 15.2.1

Not sure what to test next.

boogheta commented 9 years ago

Mmm, not sure either.

Can you try to check/paste the content of the corpus collection in Hyphe's MongoDB? And can you check scrapyd's local pages (localhost:6800; it should list projects, jobs, etc.)?
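For the scrapyd side, a small probe against its JSON API can confirm whether the project was ever deployed; listprojects.json is the same endpoint already visible in the crawler logs above. This sketch (function name and default port are assumptions from this thread) returns None when scrapyd is unreachable:

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def list_projects(base="http://localhost:6800"):
    """Return scrapyd's deployed project names, or None if unreachable."""
    try:
        with urlopen(base + "/listprojects.json", timeout=3) as resp:
            return json.load(resp).get("projects")
    except (URLError, OSError, ValueError):
        return None

# A result missing 'hyphe.newproject' would confirm the failed deploy.
print(list_projects())
```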

thom4parisot commented 9 years ago

Scrapy is installed from these locations:

So maybe the backend (first link) installs a version which is way too recent? It should just be a matter of pinning Scrapy to the correct version (pip install Scrapy -> pip install Scrapy==0.24, or something like that).
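Pinning could be done with a requirements fragment like the one below, which holds Scrapy on the 0.24 series mentioned in the thread (a sketch; the exact point release to pin is an assumption):

```
# requirements.txt fragment: keep Scrapy on the 0.24 series until
# Hyphe supports newer releases (range pin; exact release unspecified)
Scrapy>=0.24,<0.25
```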

boogheta commented 6 years ago

Hey @pierrejdlf, FYI, we finally released a new version with a more generic Docker installation process, which should now let you install Hyphe easily.