ptwobrussell / Mining-the-Social-Web

The official online compendium for Mining the Social Web (O'Reilly, 2011)
http://bit.ly/135dHfs

the_tweet__count_entities_in_tweets.py does not work (example 5-4). #14

Closed onozka closed 12 years ago

onozka commented 12 years ago

Hi. I have a problem again. Sorry...

When I invoke the script ($ python the_tweet__count_entities_in_tweets.py tweets-user-timeline-onozka 5), an error occurs. The error message is below.

Please tell me how to solve this error. (Both example 5-2 and 5-3 worked correctly!)

And please tell me what [(row.key, row.value) for row in db.view('index/entity_count_by_doc', group=True)] means.

    Traceback (most recent call last):
      File "the_tweet__count_entities_in_tweets.py", line 83, in <module>
        entities_freqs = sorted([(row.key, row.value) for row in db.view('index/entity_count_by_doc', group=True)], key=lambda x: x[1], reverse=True)
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 984, in __iter__
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 1003, in rows
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 990, in _fetch
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 880, in _exec
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 393, in get_json
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 374, in get
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 419, in _request
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 310, in request
    couchdb.http.ServerError: (500, (u'EXIT', u'{{badmatch,[]},\n [{couch_query_servers,new_process,3},\n {couch_query_servers,lang_proc,3},\n {couch_query_servers,handle_call,3},\n {gen_server,handle_msg,5},\n {proc_lib,init_p_do_apply,3}]}'))

Sorry... X(

onozka commented 12 years ago

I realized that I had installed Couchbase Single Server, not CouchDB. I'm installing CouchDB now, and I'll retry the script once the installation has finished.

onozka commented 12 years ago

I've installed CouchDB and run the script, but the same error as above occurs... X( Please tell me how to solve it...

ptwobrussell commented 12 years ago

Unfortunately, the error message you're providing isn't giving me any obvious clues as to what's going on. Can you try working in a Python interpreter and poking around a bit? For example, you might work interactively (copy and paste to save time, obviously) up to line 83, the line that fails for you, and then, instead of executing it as written, try decomposing that list comprehension expression step by step. That is, try executing db.view('index/entity_count_by_doc', group=True) and trying some debug strategies from there.
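One way to rehearse this decomposition without a running CouchDB is to work through the same steps on stand-in data. The sketch below is only an illustration: the Row tuple and its sample values are hypothetical, standing in for the rows that db.view('index/entity_count_by_doc', group=True) would return. With group=True, CouchDB applies the reduce function per distinct key, so each row should pair one entity with its count. The sketch is written for Python 3, unlike the Python 2 session in this thread.

```python
from collections import namedtuple

# Hypothetical stand-in for the rows of a grouped view result:
# each row exposes .key (the entity) and .value (its reduced count).
Row = namedtuple("Row", ["key", "value"])
rows = [
    Row("#python", 4),
    Row("@example_user", 9),
    Row("http://example.com", 2),
]

# Step 1: look at the raw rows one at a time.
for row in rows:
    print(row.key, row.value)

# Step 2: the list comprehension builds (entity, count) tuples.
pairs = [(row.key, row.value) for row in rows]

# Step 3: sorted(...) orders the tuples by count, largest first.
entities_freqs = sorted(pairs, key=lambda x: x[1], reverse=True)
print(entities_freqs)
```

If step 1 already fails against the real database, the problem lies in fetching the view from the server rather than in the comprehension or the sort.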

onozka commented 12 years ago

Hi, Matthew. Thank you for replying. I tried working in a Python interpreter. Here is some of what I ran.

    >>> import sys
    >>> import couchdb
    >>> from couchdb.design import ViewDefinition
    >>> from prettytable import PrettyTable
    >>> DB = 'tweets-user-timeline-onozka'
    >>> server = couchdb.Server('http://localhost:5984')
    >>> db = server[DB]
    >>> FREQ_THRESHOLD = 3

    >>> def entityCountMapper(doc):
    ...     if not doc.get('entities'):
    ...         import twitter_text
    ...         def getEntities(tweet):
    ...             extractor = twitter_text.Extractor(tweet['text'])
    ...             entities = {}
    ...             entities['user_mentions'] = []
    ...             for um in extractor.extract_mentioned_screen_names_with_indices():
    ...                 entities['user_mentions'].append(um)
    ...             entities['hashtags'] = []
    ...             for ht in extractor.extract_hashtags_with_indices():
    ...                 ht['text'] = ht['hashtag']
    ...                 del ht['hashtag']
    ...                 entities['hashtags'].append(ht)
    ...             entities['urls'] = []
    ...             for url in extractor.extract_urls_with_indices():
    ...                 entities['urls'].append(url)
    ...             return entities
    ...         doc['entities'] = getEntities(doc)
    ...     if doc['entities'].get('user_mentions'):
    ...         for user_mention in doc['entities']['user_mentions']:
    ...             yield('@' + user_mention['screen_name'].lower(), [doc['_id'], doc['id']])
    ...     if doc['entities'].get('hashtags'):
    ...         for hashtag in doc['entities']['hashtags']:
    ...             yield('#' + hashtag['text'], [doc['_id'], doc['id']])
    ...     if doc['entities'].get('urls'):
    ...         for url in doc['entities']['urls']:
    ...             yield(url['url'], [doc['_id'], doc['id']])
    ...

    >>> def summingReducer(keys, values, rereduce):
    ...     if rereduce:
    ...         return sum(values)
    ...     else:
    ...         return len(values)
    ...
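For readers wondering why the reducer has two branches, here is a rough sketch of how the two calls fit together. It assumes CouchDB may reduce the emitted values for a key in several batches and then combine the partial results with rereduce set to True. The batch contents and the way they are split are hypothetical, since the actual batching is internal to CouchDB. The code is Python 3.

```python
def summing_reducer(keys, values, rereduce):
    # Same logic as summingReducer in the session above.
    if rereduce:
        # values holds partial counts from earlier reduce calls.
        return sum(values)
    # values holds the raw values emitted by the map function.
    return len(values)


# Hypothetical emitted values ([_id, tweet id] pairs) for one entity,
# split into two batches purely for illustration.
batch_a = [["doc-1", 101], ["doc-2", 102]]
batch_b = [["doc-3", 103]]

# First pass: count the raw values in each batch.
partial_counts = [
    summing_reducer(None, batch_a, False),
    summing_reducer(None, batch_b, False),
]

# Second pass: add up the partial counts.
total = summing_reducer(None, partial_counts, True)
print(partial_counts, total)
```

Counting with len on the first pass and adding with sum on the second gives the same total as counting every emitted value at once.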

    >>> view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper, reduce_fun=summingReducer, language='python')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.6-intel/egg/couchdb/design.py", line 93, in __init__
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 699, in getsource
        lines, lnum = getsourcelines(object)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 688, in getsourcelines
        lines, lnum = findsource(object)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 529, in findsource
        raise IOError('source code not available')
    IOError: source code not available

The error occurred, but _design/index is registered correctly. Maybe this design document was created when I ran the code as written with $ python the_tweet__count_entities_in_tweets.py tweets-user-timeline-onozka.

I'll copy and paste its contents from http://localhost:5984/_utils/document.html?tweets-user-timeline-onozka/_design/index.

_id _design/index _rev 1-4785071b1f7d14fe8a3a3394f2551b20

language python

views { "entity_count_by_doc": { "map": "def entityCountMapper(doc):\n if not doc.get('entities'):\n import twitter_text\n\n def getEntities(tweet):\n\n # Now extract various entities from it and build up a familiar structure\n\n extractor = twitter_text.Extractor(tweet['text'])\n\n # Note that the production Twitter API contains a few additional fields in\n # the entities hash that would require additional API calls to resolve\n\n entities = {}\n entities['user_mentions'] = []\n for um in extractor.extract_mentioned_screen_names_with_indices():\n entities['user_mentions'].append(um)\n\n entities['hashtags'] = []\n for ht in extractor.extract_hashtags_with_indices():\n\n # Massage field name to match production twitter api\n\n ht['text'] = ht['hashtag']\n del ht['hashtag']\n entities['hashtags'].append(ht)\n\n entities['urls'] = []\n for url in extractor.extract_urls_with_indices():\n entities['urls'].append(url)\n\n return entities\n\n doc['entities'] = getEntities(doc)\n\n if doc['entities'].get('user_mentions'):\n for user_mention in doc['entities']['user_mentions']:\n yield ('@' + user_mention['screen_name'].lower(), [doc['_id'], doc['id']])\n if doc['entities'].get('hashtags'):\n for hashtag in doc['entities']['hashtags']:\n yield ('#' + hashtag['text'], [doc['_id'], doc['id']])\n if doc['entities'].get('urls'):\n for url in doc['entities']['urls']:\n yield (url['url'], [doc['_id'], doc['id']])", "reduce": "def summingReducer(keys, values, rereduce):\n if rereduce:\n return sum(values)\n else:\n return len(values)" } }

Next, I tried some debugging.

    >>> print db.view('index/entity_count_by_doc')
    <ViewResults <PermanentView '_design/index/_view/entity_count_by_doc'> {}>

    >>> for row in db.view('index/entity_count_by_doc'):
    ...     print row
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 984, in __iter__
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 1003, in rows
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 990, in _fetch
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 880, in _exec
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 393, in get_json
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 374, in get
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 419, in _request
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 310, in request
    couchdb.http.ServerError: (500, (u'EXIT', u'{{badmatch,[]},\n [{couch_query_servers,new_process,3},\n {couch_query_servers,lang_proc,3},\n {couch_query_servers,handle_call,3},\n {gen_server,handle_msg,5},\n {proc_lib,init_p_do_apply,3}]}'))

    >>> db.view('index/entity_count_by_doc', group=True)
    <ViewResults <PermanentView '_design/index/_view/entity_count_by_doc'> {'group': True}>

    >>> for row in db.view('index/entity_count_by_doc', group=True):
    ...     print row.id
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 984, in __iter__
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 1003, in rows
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 990, in _fetch
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 880, in _exec
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 393, in get_json
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 374, in get
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 419, in _request
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 310, in request
    couchdb.http.ServerError: (500, (u'EXIT', u'{{badmatch,[]},\n [{couch_query_servers,new_process,3},\n {couch_query_servers,lang_proc,3},\n {couch_query_servers,handle_call,3},\n {gen_server,handle_msg,5},\n {proc_lib,init_p_do_apply,3}]}'))

    >>> for row in db.view('index/entity_count_by_doc', group=True):
    ...     print row.key
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 984, in __iter__
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 1003, in rows
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 990, in _fetch
      File "build/bdist.macosx-10.6-intel/egg/couchdb/client.py", line 880, in _exec
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 393, in get_json
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 374, in get
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 419, in _request
      File "build/bdist.macosx-10.6-intel/egg/couchdb/http.py", line 310, in request
    couchdb.http.ServerError: (500, (u'EXIT', u'{{badmatch,[]},\n [{couch_query_servers,new_process,3},\n {couch_query_servers,lang_proc,3},\n {couch_query_servers,handle_call,3},\n {gen_server,handle_msg,5},\n {proc_lib,init_p_do_apply,3}]}'))

Next, I tried some more debugging.

    >>> for doc_id in db:
    ...     doc = db.get('doc_id')
    ...     print doc_id, doc
    ...
    _design/index None
    f1a04f6144c4f289248bb9f57a000f7e None
    f1a04f6144c4f289248bb9f57a001590 None
    f1a04f6144c4f289248bb9f57a0019d7 None
    f1a04f6144c4f289248bb9f57a0020ad None
    f1a04f6144c4f289248bb9f57a0028e5 None
    f1a04f6144c4f289248bb9f57a002c56 None
    f1a04f6144c4f289248bb9f57a003834 None
    f1a04f6144c4f289248bb9f57a003987 None
    f1a04f6144c4f289248bb9f57a004461 None
    f1a04f6144c4f289248bb9f57a004b48 None
    f1a04f6144c4f289248bb9f57a0053a7 None
    f1a04f6144c4f289248bb9f57a005605 None
    . . .
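The None values in the loop above do not necessarily mean the documents are empty. The loop passes the quoted string 'doc_id' to db.get, so each iteration asks for a document whose ID is literally doc_id. As far as I know, couchdb-python's Database.get returns its default (None) when no such document exists. The sketch below reproduces the effect with a plain dict as a hypothetical stand-in for the database; the document IDs and contents are made up, and the code is Python 3.

```python
# Hypothetical stand-in for the database: document IDs mapped to documents.
fake_db = {
    "doc-1": {"text": "first tweet"},
    "doc-2": {"text": "second tweet"},
}

# Quoted: looks up the literal ID 'doc_id' every time, which does not exist.
quoted = [fake_db.get('doc_id') for doc_id in fake_db]

# Unquoted: looks up the ID held in the loop variable.
unquoted = [fake_db.get(doc_id) for doc_id in fake_db]

print(quoted)
print(unquoted)
```

Dropping the quotes, as in db.get(doc_id), should make the real loop print each document instead of None.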

    >>> for row in db.view('_all_docs'):
    ...     print row.id
    ...
    _design/index
    f1a04f6144c4f289248bb9f57a000f7e
    f1a04f6144c4f289248bb9f57a001590
    f1a04f6144c4f289248bb9f57a0019d7
    f1a04f6144c4f289248bb9f57a0020ad
    f1a04f6144c4f289248bb9f57a0028e5
    f1a04f6144c4f289248bb9f57a002c56
    f1a04f6144c4f289248bb9f57a003834
    f1a04f6144c4f289248bb9f57a003987
    f1a04f6144c4f289248bb9f57a004461
    f1a04f6144c4f289248bb9f57a004b48
    . . .

    >>> dbs = server.__iter__()
    >>> for i in dbs:
    ...     print i
    ...
    _replicator
    _users
    enron
    tweets-user-timeline-onozka

This is what I tried. Does it provide any clues? I hope so.

Thanks.

onozka commented 12 years ago

Oh, the indentation was removed when I posted!! I did indent correctly :D

ptwobrussell commented 12 years ago

This is probably a configuration issue of some sort.

The starting point to debug this is here:

    >>> view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper, reduce_fun=summingReducer, language='python')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.6-intel/egg/couchdb/design.py", line 93, in __init__
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 699, in getsource
    ...

So, if you look at line 93 of design.py (that's inside of the CouchDB-0.7-py2.6.egg file for me), it's this:

map_fun = _strip_decorators(getsource(map_fun).rstrip())

And if you notice that the function getsource is referenced in that line (and also mentioned in the stack trace), you can then jump up to line 12 to see this one:

from inspect import getsource

So, somehow your getsource may not be getting imported correctly.

In an interpreter, try simply typing "from inspect import getsource" and see what happens. If it imports, then type "inspect.file" and see what physical path it references.

It might also be worth running "import couchdb" and seeing what file is referenced via "couchdb.file" as well.

Let me know what you find...

ptwobrussell commented 12 years ago

Ugg. The words in bold above should be formatted with double underscores on each side, like this:

inspect.__file__
couchdb.__file__

MvanErp commented 12 years ago

I know this issue is closed, but I'm still stuck. I'm also getting the "IOError: source code not available" error on the line "view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper, reduce_fun=summingReducer, language='python')"

And I don't know how to solve it.

So inspect.__file__ gives me

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.pyc

and couchdb.__file__ gives me

/Library/Python/2.7/site-packages/couchdb/__init__.pyc

Any chances that you could give me an additional hint? I tried googling for possible problems with the couchdb module and inspect, but to no avail...

ptwobrussell commented 12 years ago

Just to be clear, is your stack trace the exact same one? Paste it in just to be sure?

MvanErp commented 12 years ago

Wow! That's a quick response. Here's my trace:

    >>> view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper, reduce_fun=summingReducer, language='python')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.7/site-packages/couchdb/design.py", line 93, in __init__
        map_fun = _strip_decorators(getsource(map_fun).rstrip())
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 699, in getsource
        lines, lnum = getsourcelines(object)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 688, in getsourcelines
        lines, lnum = findsource(object)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 529, in findsource
        raise IOError('source code not available')
    IOError: source code not available

sdoud commented 12 years ago

I had the same problem. Here is the reason: the problem appeared when I was copying commands from the editor into the interpreter. When I ran the whole script, everything worked fine.

ptwobrussell commented 12 years ago

That's great news. Thanks for reporting. Not sure what is going on, but it's good to know that this is working.
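A plausible explanation, going by the traceback: ViewDefinition calls inspect.getsource on the map function, and getsource needs a source file to read from. A function typed or pasted into the interpreter has none, while the same function inside a script does. The sketch below imitates both cases in Python 3, where IOError is an alias of OSError. The mapper function, the file name mapper_module.py and the "<pasted-into-interpreter>" label are all hypothetical, and compiling from a string is only an approximation of pasting into an interactive session.

```python
import importlib.util
import inspect
import os
import tempfile

# Hypothetical map function, kept as text so it can be loaded two ways.
SOURCE = "def mapper(doc):\n    yield doc['_id'], 1\n"


def source_or_error(func):
    """Return the function's source text, or a note if it cannot be found."""
    try:
        return inspect.getsource(func)
    except (IOError, OSError) as exc:
        return "unavailable: %s" % exc


# Case 1: compiled from a string with no backing file, roughly like a
# function pasted into an interactive interpreter.
pasted_ns = {}
exec(compile(SOURCE, "<pasted-into-interpreter>", "exec"), pasted_ns)
pasted_result = source_or_error(pasted_ns["mapper"])

# Case 2: the same function loaded from a real file, like running a script.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mapper_module.py")
    with open(path, "w") as fh:
        fh.write(SOURCE)
    spec = importlib.util.spec_from_file_location("mapper_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    file_result = source_or_error(module.mapper)

print(pasted_result)
print(file_result)
```

If this reading is right, it would explain why running the_tweet__count_entities_in_tweets.py as a whole succeeds at the ViewDefinition line while the same code pasted interactively does not.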

uozias commented 11 years ago

I had the same problem because I didn't read chapter 3. After I added "[query_servers] python = path/to/couchpy" to local.ini in C:\Program Files (x86)\Apache Software Foundation\CouchDB\etc\couchdb, the script ran successfully.
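For anyone wanting to check their own configuration, a small sketch along these lines can confirm whether a local.ini declares a Python query server. It uses Python 3's configparser on a sample string. The helper name python_query_server and the sample contents are hypothetical, the path/to/couchpy placeholder is copied from the comment above, and a real local.ini may contain syntax that configparser handles differently from CouchDB.

```python
import configparser

# Hypothetical local.ini contents, using the placeholder path from above.
SAMPLE_LOCAL_INI = """
[query_servers]
python = path/to/couchpy
"""


def python_query_server(ini_text):
    """Return the configured Python query server command, or None."""
    # interpolation=None keeps any literal '%' characters intact;
    # strict=False tolerates repeated sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("query_servers", "python", fallback=None)


print(python_query_server(SAMPLE_LOCAL_INI))
print(python_query_server("[httpd]\nport = 5984\n"))
```

A result of None would suggest that CouchDB has no handler for views whose language is python, which may be what the couch_query_servers error earlier in this thread was reporting.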