ptwobrussell / Mining-the-Social-Web-2nd-Edition

The official online compendium for Mining the Social Web, 2nd Edition (O'Reilly, 2013)
http://bit.ly/135dHfs

Chapter 9 example 6 #187

Open andrew-stebbing opened 10 years ago

andrew-stebbing commented 10 years ago

Hello,

I am having trouble getting the save_json function from Chapter 9, Example 6 to work correctly. For the record, I'm running this code in a virtual environment with Python 2.7.1 but following the format from the IPython notebooks, as I don't want to be tied to the notebooks forever. If I save, and then load, the results of a search, I end up with output that's full of backslashes:

"{\"search_metadata\": {\"count\": 1, \"completed_in\": 0.012, \"max_id_str\": \"464741216979283968\", \"since_id_str\": \"0\",

In Learning Python, 5th ed., by Mark Lutz, I found some sample code for writing JSON to a file. So, if our search results are to be written to results.json, the code would be:

json.dump(data, fp=open("./{0}.json".format(filename), 'w'), indent=1)

If I use this to save the data and then use the load_json function to retrieve and print it, I get what I'd expect:

{
 "search_metadata": {
  "count": 1, 
  "completed_in": 0.012, 
  "max_id_str": "464741216979283968", 
  "since_id_str": "0", 
  "query": "Oscar+Knox", 
  "max_id": 464741216979283968, 
  "refresh_url": "?since_id=464741216979283968&q=Oscar%20Knox&include_entities=1", 
  "next_results": "?max_id=464741216979283967&q=Oscar%20Knox&count=1&include_entities=1", 
  "since_id": 0
 }, 

Thus, it appears that it's the save_json function that's not working correctly. The code json.dump(data, fp=open("./{0}.json".format(filename), 'w'), indent=1) appears to be just an amalgam of json.dump and open('filename').

I've tried creating a hybrid function similar to save_json, but it doesn't work:

def save_json_hybrid(filename, data):
  with io.open("./{0}.json".format(filename), 'w', encoding="utf-8") as f:
    json.dump(data, f, indent=1) 

Using json.dump yields TypeError: must be unicode, not str, whilst swapping in json.dumps doesn't write anything (it just returns a string; nothing calls f.write). I'm fairly new to Python so I'm rather struggling here.
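From what I can gather (and this is just my guess as a newcomer), io.open with encoding="utf-8" returns a file object that expects unicode, whereas json.dump writes plain str chunks in Python 2, hence the TypeError. Here's a minimal sketch of what I mean; the sketch.json filename is just for illustration:

import io
import json

data = {"a": "b"}
with io.open("./sketch.json", 'w', encoding="utf-8") as f:
    # json.dump(data, f, indent=1)  # TypeError: must be unicode, not str,
    #                               # because json.dump writes str chunks
    # Serializing first and wrapping the result in unicode() avoids it:
    f.write(unicode(json.dumps(data, indent=1)))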

For the time being I'm using this different version of save_json

def save_json_v2(filename, data):
    json.dump(data, fp=open("./{0}.json".format(filename), 'w'), indent=1)

which seems to work fine for both trend and query searches but, as no one else seems to have raised this as an issue, I'm curious as to why it's not working correctly for me.
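One side note I've since spotted (my own observation, not from the book): save_json_v2 never explicitly closes the file handle it opens. A with-block variant, sketched here under a hypothetical name, closes it deterministically; json.dump's default ensure_ascii=True writes plain ASCII, so an ordinary open() suffices in Python 2:

def save_json_v2_closed(filename, data):
    # Same behaviour as save_json_v2, but the with-block guarantees the
    # file is closed even if json.dump raises partway through.
    with open("./{0}.json".format(filename), 'w') as fp:
        json.dump(data, fp, indent=1)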

Regards Andrew

ptwobrussell commented 10 years ago

Are you saying that if you use the save_json and load_json functions as defined in Example 9-6 together as a pair, the results come back with escaped backslashes and such things, e.g. that running the example as-is gives you this behavior? Or does mixing and matching json.dumps and json.loads with these functions produce it? I'm not sure how this would be happening, and just want to clarify the question. Here's an example IPython interpreter session that shows sample usage:

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import io, json
:
:def save_json(filename, data):
:    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename), 
:                 'w', encoding='utf-8') as f:
:        f.write(unicode(json.dumps(data, ensure_ascii=False)))
:
:def load_json(filename):
:    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename), 
:                 encoding='utf-8') as f:
:        return f.read()
:--

In [2]: foo = {"a" : "b"}

In [3]: save_json("foo.json", foo)

In [4]: bar = load_json("foo.json")

In [5]: print bar
{"a": "b"}

In [6]: print json.dumps(bar)
"{\"a\": \"b\"}"

I have the save_json and load_json functions here for precisely the reason you touch on: you get the dreaded Unicode errors with Python 2.7 when you're trying to serialize non-ASCII text out to a file, hence the wrappers. It's a messy situation, and you could probably spend the rest of the day reading up on Python 2.7 and the Unicode handling that was addressed in Python 3.x.

Also, bear in mind that there's a difference between a Python object and its JSONified representation on disk. A serialized JSON object is a bona fide "string", so certain characters in that serialized representation, like the quotes around key names, then have to be escaped with backslashes.
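To make that concrete, here's a contrived snippet (not from the book's examples) showing that json.dumps applied to an already-serialized string escapes its quotes, and that json.loads is what turns the string back into a Python object:

import json

obj = {"a": "b"}
s = json.dumps(obj)            # '{"a": "b"}' -- now just a string
print json.dumps(s)            # "{\"a\": \"b\"}" -- quotes get escaped
print json.loads(s) == obj     # True -- json.loads recovers the object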

Does this help at all?

andrew-stebbing commented 10 years ago

Matthew,

Thank you very much for your reply and may I take this opportunity to congratulate you on a fantastic book. Very enjoyable and informative.

To the matter at hand: firstly, I was using the save_json and load_json functions as a pair. Interestingly, I ran everything again this morning inside the IPython notebook and I'm still getting the dreaded Unicode error. (Ah, so that's what it is! I've read a lot about it.)

I ran examples 1, 3, 4 and 6 from Chapter 9 in sequence, with absolutely no changes to any of the code except the inclusion of my unique credentials in example 1.

Here's the beginning of the output from Example 4, Searching for Tweets:

{
 "contributors": null, 
 "truncated": false, 
 "text": "Want to have 1 membership (no contracts) to 10 Nashville fitness studios? Pilates, Boot Camp, CrossFit, Cycle, Yoga, & more @fitmixnashville", 
 "in_reply_to_status_id": null, 
 "id": 465044875579523072, 
 "favorite_count": 0, 
 "source": "<a href=\"http://www.socialoomph.com\" rel=\"nofollow\">SocialOomph</a>", 
 "retweeted": false, 
 "coordinates": null, 
 "entities": {
  "symbols": [], 
  "user_mentions": [

...just as we'd expect. When it gets run through Example 6, the dreaded Unicode error occurs:

"[{\"contributors\": null, \"truncated\": false, \"text\": \"\\\"Crossfit is like reverse fight club because the first rule of Crossfit is you never...\\\" \u2013 via @getsecret https://t.co/gHoLLNjme4\", \"in_reply_to_status_id\": null, \"id\": 465045090885328896, \"favorite_count\": 0, \"source\": \"<a href=\\\"http://www.apple.com\\\" rel=\\\"nofollow\\\">iOS</a>\", \"retweeted\": false, \"coordinates\": null, \"entities\": {\"symbols\": [], \"user_mentions\": 

I created the virtual environment and imported all the code on 26th April this year so I'm assuming I have the latest versions of everything.

Regards Andrew

ptwobrussell commented 10 years ago

This is interesting, and I do want to work with you to figure out what is going on. I am unable to reproduce this at the moment, but I don't doubt that you are getting the results that you say you are.

One point I should make is that, as written, Example 9-6 is standalone in terms of the data that it actually runs through save_json and load_json. The references to oauth_login() and twitter_search are just function references, so data input/output from previous examples such as Examples 9-3 or 9-4 shouldn't matter.

This code hasn't been updated in a while, so I think you do have the latest version of the code.

Since the code in Example 9-6 is only pulling back 10 results, I wonder if you could use the code block below to try to reproduce the error and share the full output with me when it occurs? GitHub will probably truncate it, so we may need to use something like a pastebin to get it all across.

import twitter
import io
import json

def oauth_login():
    # XXX: Go to http://twitter.com/apps/new to create an app and get values
    # for these credentials that you'll need to provide in place of these
    # empty string values that are defined as placeholders.
    # See https://dev.twitter.com/docs/auth/oauth for more information 
    # on Twitter's OAuth implementation.

    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    OAUTH_TOKEN = ''
    OAUTH_TOKEN_SECRET = ''

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)

    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

def twitter_search(twitter_api, q, max_results=200, **kw):

    # See https://dev.twitter.com/docs/api/1.1/get/search/tweets and 
    # https://dev.twitter.com/docs/using-search for details on advanced 
    # search criteria that may be useful for keyword arguments

    # See https://dev.twitter.com/docs/api/1.1/get/search/tweets    
    search_results = twitter_api.search.tweets(q=q, count=100, **kw)

    statuses = search_results['statuses']

    # Iterate through batches of results by following the cursor until we
    # reach the desired number of results, keeping in mind that OAuth users
    # can "only" make 180 search queries per 15-minute interval. See
    # https://dev.twitter.com/docs/rate-limiting/1.1/limits
    # for details. A reasonable number of results is ~1000, although
    # that number of results may not exist for all queries.

    # Enforce a reasonable limit
    max_results = min(1000, max_results)

    for _ in range(10): # 10*100 = 1000
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError: # No more results when next_results doesn't exist
            break

        # Create a dictionary from next_results, which has the following form:
        # ?max_id=313519052523986943&q=NCAA&include_entities=1
        kwargs = dict([ kv.split('=') 
                        for kv in next_results[1:].split("&") ])

        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']

        if len(statuses) > max_results: 
            break

    return statuses

def save_json(filename, data):
    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename), 
                 'w', encoding='utf-8') as f:
        f.write(unicode(json.dumps(data, ensure_ascii=False)))

def load_json(filename):
    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename), 
                 encoding='utf-8') as f:
        return f.read()

# Sample usage

q = 'CrossFit'

twitter_api = oauth_login()
results = twitter_search(twitter_api, q, max_results=10)

print results # I'll be guaranteed to see the full text of the tweets with this

# But in theory, one of these calls is causing the error?
save_json(q, results)
results = load_json(q)

# Or is it this statement that you are saying is causing the error?
print json.dumps(results, indent=1)

What I'm curious to see is whether you are ultimately finding that it's the save_json and load_json calls that are causing the error, or whether it's the print json.dumps(results, indent=1) statement that is the trigger. If it's the latter, I think I may already know what's going on. If it's the former, it'll be a bit more of a mystery.
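For what it's worth, here's my hunch (just a guess until you confirm which call is the trigger): load_json returns f.read(), i.e. the raw file contents as a string rather than a parsed object, so handing its result to json.dumps serializes a string a second time and escapes every quote:

import json

raw = '{"a": "b"}'                            # what load_json would hand back
print json.dumps(raw, indent=1)               # "{\"a\": \"b\"}" -- the backslashes
print json.dumps(json.loads(raw), indent=1)   # parse first; prints cleanly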

andrew-stebbing commented 10 years ago

GitHub wouldn't let me post all results here so I've sent them to you via the email address listed on the 'Mining the Social Web' web-site.

LeiG commented 9 years ago

Any updates on this issue? I ran into the same situation as described. Thank you!

ptwobrussell commented 9 years ago

@LeiG - What specifically are you running into? UnicodeDecodeError? Can you provide more specifics and/or sample data that causes it?