ptwobrussell / Mining-the-Social-Web

The official online compendium for Mining the Social Web (O'Reilly, 2011)
http://bit.ly/135dHfs

Sampling technique doesn't work #6

Closed dpiponi closed 13 years ago

dpiponi commented 13 years ago

In twitter__util.py, line 107, the code lst = lst[:int(len(lst) * sample)] fails to trim the lists as intended because it's assigning the trimmed list to lst, not screen_names or user_ids.
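
For anyone who wants to check the reported behaviour, here is a minimal reproduction sketch. The list contents and the sample value are made up for illustration and are not taken from twitter__util.py; only the loop body mirrors the line quoted above.

```python
from random import shuffle

# Hypothetical stand-ins for the real lists built in twitter__util.py.
screen_names = ['alice', 'bob', 'carol', 'dave']
user_ids = [1, 2, 3, 4]
sample = 0.5  # assumed value; any fraction below 1.0 takes this path

for lst in [screen_names, user_ids]:
    shuffle(lst)                        # in place: reorders the original list
    lst = lst[:int(len(lst) * sample)]  # rebinds the loop variable to a new, shorter list

# The original names still refer to the full-length (shuffled) lists.
print(len(screen_names), len(user_ids))  # prints: 4 4
```

In this sketch the shuffle is visible through screen_names and user_ids, but the trimmed list is only ever bound to lst and is discarded on the next iteration.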

ptwobrussell commented 13 years ago

Thanks for noticing this. It looks like you are indeed correct. I think this slipped through during a final pass of refactoring. Fortunately, it didn't make it into print (twitter__util.py isn't in print). I'll make a note to fix this in the next day or so. In the meantime, feel free to send me a pull request if you've already fixed it and made any other improvements. Thanks again.

ptwobrussell commented 13 years ago

Just getting around to cleaning house. Sorry for the delay. After taking a look at the issue more closely, I don't think there is actually a bug with lines 104-107 in twitter__util.py after all. Lines 104-107 follow for convenience:

if sample < 1.0:
    for lst in [screen_names, user_ids]:
        shuffle(lst)
        lst = lst[:int(len(lst) * sample)]

What's happening is that lst here is a reference to screen_names and user_ids and the shuffle() and assignment operations that take place in the body of the loop on lst are passed through to screen_names and user_ids. You can check this out for yourself in the interpreter with a similar test case to see that the changes to lst pass through via the references:

>>> a = []
>>> b = []
>>> for lst in [a,b]:
...     lst.append('foo')
...     lst = lst[:]
... 
>>> a
['foo']
>>> b
['foo']
>>>
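
One caveat about the test case above: append() mutates the list in place, and lst = lst[:] makes a same-length copy, so the session does not by itself show whether a shortening assignment reaches a and b. The sketch below separates the two cases. The helper name sample_in_place and the list contents are hypothetical and are not part of twitter__util.py; slice deletion is just one possible way to trim in place.

```python
from random import shuffle

def sample_in_place(lst, sample):
    """Shuffle lst and keep only a `sample` fraction of it, mutating lst itself.

    Hypothetical helper for illustration only.
    """
    shuffle(lst)
    del lst[int(len(lst) * sample):]  # slice deletion changes the existing list object

a = list(range(10))
b = list(range(10))

# Case 1: plain assignment rebinds the loop variable; a and b are untouched.
for lst in [a, b]:
    lst = lst[:5]
assert len(a) == 10 and len(b) == 10

# Case 2: in-place trimming is visible through the original names.
for lst in [a, b]:
    sample_in_place(lst, 0.5)
print(len(a), len(b))  # prints: 5 5
```

Writing lst[:] = lst[:n] inside the loop would be another in-place option, since slice assignment also modifies the existing list rather than binding a new one.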