Sampling technique doesn't work

dpiponi commented 13 years ago

In twitter__util.py, line 107, the code lst = lst[:int(len(lst) * sample)] fails to trim the lists as intended because it's assigning the trimmed list to lst, not screen_names or user_ids.
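A minimal sketch of one possible fix, assuming the intent is to shuffle each list and keep only the first `sample` fraction of it. The helper name `sample_in_place` and the example data are hypothetical and not taken from twitter__util.py; the relevant part is the slice assignment `lst[:] = ...`, which changes the existing list object instead of rebinding the loop variable.

```python
from random import shuffle

def sample_in_place(lists, sample):
    """Shuffle each list and trim it in place to the given fraction."""
    if sample < 1.0:
        for lst in lists:
            shuffle(lst)
            # Slice assignment mutates the list object itself, so the
            # caller's variables see the shortened list.
            lst[:] = lst[:int(len(lst) * sample)]

# Hypothetical data standing in for the real lists
screen_names = ['a', 'b', 'c', 'd']
user_ids = [1, 2, 3, 4, 5, 6]
sample_in_place([screen_names, user_ids], 0.5)
print(len(screen_names), len(user_ids))
```

An equivalent in-place alternative would be `del lst[int(len(lst) * sample):]`.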

ptwobrussell commented 13 years ago

Thanks for noticing this. It looks like you are indeed correct. I think this slipped through during a final pass of refactoring. Fortunately, it didn't make it into print (twitter__util.py isn't in print). I'll make a note to fix this in the next day or so. In the meantime, feel free to send me a pull request if you've already fixed it and made any other improvements. Thanks again.

ptwobrussell commented 13 years ago

Just getting around to cleaning house. Sorry for the delay. After taking a closer look at the issue, I don't think there is actually a bug in lines 104-107 of twitter__util.py after all. Lines 104-107 follow for convenience:

if sample < 1.0:
    for lst in [screen_names, user_ids]:
        shuffle(lst)
        lst = lst[:int(len(lst) * sample)]

What's happening is that lst here is a reference to screen_names and user_ids and the shuffle() and assignment operations that take place in the body of the loop on lst are passed through to screen_names and user_ids. You can check this out for yourself in the interpreter with a similar test case to see that the changes to lst pass through via the references:

>>> a = []
>>> b = []
>>> for lst in [a,b]:
...     lst.append('foo')
...     lst = lst[:]
... 
>>> a
['foo']
>>> b
['foo']
>>>
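
One caveat about the test case above: `lst[:]` copies the whole list, so the output would look the same whether or not the assignment passes through to `a` and `b`. A variant that may separate the two cases more clearly uses a slice that actually shortens the list. The names and values here are only illustrative:

```python
a = [1, 2, 3, 4]
b = [1, 2, 3, 4]

for lst in [a]:
    lst.append(5)      # in-place mutation: visible through a
    lst = lst[:2]      # rebinds the name lst only; a is left untouched

for lst in [b]:
    lst.append(5)      # in-place mutation: visible through b
    lst[:] = lst[:2]   # slice assignment: mutates the list b refers to

print(a)
print(b)
```

If this behaves as expected, `a` keeps all five elements while `b` is cut to two. That would suggest shuffle(), which works in place, does reach screen_names and user_ids, but the trimming assignment on line 107 does not.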

ptwobrussell / Mining-the-Social-Web, issue #6