iamyogi96 / flossmole

Automatically exported from code.google.com/p/flossmole

Only first 50 users of a GC project are scraped #55

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Download a gc datamart (I used 2010-08 and 2011-11)
2. Use the following SQL:
SELECT proj_name, count( * )
FROM gc_project_people
GROUP BY proj_name
ORDER BY count( * ) DESC
LIMIT 0 , 100
3. You'll notice that the first several records all have exactly 50 users per 
project, which is suspicious.
4. Digging deeper, I noticed that the Google Code people pages split their 
tables into pages of 50 rows. My findings (below) suggest that only the first 
page of contributors is scraped.
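
The pagination described in step 4 could be handled roughly as follows. This is only a sketch under assumptions, not the FLOSSmole collector's actual code: `people_page_url`, `collect_all_members`, and the `fetch_page` callback are hypothetical names, and the `num`/`start` query parameters are inferred from the second-page URL quoted in this report.

```python
from typing import Callable, List

PAGE_SIZE = 50  # rows per people page, as observed in this report


def people_page_url(project: str, start: int, page_size: int = PAGE_SIZE) -> str:
    """Build the URL of one people-list page (pattern inferred from the report)."""
    return (
        f"http://code.google.com/p/{project}/people/list"
        f"?num={page_size}&start={start}"
    )


def collect_all_members(
    project: str,
    fetch_page: Callable[[str], List[str]],
    page_size: int = PAGE_SIZE,
) -> List[str]:
    """Walk the pages until one returns fewer than page_size rows.

    fetch_page is a hypothetical callback: it downloads a URL and returns
    the member names parsed from that page.
    """
    members: List[str] = []
    start = 0
    while True:
        rows = fetch_page(people_page_url(project, start, page_size))
        members.extend(rows)
        if len(rows) < page_size:  # short or empty page: nothing left to fetch
            break
        start += page_size
    return members
```

Stopping on the first short page means a project with exactly 50 or 100 members costs one extra request, which returns an empty page and ends the loop.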

Please provide any additional information below.
<ORIGINAL STORY>
When a project has more than 50 contributors/committers/owners, only the first 
50 are collected. Example:
SQL:
SELECT proj_name, count( * )
FROM gc_project_people
GROUP BY proj_name
ORDER BY count( * ) DESC
LIMIT 0 , 100

You'll notice that the first projects listed all have exactly 50 members. This 
seemed a bit odd to me. Take a look at:
- elb816 http://code.google.com/p/elb816/
- the people page http://code.google.com/p/elb816/people/list
- and the second people page 
http://code.google.com/p/elb816/people/list?num=50&start=50
- The names listed on the second page:
jessieda...@gmail.com   Contributor     ----    ----
lzw89...@gmail.com      Contributor     ----    ----
palla...@gmail.com      Contributor     ----    ----
monque03...@gmail.com   Contributor     ----    ----
cdarl...@gmail.com      Contributor     ----    ----
Cyanc...@gmail.com      Contributor     ----    ----
mss...@gmail.com        Contributor     ----    ----
Chenjie0...@gmail.com   Contributor     ----    ----
qayl.2...@gmail.com     Contributor     ----    ----
kongfuf...@gmail.com    Contributor     ----    ----

SELECT * FROM `gc_project_people` WHERE `person_name` LIKE "jessieda%"
returns one row, related to some robot project

SELECT * FROM `gc_project_people` WHERE `person_name` LIKE "lzw89%"
returns no rows

SELECT * FROM `gc_project_people` WHERE `person_name` LIKE "palla%"
returns 10 rows, but none for elb816

SELECT * FROM `gc_project_people` WHERE `person_name` LIKE "monque03%"
returns no rows

<...>
SELECT * FROM `gc_project_people` WHERE `person_name` LIKE "kongfuf%"
returns no rows
</ORIGINAL STORY>
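
The per-name lookups above can be folded into a single check that is restricted to one project. The sketch below is illustrative only: it uses Python's `sqlite3` as a stand-in for the MySQL datamart, `missing_members` is a hypothetical helper, and it assumes the `gc_project_people` table exposes the `proj_name` and `person_name` columns used in the queries above.

```python
import sqlite3
from typing import Iterable, List


def missing_members(
    conn: sqlite3.Connection, project: str, prefixes: Iterable[str]
) -> List[str]:
    """Return the name prefixes that have no row for the given project."""
    missing = []
    for prefix in prefixes:
        # Same LIKE 'prefix%' match as the manual queries, plus a project filter.
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM gc_project_people "
            "WHERE proj_name = ? AND person_name LIKE ?",
            (project, prefix + "%"),
        ).fetchone()
        if count == 0:
            missing.append(prefix)
    return missing
```

Calling `missing_members(conn, "elb816", ["jessieda", "lzw89", "palla"])` on a database that has only the first people page would be expected to return every prefix taken from the second page.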

Original issue reported on code.google.com by akrukk...@gmail.com on 31 Jan 2012 at 2:55

GoogleCodeExporter commented 8 years ago
I think I remember seeing something about this before. Let me take a look. 
Friday will be the earliest I can get to this. Thanks for the bug report!

Original comment by megansquire on 31 Jan 2012 at 7:38

GoogleCodeExporter commented 8 years ago
I think I've decided to rewrite the Google collector. The page has changed 
enough to warrant it.

Original comment by megansquire on 22 May 2012 at 3:10

GoogleCodeExporter commented 8 years ago
All fixed. FIRST -- Sorry this took so long. (And wow, chromium-os is a really 
big project!) SECOND -- I'm having an issue getting the Teragrid (sdsc) 
computers updated as fast as the text files. However, this should be fixed in 
the next week or so. The latest datasource is 313 and the text files are 
available on our Google Code downloads page. (Teragrid uploads were messed up 
and for some reason my data was getting overwritten each time I put data up 
there. I have to fix the data move scripts. In the meantime you could use the 
text files I suppose, if you have a local database.)

Here are the query and results as they stand now (note the datasource ID of 
313; this is May 2012 data only):

SELECT proj_name, count( * )
FROM gc_project_people
WHERE datasource_id=313
GROUP BY proj_name
ORDER BY count( * ) DESC
LIMIT 0 , 100

chromium-os     701
hack4jp     434
hackathon-jp    306
kyoto-gtug  190
hacker-within-scbc  151
otwarchive  146
openetna    130
developerhappinessdays  130
catroid     125
getpaid     97
nhq-project-management  97
internship2006  95
android     93
epub-revision   92
libgdx-users    89
utexas-art-ros-pkg  89
google-summer-of-code-2008-kde  88
pharo   85
otwbingo    84
googleappengine     83
dolphin-emu     82
pencil-code     82
munich-gtug     80
nativeclient    78
google-summer-of-code-2010-asf  78
google-summer-of-code-2009-kde  76
simpleinvoices  76
s-athena    75
v8  73
jmonkeyengine   73
oryx-editor     72
svn-fog     69
ardupilot-mega  69
google-summer-of-code-2009-apache   69
fsnet   67
selenium    67
webdriver   67
sympy   66
androidteam     66
sociallearnlab  65
google-summer-of-code-2010-python   64
google-summer-of-code-2007-kde  63
google-summer-of-code-2008-psf  63
opensocial-resources    63
naclports   60
google-web-toolkit  60
elb816  60
jclouds     60
nativeclient-sdk    59
posit-mobile    59
cunruiwang-se   59
blender-translation     58
otwtest     58
jfxtras     58
dtui-k52ca  57
openspacecode   56
nutz    56
phpdoc-es   56
unicase     56
vimbook     55
mec-cs-2010-miniproject     55
cellbots    53
sfck    53
khanacademy     52
p4-fleet-09     52
moztw-web   52
opencollar  51
osacyber    51
google-summer-of-code-2008-gnome    51
google-summer-of-code-2007-gnome    50
open-ihm    50
r-u-game    50
kcpycamp    50
lilypond    49
jaikuengine     49
google-summer-of-code-2007-psf  49
fswuniceub  49
app-inventor-for-android-cs0projects    49
arducopter  49
allforgood  48
open-source-class   48
project-2010-graduacao-unipar-3ano  48
xbmc-addons     48
zingmeapis  48
gdata-python-client     48
gedemin     48
google-summer-of-code-2008-asf  47
ecug    47
image-to-uml    47
inspections-at-iscte    47
joomlajp    47
borrowing-books-information-system-of-nvc   47
claseproyectouniva  47
controle-versao-gsd     47
waf     47
srckrea     46
pyzh    46
gyp     46
gdata-issues    46
google-caja     46
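
A quick way to spot the original symptom in future datasources might look like the sketch below. It is a heuristic under stated assumptions, not part of the collector: `looks_truncated` is a hypothetical helper, and it treats a result set whose largest project has exactly one page's worth of members (50) as a sign that only the first page was scraped.

```python
from typing import Dict


def looks_truncated(member_counts: Dict[str, int], page_size: int = 50) -> bool:
    """Heuristic: True when the largest project has exactly page_size members."""
    if not member_counts:
        return False
    return max(member_counts.values()) == page_size


if __name__ == "__main__":
    # Three rows taken from the datasource 313 results above.
    counts = {"chromium-os": 701, "hack4jp": 434, "elb816": 60}
    print(looks_truncated(counts))  # False
```

A datasource in which a project really does top out at exactly 50 members would be flagged by mistake, so a hit is a prompt to inspect the data rather than proof of a bug.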

Original comment by megansquire on 29 May 2012 at 5:44