I think the first step should be to build a unified CSV/JSON for the data. This will require some data cleaning, a common schema, and uniform fields across the years. For example, for 2005 to 2008, `link` contains the org website, whereas for 2009 to 2015 `link` contains the GSoC page link and `page` contains the project ideas link, which will most likely also contain the org link. Then in 2016 there is no project URL except the GSoC page link.
This unification will also help with data sanitization. For example, in the file `2009-2013.json` there is a case with an image link in the `about` field (maybe there were no about details in that case), or this entry:
"about": "\n Mailing List: http://freifunk.net/mailman/listinfo/wlanware\n ",
"link": "https://www.google-melange.com/archive/gsoc/2009/orgs/ffopenwrt",
"mail": "//wiki.freifunk.net/Ideas",
"name": "ffopenwrt",
"page": "License: GNU General Public License (GPL)"
A separate `year` field would also help. I think it can be parsed out of the google-melange URLs, but it will still help to have a separate entry for the GSoC year.
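As a rough sketch of what the unification could look like: the common field names (`org_url`, `gsoc_url`, `ideas_url`) and the `unify` helper below are assumptions made up for illustration, not an existing schema in the repo. The year is taken from the melange URL when possible; for 2005 to 2008, where `link` is the org website, the caller has to pass the year in.

```python
import re

# Hypothetical common schema; these field names are assumptions.
COMMON_FIELDS = ("year", "name", "org_url", "gsoc_url", "ideas_url", "about")

# Matches the year in archive URLs such as
# https://www.google-melange.com/archive/gsoc/2009/orgs/ffopenwrt
MELANGE_YEAR = re.compile(r"/gsoc/(\d{4})/")


def year_from_melange_url(url):
    """Return the GSoC year embedded in a melange URL, or None."""
    match = MELANGE_YEAR.search(url or "")
    return int(match.group(1)) if match else None


def clean(value):
    """Strip whitespace; turn empty or missing values into None."""
    return (value or "").strip() or None


def unify(record, year=None):
    """Map one raw org record onto the common schema."""
    link = clean(record.get("link"))
    year = year or year_from_melange_url(link)
    unified = dict.fromkeys(COMMON_FIELDS)
    unified["year"] = year
    unified["name"] = clean(record.get("name"))
    unified["about"] = clean(record.get("about"))
    if year is not None and year <= 2008:
        # 2005-2008: `link` holds the org website.
        unified["org_url"] = link
    else:
        # 2009 onwards: `link` holds the GSoC page and `page`
        # (absent in 2016) holds the project ideas link.
        unified["gsoc_url"] = link
        unified["ideas_url"] = clean(record.get("page"))
    return unified
```

This only moves values into consistent fields. Entries whose values sit in the wrong fields, like the one above, would still need a separate validation pass.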
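In addition to what @iCHAIT suggested in https://github.com/mzfr/GSoC-Data/issues/6:
Orgs
Domain name analysis
For each org's website, we can see how many are `.edu`, `.org`, or other cases. This gives a rough idea of the number of universities taking part in the process. We can also see how many orgs use a `wiki` or Google Docs pages to host their ideas.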
Email domain name analysis
Same as above. There will be some cleaning and standardization required though.
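A minimal sketch of the domain counting, using only the standard library. The helper names are hypothetical, and the wiki / Google Docs check is a crude substring guess rather than a reliable classifier.

```python
from collections import Counter
from urllib.parse import urlparse


def url_suffix(url):
    """Return the last label of a URL's host name (e.g. 'org'), or None."""
    if "//" not in url:
        url = "//" + url  # let urlparse treat a bare host as the netloc
    host = urlparse(url).hostname
    if not host or "." not in host:
        return None
    return host.rsplit(".", 1)[1].lower()


def email_domain(address):
    """Return the lower-cased domain part of an email address, or None."""
    _, sep, domain = address.strip().rpartition("@")
    return domain.lower() if sep and domain else None


def suffix_counts(urls):
    """Count URLs per suffix; unparseable URLs are counted under None."""
    return Counter(url_suffix(u) for u in urls)


def ideas_host_kind(url):
    """Very rough guess at where an ideas page is hosted."""
    lowered = url.lower()
    if "docs.google.com" in lowered:
        return "google docs"
    if "wiki" in lowered:
        return "wiki"
    return "other"
```

Counting `.edu` alone will undercount universities, since many use country-specific domains, so the numbers should be read as a rough lower bound.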
Nature of org
Word frequency analysis (after removing stopwords) and word clouds on the org about details. This will give a sense of the areas the orgs work in.
Yearly trend of words from the about details of the orgs. This will show the emerging areas in open source and the tech domain in general.
Orgs can also be analysed topic-wise. Topic labeling will probably be manual, based on some keywords. Some interesting questions that can be explored:
How many organizations work in education, finance, video, Linux, ML, science, Android?
How many Python modules have their own orgs?
How many programming languages have their own orgs, and how have their projects and students grown over the years?
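The word frequency and keyword-based topic ideas above could be sketched roughly as follows. The stopword list and the `TOPIC_KEYWORDS` mapping are tiny made-up placeholders, and the records are assumed to already carry `year` and `about` fields from the unified data.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real run would want a fuller one.
STOPWORDS = {
    "a", "an", "and", "the", "of", "to", "in", "for", "is", "on",
    "with", "that", "this", "we", "our", "are", "it", "as", "by", "or",
}

# Hypothetical keyword sets for manual topic labeling.
TOPIC_KEYWORDS = {
    "education": {"education", "teaching", "school"},
    "finance": {"finance", "accounting", "banking"},
    "linux": {"linux", "kernel"},
    "android": {"android"},
}


def tokenize(text):
    """Lower-case the text and return its words minus stopwords."""
    words = re.findall(r"[a-z][a-z0-9+#]*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def word_counts_by_year(orgs):
    """Return {year: Counter of words} over the orgs' about details."""
    counts = defaultdict(Counter)
    for org in orgs:
        counts[org["year"]].update(tokenize(org.get("about") or ""))
    return counts


def label_topics(about):
    """Return the sorted topics whose keywords appear in the about text."""
    words = set(tokenize(about))
    return sorted(t for t, keys in TOPIC_KEYWORDS.items() if words & keys)
```

Comparing `word_counts_by_year(orgs)[2009].most_common(20)` with the same call for a later year would be one simple way to eyeball the yearly trend before building word clouds.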
Projects
Nature of project (Similar to nature of org analysis)
Project title and details analysis. (iCHAIT covered this)
Projects X Orgs
Student retention rate
How long do students remain with a single org? This will only be done for students with more than one GSoC.
What is retention like? Do they only remain as students, or do they also become mentors?
Student conversion
Conversion of student into a mentor
After how many GSoCs a student converted
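Retention and conversion could be computed along these lines. This is a sketch that assumes the unified data can be flattened into `(person, year, org, role)` tuples with `role` being `"student"` or `"mentor"`. That layout is an assumption, and matching the same person across years is likely the hard part in practice.

```python
from collections import defaultdict


def longest_org_stay(participations):
    """Return {student: most GSoC years spent with any single org}.

    Only students with more than one GSoC year are included.
    """
    years_by_student_org = defaultdict(set)
    years_by_student = defaultdict(set)
    for person, year, org, role in participations:
        if role == "student":
            years_by_student_org[(person, org)].add(year)
            years_by_student[person].add(year)
    stays = {}
    for (person, _org), years in years_by_student_org.items():
        if len(years_by_student[person]) > 1:
            stays[person] = max(stays.get(person, 0), len(years))
    return stays


def conversions(participations):
    """Return {person: student GSoCs completed before first mentoring}.

    Only people who were a student before their first mentor year count.
    """
    student_years = defaultdict(set)
    first_mentor_year = {}
    for person, year, _org, role in participations:
        if role == "student":
            student_years[person].add(year)
        elif role == "mentor":
            earliest = first_mentor_year.get(person, year)
            first_mentor_year[person] = min(year, earliest)
    result = {}
    for person, mentor_year in first_mentor_year.items():
        before = [y for y in student_years.get(person, ()) if y < mentor_year]
        if before:
            result[person] = len(before)
    return result
```

The conversion rate would then be `len(conversions(data))` divided by the number of distinct students.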
Creating and visualizing the org to project to mentor to student graph.
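For the graph, one option is to build plain adjacency sets and export Graphviz DOT text for rendering. The sketch below assumes each project record has `org`, `title`, `mentor`, and `student` keys (hypothetical field names) and that names contain no double quotes.

```python
from collections import defaultdict


def build_graph(projects):
    """Return {node: set of child nodes} for org -> project -> mentor -> student.

    Nodes are (kind, name) tuples so an org and a person with the
    same name do not collide.
    """
    edges = defaultdict(set)
    for p in projects:
        org = ("org", p["org"])
        project = ("project", p["title"])
        mentor = ("mentor", p["mentor"])
        student = ("student", p["student"])
        edges[org].add(project)
        edges[project].add(mentor)
        edges[mentor].add(student)
    return edges


def to_dot(edges):
    """Render the graph as Graphviz DOT text."""
    lines = ["digraph gsoc {"]
    for source in sorted(edges):
        for target in sorted(edges[source]):
            lines.append(
                f'  "{source[0]}: {source[1]}" -> "{target[0]}: {target[1]}";'
            )
    lines.append("}")
    return "\n".join(lines)
```

Because the mentor-to-student edges are not tied to a project, a mentor with several projects links to all of their students; keeping that distinction would need project-to-student edges instead.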
Have fun. :smiley: