mzfr / GSoC-Data

GSoC Data from 2005 to 2018 in JSON format.

Exploratory Analysis #7

Open TrigonaMinima opened 6 years ago

TrigonaMinima commented 6 years ago

I think the first step should be to build a unified CSV/JSON file for the data. That will require some data cleaning and a common schema with uniform fields across the years. For example, for 2005 to 2008, link contains the org website, whereas for 2009 to 2015, link contains the GSoC page link and page contains the project ideas link, which will most likely also contain the org link. In 2016, there is no project URL other than the GSoC page link.
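
A minimal sketch of what such a mapping could look like, assuming the per-year meaning of link and page described above. The function name `unify` and the output field names (`org_url`, `gsoc_url`, `ideas_url`, `raw_link`) are hypothetical and not part of the repo. Years that are not described here are left unclassified rather than guessed.

```python
def unify(record, year):
    """Map one raw org record onto a hypothetical common schema."""
    link = record.get("link")
    unified = {
        "year": year,
        "name": record.get("name"),
        "org_url": None,
        "gsoc_url": None,
        "ideas_url": None,
        "raw_link": link,  # always keep the original value for later checks
    }
    if 2005 <= year <= 2008:
        # link holds the org website
        unified["org_url"] = link
    elif 2009 <= year <= 2015:
        # link holds the GSoC page, page holds the project ideas link
        unified["gsoc_url"] = link
        unified["ideas_url"] = record.get("page")
    elif year == 2016:
        # only the GSoC page link is available
        unified["gsoc_url"] = link
    return unified


raw = {
    "name": "gephi",
    "link": "https://www.google-melange.com/archive/gsoc/2009/orgs/gephi",
    "page": "http://gephi.org/forum/topic.php?id=408",
}
row = unify(raw, 2009)
print(row["gsoc_url"], row["ideas_url"])
```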

This unification will also help with data sanitization. For example, in the following entry from 2009-2013.json, the about field contains only an image link. Perhaps there were no about details in this case, though.

    "about": "[IMAGE http://gephi.org/wp-content/themes/gephi/images/logo80.jpeg]",
    "link": "https://www.google-melange.com/archive/gsoc/2009/orgs/gephi",
    "mail": "//gephi.org/forum/",
    "name": "gephi",
    "page": "http://gephi.org/forum/topic.php?id=408"

Or this entry:

    "about": "\n    Mailing List: http://freifunk.net/mailman/listinfo/wlanware\n  ",
    "link": "https://www.google-melange.com/archive/gsoc/2009/orgs/ffopenwrt",
    "mail": "//wiki.freifunk.net/Ideas",
    "name": "ffopenwrt",
    "page": "License: GNU General Public License (GPL)"

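The two entries above suggest a few mechanical checks. This is a rough sketch, not a complete validator: the helper names and the exact rules are assumptions chosen to catch the problems visible in these two records.

```python
import re

# An about field that is nothing but an "[IMAGE <url>]" placeholder.
IMAGE_ONLY = re.compile(r"^\s*\[IMAGE\s+\S+\]\s*$")


def looks_like_url(value):
    """True for http://, https:// or protocol-relative // values without spaces."""
    return bool(re.match(r"^(https?:)?//\S+$", value.strip()))


def sanity_flags(record):
    """Return a list of human-readable problems found in one org record."""
    flags = []
    if IMAGE_ONLY.match(record.get("about", "")):
        flags.append("about is only an image placeholder")
    page = record.get("page", "")
    if page and not looks_like_url(page):
        flags.append("page is not a URL")
    if record.get("mail", "").startswith("//"):
        flags.append("mail is a protocol-relative link, not an address")
    return flags


gephi = {
    "about": "[IMAGE http://gephi.org/wp-content/themes/gephi/images/logo80.jpeg]",
    "mail": "//gephi.org/forum/",
    "page": "http://gephi.org/forum/topic.php?id=408",
}
ffopenwrt = {
    "about": "\n    Mailing List: http://freifunk.net/mailman/listinfo/wlanware\n  ",
    "mail": "//wiki.freifunk.net/Ideas",
    "page": "License: GNU General Public License (GPL)",
}
print(sanity_flags(gephi))
print(sanity_flags(ffopenwrt))
```
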
It would also help to have a separate year field. I think the year can be parsed out of the Google Melange URLs, but a dedicated entry for the GSoC year would still be useful.
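
Parsing the year from a Melange archive link could look roughly like this. The pattern is inferred only from the two URLs quoted above, so links with a different shape simply return None.

```python
import re

# Matches e.g. https://www.google-melange.com/archive/gsoc/2009/orgs/gephi
MELANGE_YEAR = re.compile(r"google-melange\.com/archive/gsoc/(\d{4})/")


def gsoc_year(link):
    """Return the GSoC year embedded in a Melange archive URL, or None."""
    match = MELANGE_YEAR.search(link)
    return int(match.group(1)) if match else None


print(gsoc_year("https://www.google-melange.com/archive/gsoc/2009/orgs/ffopenwrt"))
```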

In addition to what @iCHAIT suggested in https://github.com/mzfr/GSoC-Data/issues/6:

Orgs

  1. Domain name analysis

    For each org's website, we can see how many use .edu, .org, or other domains. This gives a rough idea of the number of universities taking part. We can also see how many orgs use a wiki or Google Docs pages to host their ideas.

  2. Email domain name analysis

    Same as above. There will be some cleaning and standardization required though.

  3. Nature of org

    • Word frequency analysis (after removing stopwords) and word clouds on the orgs' about details. This will give a sense of the areas the orgs work in.
    • Yearly trend of words from the orgs' about details. This will show the emerging areas in open source and the tech domain in general.
    • Orgs can also be analysed by topic. Topic labeling will probably be manual, based on some keywords. Some interesting questions to explore:
      • How many organizations work in education, finance, video, Linux, ML, science, Android?
      • How many Python modules have their own orgs?
      • How many programming languages have their own orgs, and how have their projects and students grown over the years?
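
The domain name and word frequency ideas above can be sketched with the standard library alone. This assumes a unified record with hypothetical `website`, `year` and `about` fields. The stopword set is a tiny placeholder, and the sample records are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Placeholder list; a real analysis would use a fuller stopword list.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}


def tld(url):
    """Return the last label of the host name, or None if there is no host."""
    host = urlparse(url).hostname  # also handles protocol-relative //host/path
    return host.rsplit(".", 1)[-1] if host and "." in host else None


def tld_counts(records):
    """Count orgs per top-level domain, skipping values without a host."""
    return Counter(t for t in (tld(r["website"]) for r in records) if t)


def word_counts_by_year(records):
    """Map year -> Counter of non-stopword tokens from the about text."""
    by_year = {}
    for r in records:
        words = re.findall(r"[a-z]+", r["about"].lower())
        counter = by_year.setdefault(r["year"], Counter())
        counter.update(w for w in words if w not in STOPWORDS)
    return by_year


# Made-up sample records, for illustration only.
sample = [
    {"year": 2009, "website": "http://gephi.org/", "about": "Graph visualization of the network"},
    {"year": 2009, "website": "//wiki.freifunk.net/Ideas", "about": "Wireless network firmware"},
    {"year": 2010, "website": "not a url", "about": "Tools for education"},
]
print(tld_counts(sample))
print(word_counts_by_year(sample)[2009].most_common(1))
```

Comparing the per-year counters gives the yearly word trend. Manual topic labels could then be attached by checking for chosen keywords in the same token lists.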

Projects

  1. Nature of project (Similar to nature of org analysis)

    Project title and details analysis. (iCHAIT covered this)

Projects X Orgs

  1. Student retention rate

    • How long students remain with a single org. This can only be done for students with more than one GSoC.
    • What does retention look like? Do they remain only as students, or do they also become mentors?
  2. Student conversion

    • Conversion of a student into a mentor
    • After how many GSoCs a student converted
  3. Creating and visualizing the org to project to mentor to student graph.
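
A rough sketch of the retention and conversion counts, assuming participation records with hypothetical `person`, `year`, `org` and `role` fields. Whether the existing JSON files carry student and mentor names for every year is not something this sketch checks, and the sample rows are invented.

```python
from collections import defaultdict


def org_tenure(participations):
    """Map (student, org) -> number of student years with that org.

    Only students with more than one GSoC overall are included.
    """
    per_org = defaultdict(set)
    overall = defaultdict(set)
    for p in participations:
        if p["role"] == "student":
            per_org[(p["person"], p["org"])].add(p["year"])
            overall[p["person"]].add(p["year"])
    return {key: len(years) for key, years in per_org.items()
            if len(overall[key[0]]) > 1}


def gsocs_before_mentoring(participations):
    """Map person -> number of student years before their first mentor year."""
    student_years = defaultdict(set)
    first_mentor = {}
    for p in participations:
        if p["role"] == "student":
            student_years[p["person"]].add(p["year"])
        elif p["role"] == "mentor":
            prev = first_mentor.get(p["person"], p["year"])
            first_mentor[p["person"]] = min(prev, p["year"])
    result = {}
    for person, first in first_mentor.items():
        count = sum(1 for y in student_years.get(person, ()) if y < first)
        if count:  # skip mentors who were never students beforehand
            result[person] = count
    return result


# Invented sample rows, for illustration only.
rows = [
    {"person": "alice", "year": 2010, "org": "orgA", "role": "student"},
    {"person": "alice", "year": 2011, "org": "orgA", "role": "student"},
    {"person": "alice", "year": 2012, "org": "orgA", "role": "mentor"},
    {"person": "bob", "year": 2010, "org": "orgB", "role": "student"},
    {"person": "carol", "year": 2011, "org": "orgA", "role": "student"},
    {"person": "carol", "year": 2012, "org": "orgB", "role": "student"},
]
print(org_tenure(rows))
print(gsocs_before_mentoring(rows))
```

The same rows could also serve as the edge list for the org to project to mentor to student graph, once a project field is added.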

Have fun. :smiley: