notconfusing / WIGI

Wikipedia Gender Index (WIGI) uses Wikidata to produce gender-related statistics on Wikipedia biographies
MIT License

Factor out code in IPython notebooks into separate files #17

Open hargup opened 9 years ago

hargup commented 9 years ago

@notconfusing can you brief me on how you generated the snapshot_data? I should be able to write a script to generate it at regular intervals.

hargup commented 9 years ago

@notconfusing some of the files you have used in the notebooks like helpers/world_cultures_shortcut.json and helpers/wiki_code_map.json are not present in this repository. Can you please add them?

hargup commented 9 years ago

I'm creating a basic Python package for WIGI at https://github.com/notconfusing/WIGI/tree/hargup/refactoring. My current approach is to move recurring pieces of code into the package, and then use that code from the package to reproduce the notebooks. I would like to completely decouple data retrieval, data processing and data presentation.
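
A minimal sketch of what that three-stage split could look like, using only the standard library. The function names (`retrieve`, `process`, `present`) and the `country`/`gender` column names are assumptions made up for illustration; they are not taken from the notebooks or from the actual snapshot_data format.

```python
import csv
import io
from collections import Counter


def retrieve(source):
    """Data retrieval: read rows from an open CSV file-like object."""
    return list(csv.DictReader(source))


def process(rows, group_by, gender_field="gender"):
    """Data processing: count biographies per (group, gender) pair.

    The column names are hypothetical placeholders.
    """
    counts = Counter()
    for row in rows:
        counts[(row[group_by], row[gender_field])] += 1
    return counts


def present(counts, out):
    """Data presentation: write the counts as a CSV ready for plotting."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["group", "gender", "count"])
    for (group, gender), n in sorted(counts.items()):
        writer.writerow([group, gender, n])


if __name__ == "__main__":
    # Made-up sample rows, only to show the stages chained together.
    sample = io.StringIO(
        "country,gender\nFR,female\nFR,male\nFR,male\nIN,female\n"
    )
    out = io.StringIO()
    present(process(retrieve(sample), group_by="country"), out)
    print(out.getvalue())
```

Because each stage only depends on the return value of the previous one, the retrieval step could later be swapped for something that reads the real snapshot files without touching the processing or presentation code.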

notconfusing commented 9 years ago

@hargup fantastic plan on decoupling all the separate stages.

\me inhales deeply. OK, snapshot_data comes from this Java program: https://github.com/notconfusing/WIGI/blob/master/GenderIndexProcessor.java It's the thing we will have to run every week. In order to run it, you need Wikidata Toolkit (WDTK). I want to get this happening on Wikimedia Labs because the ~2GB Wikidata dump that it needs would be available over the local network rather than as a big download. However, if it helps, you can just run WDTK locally for now.

BTW, when you say "package", do you mean making a "pip" package?

hargup commented 9 years ago

Yes, when I say package I mean standalone software which can be installed using pip or other package managers.

notconfusing commented 9 years ago

As per #8, first focus on Gender by Culture, Gender by Country (World Map), Gender by Date of Birth, and Wikipedia Language by Gender.

notconfusing commented 9 years ago

I've created one big Python script, gender-index-processing-standalone.py, that makes the graphable CSVs. So I'm not sure how this affects making a pip package, or refactoring. We don't really need the ipynb's except for demonstration purposes, so I'm going to move this to phase D.