hodsonjames / employment

Repository for code related to research projects on global employment dynamics.
MIT License
2 stars 11 forks

Modularization/API/PDF output for review #20

Closed aentum closed 4 years ago

aentum commented 4 years ago

The unprocessed data in CSV format should go under the data/(category) directory. employee.py, records.py, entryProcessor.py, and main.py all live together in the "scripts" directory.

To run from the command line, I specified the API as:
python main.py (directory containing csv files) (primary skills of interest) (name of output file)
where the primary skills of interest are comma-separated strings. Tickers of interest are inferred from file names. Additionally, you can write "all" for primary skills, or exclude specific skill sets with -skill.

Outputs: processed employment status data and annual employment data, each in CSV format under the output/ directory.
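The skill argument described above could be parsed along these lines. This is only a sketch, not the code in main.py: the function names parse_skills, skill_selected, and infer_tickers are hypothetical, the optional parentheses around an excluded skill are taken from the examples in this thread, and the assumption that each CSV file is named after its ticker is mine.

```python
from pathlib import Path


def parse_skills(spec):
    """Split a comma-separated skill spec into (include, exclude) sets.

    include is None when "all" is given or when only exclusions are listed,
    meaning every skill not explicitly excluded is selected.
    """
    include, exclude = set(), set()
    select_all = False
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if token.lower() == "all":
            select_all = True
        elif token.startswith("-"):
            # Accept both "-Skill" and "-(Skill)".
            exclude.add(token[1:].strip().strip("()").strip())
        else:
            include.add(token)
    if select_all or not include:
        return None, exclude
    return include, exclude


def skill_selected(skill, include, exclude):
    """Return True if a record with this primary skill should be kept."""
    if skill in exclude:
        return False
    return include is None or skill in include


def infer_tickers(data_dir):
    """Assumption: each CSV is named after its ticker, e.g. "amzn.csv"."""
    return sorted(p.stem for p in Path(data_dir).glob("*.csv"))
```

With this sketch, "-(Accounting and Auditing)" yields no include set and one excluded skill, so every other primary skill passes the filter.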

Examples:

Deloitte or PwC employees with primary skill Accounting and Auditing:
python main.py ../data/deloitte_pwc "Accounting and Auditing" dp_accounting_and_auditing

All primary skills but Accounting and Auditing:
python main.py ../data/deloitte_pwc "-(Accounting and Auditing)" dp_others

Amazon, Google, IBM and Microsoft employees with software engineering as primary skill:
python main.py ../data/tech "Software Engineering" tech

For generating Jupyter notebooks on different data, I found that papermill could be our solution. I wrote a script, nb_generator.py, that runs the Jupyter notebook on the given datasets and outputs a PDF file with the code hidden. Try running:
python nb_generator.py dp_accounting_and_auditing.ipynb "outputs/dp_accounting_and_auditing.csv" "outputs/dp_accounting_and_auditing_by_year.csv"
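For reviewers unfamiliar with papermill, the general approach can be sketched as follows. This is not the contents of nb_generator.py: the function names and the notebook parameter names status_csv and by_year_csv are hypothetical and would have to match the cell tagged "parameters" in the notebook. It assumes papermill and nbconvert are installed, and PDF export through nbconvert also needs a working LaTeX installation.

```python
from pathlib import Path


def executed_path(notebook):
    """Name for the executed copy, so the template notebook is left untouched."""
    nb = Path(notebook)
    return nb.with_name(nb.stem + "_executed.ipynb")


def render_pdf(notebook, status_csv, by_year_csv):
    """Run the notebook on the given CSV files and export a code-free PDF."""
    # Imported lazily so executed_path works without these packages.
    import papermill as pm
    from nbconvert import PDFExporter

    executed = executed_path(notebook)
    # papermill injects these values after the cell tagged "parameters".
    # The parameter names here are hypothetical.
    pm.execute_notebook(
        str(notebook),
        str(executed),
        parameters={"status_csv": status_csv, "by_year_csv": by_year_csv},
    )

    exporter = PDFExporter()
    exporter.exclude_input = True  # hide code cells, keep their outputs
    pdf_bytes, _resources = exporter.from_filename(str(executed))

    pdf = executed.with_suffix(".pdf")
    pdf.write_bytes(pdf_bytes)
    return pdf
```

Writing the executed notebook to a separate file keeps the original usable as a template for the next dataset.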

aentum commented 4 years ago