The unprocessed CSV data should go under the data/(category) directory. employee.py, records.py, entryProcessor.py, and main.py all live in the same directory, scripts/.
To run from the command line, I specified the API as:
python main.py (directory containing csv files) (primary skills of interest) (name of output file)
Primary skills of interest are given as comma-separated strings. Tickers of interest are inferred from the file names. You can also write "all" for primary skills, or exclude a specific skill by prefixing it with "-".
Outputs: processed employment status data and annual employment data, each as a CSV file under the output/ directory.
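The skill-filter argument could be parsed along these lines (a minimal sketch; the actual parsing logic in main.py may differ):

```python
def parse_skill_filter(arg):
    """Parse the primary-skills CLI argument.

    Returns (include_all, included_skills, excluded_skills), where
    "all" sets include_all and a leading "-" marks an exclusion,
    e.g. "-(Accounting and Auditing)".
    """
    include, exclude = set(), set()
    include_all = False
    for token in arg.split(","):
        token = token.strip()
        if token.lower() == "all":
            include_all = True
        elif token.startswith("-"):
            # Strip the "-" prefix and optional parentheses around the name
            exclude.add(token.lstrip("-").strip("()"))
        else:
            include.add(token)
    return include_all, include, exclude
```

For example, `parse_skill_filter("-(Accounting and Auditing)")` yields an empty include set and `{"Accounting and Auditing"}` as the exclusion set.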
Examples:
Deloitte or PwC employees with primary skill Accounting and Auditing:
python main.py ../data/deloitte_pwc "Accounting and Auditing" "dp_Accounting and Auditing"
All primary skills but Accounting and Auditing:
python main.py ../data/deloitte_pwc "-(Accounting and Auditing)" dp_others
Amazon, Google, IBM and Microsoft employees with software engineering as primary skill:
python main.py ../data/tech "Software Engineering" tech
For generating Jupyter notebooks on different data, I found that papermill could be our solution. I wrote a script, nb_generator.py, that runs the Jupyter notebook on the given datasets and outputs a PDF file with the code hidden. Try running:
python nb_generator.py dp_accounting_and_auditing.ipynb "outputs/dp_accounting_and_auditing.csv" "outputs/dp_accounting_and_auditing_by_year.csv"
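The core of such a script could look like the sketch below, combining papermill for parameterized execution with nbconvert's `--no-input` flag to hide code in the PDF. The parameter names `status_csv` and `by_year_csv` are hypothetical; the template notebook would need a cell tagged "parameters" defining whatever names the real script uses.

```python
import subprocess

def build_pdf_command(notebook_path):
    # nbconvert's --no-input flag exports the notebook without code cells
    return ["jupyter", "nbconvert", "--to", "pdf", "--no-input", notebook_path]

def run_notebook(template_nb, output_nb, status_csv, by_year_csv):
    import papermill as pm  # third-party: pip install papermill
    # Execute the template, injecting the dataset paths as parameters
    # (hypothetical parameter names for illustration)
    pm.execute_notebook(
        template_nb,
        output_nb,
        parameters={"status_csv": status_csv, "by_year_csv": by_year_csv},
    )
    # Convert the executed notebook to a code-free PDF
    subprocess.run(build_pdf_command(output_nb), check=True)
```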
Changed the API so that specifying a ticker is no longer necessary. Tickers of interest are inferred from the file names under the directory passed as the first argument, e.g.:
python main.py ../data/tech "Software Engineering" tech
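The inference step might look like this sketch, assuming each CSV file is named with the ticker first (e.g. AMZN_employees.csv); the actual naming convention may differ:

```python
from pathlib import Path

def infer_tickers(data_dir):
    """Collect tickers from CSV file names in data_dir.

    Assumes names like TICKER_rest.csv or TICKER.csv (an assumption,
    not the confirmed convention).
    """
    return sorted({p.stem.split("_")[0].upper() for p in Path(data_dir).glob("*.csv")})
```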
To account for the fact that the Deloitte & PwC file is not in the same format as the other data, I added splitter.py to standardize it.
"Summary Statistics.ipynb" has gone through several small changes and rounds of debugging so that it outputs reasonable plots for any set of data. There may still be issues, since it was previously heavily hard-coded to individual data sets.
Jupyter notebook generating API has not changed.
Added update_edu_dpmt.py, which attempts to update the education level and work department, and adds a feature on academic faculty using the science3 library. Note that I have not been able to run it to completion: some unknown bug seems to prevent it from halting. The error message I got while running the debugger was:
Unable to open 'maxout.py': Unable to read file (Error: File not found science3/.egg/s/thinc-7.3.1-py3.7-linux-x86_64.egg/thinc/neural/_classes/maxout.py)).