hodsonjames / employment

Repository for code related to research projects on global employment dynamics.
MIT License

argparse, filtered data analysis, top skillset extraction #25

Open aentum opened 3 years ago

aentum commented 3 years ago
  1. I adopted the argparse library as you recommended, which should make the input format clearer. I also added comments throughout the code.
  2. I looked into why so few employees end up in the final output file (by_year.json). First, many lines have '-1' as their primary skill. Since primary skill is a major filter in the employee analysis, I treated everyone with an unknown primary skill as not having enough information. It turns out that a significant portion of the data (63%) has no specified primary skill. I experimented briefly with assigning primary skills to these lines using a classification method, but it was ambiguous what an appropriate assignment would be.

Next, I also found many lines with no experience entry marked with "AMZN" in the identifier. As you told me, each line in the Amazon employee data you gave me for testing corresponds to an employee who should have worked at Amazon at some point. But this entry is missing from about 7% of the lines, and I couldn't find a pattern in them.
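For reference, the two checks above can be sketched roughly as follows. This is only an illustration: it assumes one JSON object per line, and the keys `primary_skill`, `experience`, and `company_id`, as well as the `diagnose` helper itself, are hypothetical names that may not match the real data or the code in this pull request.

```python
import json


def diagnose(lines, ticker="AMZN"):
    """Count how many records each filter would drop.

    Assumed (hypothetical) record layout:
      {"primary_skill": "...", "experience": [{"company_id": "..."}, ...]}
    """
    total = unknown_skill = missing_ticker = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        # A primary skill of '-1' (or a missing one) counts as unknown.
        if str(record.get("primary_skill", "-1")) == "-1":
            unknown_skill += 1
        # Flag records with no experience entry carrying the ticker.
        experiences = record.get("experience", [])
        if not any(ticker in str(e.get("company_id", "")) for e in experiences):
            missing_ticker += 1
    return {
        "total": total,
        "unknown_primary_skill": unknown_skill,
        "missing_ticker": missing_ticker,
    }


# Made-up sample records, only to show the call.
sample = [
    '{"primary_skill": "-1", "experience": [{"company_id": "AMZN"}]}',
    '{"primary_skill": "software", "experience": [{"company_id": "AMZN"}]}',
    '{"primary_skill": "sales", "experience": [{"company_id": "GOOG"}]}',
    '{"primary_skill": "-1", "experience": []}',
]
counts = diagnose(sample)
print(counts)
```

Dividing each count by `total` gives the kind of proportions quoted above, and counting the two conditions separately makes it easier to see which filter removes the most employees.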

  3. Extending the ai/non-ai proportion graph, I have been trying to produce the graph you requested: yearly changes in the company's top skill composition. I successfully extracted the top skills for each year. The processed data looks like this, showing the top 10 skills and the number of employees at the company with each skill at the time. image (1)

But I still cannot figure out how to get the graph to look the way I visualized it (something like the sketch below is what I had in mind). I wanted to make the pull request after resolving either 2) or 3), but both still have loose ends and I ended up putting it off for far too long. The program itself should run fine on the raw json data on the server. IMG_4393
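As a rough sketch of the top-skill extraction, and of one way to reshape its output for plotting: the tuple layout `(primary_skill, start_year, end_year)` and the helper names `top_skills_by_year` and `to_series` are assumptions made for illustration, not the structures used in this pull request.

```python
from collections import Counter


def top_skills_by_year(employees, years, top_n=10):
    """Return {year: [(skill, headcount), ...]} with the top_n skills per year.

    employees: iterable of (primary_skill, start_year, end_year) tuples
    (hypothetical layout). Unknown skills ('-1') are skipped.
    """
    result = {}
    for year in years:
        counts = Counter(
            skill
            for skill, start, end in employees
            if skill != "-1" and start <= year <= end
        )
        result[year] = counts.most_common(top_n)
    return result


def to_series(by_year):
    """Reshape into (years, {skill: [headcount per year]}).

    A skill that is not in a given year's top list gets 0 for that year.
    """
    years = sorted(by_year)
    skills = sorted({s for ranked in by_year.values() for s, _ in ranked})
    lookup = {y: dict(by_year[y]) for y in years}
    return years, {s: [lookup[y].get(s, 0) for y in years] for s in skills}


# Made-up employees, only to show the call.
employees = [
    ("software", 2015, 2017),
    ("software", 2016, 2017),
    ("sales", 2015, 2015),
    ("-1", 2015, 2017),
    ("logistics", 2017, 2017),
]
by_year = top_skills_by_year(employees, range(2015, 2018), top_n=2)
years, series = to_series(by_year)
print(years, series)
```

If the target turns out to be a stacked area chart (a guess, since the intended look is only shown in the photo), the reshaped data could be passed to matplotlib as `plt.stackplot(years, *series.values(), labels=series.keys())`. One caveat of this reshaping is that a skill falling out of the top list drops to 0 rather than to its true headcount.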