hodsonjames / employment

Repository for code related to research projects on global employment dynamics.
MIT License

Summary Statistics #7

Open c-forrest opened 5 years ago

c-forrest commented 5 years ago

Also, break these down by the following attributes available in the data: (a) birth year/buckets, (b) secondary skills, (c) education level/eliteness/major/department, (d) gender, (e) country, (f) next job industry, (g) others you think might be interesting.

c-forrest commented 5 years ago

@hodsonjames

Hi James,

I am writing to ask for further clarification on the definition of "hiring/firing". The code can be found in my branch.

Some examples of the accountants' career timelines are shown in the output of the last block of the sheet. As you can see, the timelines are not very clear: there may be overlaps or gaps between the dates of two consecutive entries. When we consider the invalid month records, things get even worse.


Here are some examples:

(Only employment entries are kept) (Variables: row_id, start_date, f_valid_start, end_date, f_valid_end, f_current, ticker) ['f' for 'flag']

1 2002-09-01 True 2003-02-01 True False D
2 2003-02-01 True 2004-07-01 True False D
3 2004-09-01 True 2005-02-01 True False D
4 2005-09-01 True 2008-09-01 True False D
5 2008-08-01 True 2013-07-01 True False D

A large gap occurs between entries 3 and 4, and a small overlap between entries 4 and 5. Should the gap be treated as leaving followed by re-entering, or simply be ignored? If the latter, how should we calculate this person's tenure?

0 2014-06-01 True 2014-08-01 True False P
1 2015-03-01 True 2015-03-01 True False
2 2016-08-01 True 2017-07-01 True False
3 2017-10-01 True None False True P

This is an example of entering PwC twice. I think it should clearly be treated as two separate records. Am I correct?

A similar example:

0 2000-02-01 True 2004-01-01 True False D
1 2004-01-01 True 2009-02-01 True False P
2 2009-02-01 True 2010-03-01 True False P
3 2010-11-01 True 2013-07-01 True False
4 2013-07-01 True 2013-11-01 True False D

0 None False None False True P
1 2006-06-01 True 2011-04-01 True False
2 2011-09-01 True None False True P

Here is a problematic example with no time available for the first entry. Generally speaking, the "None" for the end date of the last (current) entry can be regarded as "today" or more specifically Nov/Dec, 2018. But how should we deal with the "None" for the start dates or for the end dates not in the last entry?


Please let me know what you think of these problems. Looking forward to your ideas.

Best, Honghao

PS: I am new to GitHub. Let me know if this is not the right way to use the platform. Also let me know how you prefer to discuss this kind of problem (by commenting here, by email, or some other way), or whether you would rather not discuss such details until the coding is done.

hodsonjames commented 5 years ago

@c-forrest

Hey Honghao,

I think this is a great way to use github to discuss!

Generally speaking the jobs may overlap or have gaps between them. This might be due to one of the jobs not being "full-time employment"--i.e. a board position, a temporary teaching position, volunteering, etc. I have tried to infer when building these enriched profiles what the correct ordering should be, but often full time work may not look adjacent due to some overlapping activity with strange dates. This is difficult to fix, but hopefully won't occur in most instances.

For gaps between employment, I thought I was generating "TIME_OFF" records that filled the time. Perhaps someone messed with the code and it got bypassed... In any case, time off between jobs is fine. We can use a 3 month+ gap as a proxy for 'firing' for now, and see whether we get significantly different behaviour/trends among the fired vs voluntary leave groups.
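As an illustration of the 3-month rule, here is a minimal sketch. It is not the project's code: the names `months_between` and `classify_exits` are hypothetical, dates are assumed to be month-granular, and every consecutive pair of entries is treated as a separation regardless of ticker, which the thread leaves open for same-firm gaps.

```python
from datetime import date


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (negative on overlap)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def classify_exits(spells, min_gap_months=3):
    """Label the exit from each spell that is followed by another spell.

    `spells` is a list of (start, end) dates sorted by start date.
    A gap of `min_gap_months` or more before the next spell is labelled
    'firing' (only a proxy); a shorter gap or an overlap is 'leaving'.
    """
    labels = []
    for (_, end), (next_start, _) in zip(spells, spells[1:]):
        if end is None or next_start is None:
            labels.append("unknown")  # undated entries cannot be classified
            continue
        gap = months_between(end, next_start)
        labels.append("firing" if gap >= min_gap_months else "leaving")
    return labels


# Dates of entries 1-5 from the first timeline quoted earlier in the thread.
timeline = [
    (date(2002, 9, 1), date(2003, 2, 1)),
    (date(2003, 2, 1), date(2004, 7, 1)),
    (date(2004, 9, 1), date(2005, 2, 1)),
    (date(2005, 9, 1), date(2008, 9, 1)),
    (date(2008, 8, 1), date(2013, 7, 1)),
]
labels = classify_exits(timeline)
```

With these dates only the 7-month gap between entries 3 and 4 reaches the threshold; the 2-month gap and the 1-month overlap do not.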

For your example of leaving PwC and then joining again later. This will happen if people do internships and then end up working there later, or may encompass things like a MBA, and of course some people leaving and coming back (this is rare in my experience). So, yes, let's treat them as separate events for our analysis.

The instance with None dates. Yes, that's annoying, but sometimes people with a long career will note their early career as a block with no dates. We'll just have to ignore these. If there are no dates it means my semi-sophisticated date inference engine didn't work to guess them...

Thanks for getting on this so quickly, and for the insightful questions! James

hodsonjames commented 5 years ago

Just checked the code--looks good for the initial exploration!

I noticed that TIME_OFF is there, so I am not insane :)

Good to include a link to the nbviewer app when sharing ipynb's, so that they are quickly rendered:

https://nbviewer.jupyter.org/github/hodsonjames/employment/blob/honghao_branch/tech-jobs/Summary%20Statistics.ipynb

James

c-forrest commented 5 years ago

@hodsonjames

Hi James,

Thanks a lot for your answers and advice. I have finished the code for reading the raw data and have been working on the descriptive plots. Please find the updated code here.

I have no specific questions for now; I just want to check that I am on the right track. There are many ways to plot the data, so I keep hesitating over what to plot. Please let me know if you have more concrete instructions, and also if you find anything wrong with the current treatment.

Thanks, Honghao

hodsonjames commented 5 years ago

Hi Honghao,

Thanks for sending this along so promptly--this is a great start!

The plots are already quite informative, but I would ask for the following:

I will set up a call with the group once we have these additional plots, and we can see where to go next.

Thanks! James

hodsonjames commented 5 years ago

Also, good job noticing that many profiles are being erroneously given 2000 as a birth date, when there is not enough information to suggest otherwise...

hodsonjames commented 5 years ago

Another thing, I didn't look thoroughly, but it looks like you call 'firing' anyone who leaves the firms. If this is the case, then let's call this 'leaving', and let's add an additional analysis that looks at instances that have the 'TIME_OFF' following them, calling these 'firing'.
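The distinction could be sketched as below. How 'TIME_OFF' records are actually encoded in the data is not shown in this thread, so the `kind` and `ticker` keys, the function name `label_departures`, and the sample career are all assumptions made for illustration.

```python
def label_departures(records, firm):
    """Return one label per departure from `firm`.

    `records` is a chronologically ordered list of dicts with the
    hypothetical keys 'kind' ('JOB' or 'TIME_OFF') and 'ticker'.
    A departure followed directly by TIME_OFF is labelled 'firing';
    any other departure is labelled 'leaving'.
    """
    labels = []
    for i, rec in enumerate(records):
        if rec["kind"] != "JOB" or rec["ticker"] != firm:
            continue
        if i + 1 >= len(records):
            continue  # last record: no departure observed
        nxt = records[i + 1]
        if nxt["kind"] == "JOB" and nxt["ticker"] == firm:
            continue  # next job is at the same firm: not a departure
        labels.append("firing" if nxt["kind"] == "TIME_OFF" else "leaving")
    return labels


# Hypothetical career: D, time off, P, D again, then P.
career = [
    {"kind": "JOB", "ticker": "D"},
    {"kind": "TIME_OFF", "ticker": None},
    {"kind": "JOB", "ticker": "P"},
    {"kind": "JOB", "ticker": "D"},
    {"kind": "JOB", "ticker": "P"},
]
d_labels = label_departures(career, "D")
p_labels = label_departures(career, "P")
```

Here the first departure from D is followed by time off and the second by a direct move to P, so they receive different labels.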

c-forrest commented 5 years ago

@hodsonjames

Hi James,

Thanks for your advice. I have updated the code here. Please take a look and let me know what to adjust next. Then I will duplicate and modify the code to generate the summary statistics for all positions at Deloitte and PwC.

Following your advice, I changed the plots to show proportions of the previous year's employment. I also replaced the pie charts for categorical variables with area plots over time, so we can now see how their composition changes.

In addition, I adjusted the entry classification so that leaving and firing are now two separate types, and all summary statistics are computed separately for each. If you would like to treat them as one group, please let me know.

I also included all the variables listed in your first email. However, I did not extract majors from the raw data, mainly because they are poorly organized: they are not classified and are even written in different languages. I have put this aside for now; it can be treated as a separate task later if necessary.

I also tried the interaction of age and gender. Let me know if you are interested in more interactions, and I can show them.

Let me know if you have any other questions.

Best, Honghao

hodsonjames commented 5 years ago

Honghao,

This is really great work--really appreciated!

There should be extracted variables for education department and subfield for many of the profiles, e.g. 'Social Science'. Did you notice these, or are there too few to be useful? I was thinking of using those as the majors indicator, but I have to admit the modelling work I did to extract these was fairly preliminary, so may not be as useful as I had hoped.

Effi, Anastassia, Vlad, and I will meet tomorrow to discuss the project and the plots you made. Once we have had a chance to review we can see what the best next steps should be. To me, so far, this is telling a story of an industry that is restructuring the way it does business (i.e. business development, and product management team focussed), rather than an automation story. The workforce doesn't appear to be getting any more 'intelligent' over time. Maybe others have more insight. Also, do let us know if you think you noticed anything particularly noteworthy in the data...

Best, James

c-forrest commented 5 years ago

Hi James,

As promised, I updated the results for all positions here, with secondary skill replaced by primary skill. The results are largely similar to those for accountants only.

Now for the major variables. The only variables that may contain major information are columns 17 and 18, namely Role and Department. For the education records in the first 10000 entries, column 18 is always blank. Column 17 holds the name of the degree, which is poorly organized, as mentioned previously. For example, there are broad descriptions such as "Master" and "Bachelor of Arts", as well as entries in all kinds of languages, such as "Contador Público, Ciencias Económicas", "Wirtschaftsingenieurwesen", and "סטטיסטיקה, מחשבים ותנ". They are hard to clean. Please let me know if you have any advice on that.

I did not find anything particularly notable in the data. However, I would like to mention my concern that the plots may currently be distorted by noise from several factors. Maybe you have already noticed them, or I have misunderstood what you are going to do; if so, just ignore them:

Best, Honghao

AnastassiaFedyk commented 5 years ago

Hi Honghao,

Thank you for these results! They look great so far.

To help us get a better understanding of some of the patterns, could you please take a look at the following?

Thanks a lot for the great work, and I look forward to seeing the next results! Anastassia

hodsonjames commented 5 years ago

Honghao doesn't have access to the underlying lists of skills. I will try to get these from the data provider over the coming week.

AnastassiaFedyk commented 5 years ago

Thanks, James! That would be great.

In the meantime, Honghao, please take a look at other characteristics of these employees (e.g., job roles, education, etc.) as a starting point for the third bullet item.

James, for the underlying lists of skills, should we just get those for a few "representative" profiles that Honghao identifies and looks into based on other characteristics? There is probably no need to make the dataset more cumbersome with free-form text entries of individual skills at this point.

hodsonjames commented 5 years ago

Yes, we can do that too.

c-forrest commented 5 years ago

@AnastassiaFedyk

Hi Anastassia,

Thank you for your instructions. I will work on them as soon as possible. Here are some questions that I hope you can clarify further:

Thanks, Honghao

c-forrest commented 5 years ago

@AnastassiaFedyk

Hi Anastassia,

Please take a look at the updated file here. Here are two main changes:

Please let me know what changes you would like to make to the presentation of the representative profiles, and for which other parts you also need representative profiles. Once you are satisfied with the presentation, I will do the same for the all-position statistics by primary skill category.

Also, let me know if you have any other questions on that.

Best, Honghao

hodsonjames commented 5 years ago

Hi Honghao,

Yes, you are correct, we should exclude both the -1 skill entries, and the 'Accounting and Auditing' skill. In terms of the representative profile, we are just looking for an easy way to take a look at the kinds of people we are capturing with each skill. This is a sanity check. I think your latest link does this well enough.

A quick note on the job title and department columns. These are actually comma-separated lists. It might be good to treat them as such, to avoid the multiple job titles appearing in the sample profiles. The first entry in the job title list is the normalised full title, followed by additional normalised extracted entities, e.g. 'Research Manager' will also extract 'Manager' as a normalised entity of interest. Each department listed should be treated separately rather than as a unique key.
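A minimal sketch of treating both columns as lists is given below. The helper names (`split_list_field`, `primary_title`, `count_departments`) and the sample values other than 'Research Manager' are hypothetical, and the sketch assumes that no individual title or department itself contains a comma.

```python
from collections import Counter


def split_list_field(value):
    """Split a comma-separated field into stripped, non-empty entries."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def primary_title(job_title_field):
    """Return the first entry (the normalised full title), or None if empty."""
    entries = split_list_field(job_title_field)
    return entries[0] if entries else None


def count_departments(department_fields):
    """Count each listed department separately, not the whole string as a key."""
    counts = Counter()
    for field in department_fields:
        counts.update(split_list_field(field))
    return counts


title = primary_title("Research Manager, Manager")
dept_counts = count_departments(["Technology, Finance", "Finance", ""])
```

Counting departments one by one means a profile listed under two departments contributes to both, which is the behaviour asked for when each department is "treated separately".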

I am in the process of extracting data for a couple more industries for you to push through a similar process, to see whether the same thing that is happening to accountants is happening to key roles in other industries (e.g. banking). I will send this as soon as ready.

Best, James

hodsonjames commented 5 years ago

Hi Honghao,

In the data share you will find a zip file called 'banks_small', which contains profiles from 5 mid-size banks in the US (Bank of Hawaii, Synovus, Silicon Valley Bank, Fifth Third Bank, and Umpqua Bank). Each bank's profiles are in a separate file and follow the same format as before. Employment at a particular bank can be identified by the ticker/exchange column (i.e. do not rely solely on the name, since there may be acquired banks that confuse the column).

We would like to perform a similar analysis to the accountants at PwC/Deloitte, except focussing on the 'Banking and Finance' skill as the primary.

Let me know how you get on with the quick fixes from the previous message, and if there is any confusion as to this new data/task. In the coming days we will further explore 'big banks' and 'big tech' companies as well.

Thanks! James

c-forrest commented 5 years ago

@hodsonjames

Hi James,

Thanks for your instructions. Because of some more urgent work, I will make the fixes you pointed out this Saturday. It will be quick. I will then turn to the new dataset when it is available.

Best, Honghao

c-forrest commented 5 years ago

@hodsonjames

Hi James,

I have updated the code for accountants and for other positions. Please take a look, and let me know if you have further requirements.

Best, Honghao

hodsonjames commented 5 years ago

Hi Honghao,

The new data is in the shared folder with everything else. You can get started with it, and we will figure out the agreement in the next days. As mentioned above, this is the set of 'small banks'.

On the accountants--I noticed that the job title column hasn't been split yet--can you just take the first entry of that comma-separated list for our purposes?

Can you plot the average accountant tenure, and number of promotions within the firm? I would expect that as accountants are getting older at each firm, they would have a longer tenure in recent years, and I am curious to see whether they are being promoted.

Can you explain quickly what the "Gender Availability" chart is showing us?

Can you limit all secondary skills charts to only include secondary skills that have a weight above 10%?

The same for the other positions: only include people with secondary skill above 10% weight...

Thanks! James

c-forrest commented 5 years ago

@hodsonjames

Hi James,

I have found the new datasets and will start working on them. I also signed the agreement just now, just to let you know.

I have split the job titles when counting the occurrences of each normalized title, as shown in "Top 10 entries for each variable with proportions". Or do you mean that the examples below should also show only the first entry?

I will take care of the tenure and promotion plots.

Gender can be 0, 1, or 2, where 0 means no gender information is available. So the "Gender Availability" chart only shows the proportion of entries with gender information available.

For the secondary skills, as I understand it, you want to reduce the number of categories shown in each chart so that each displayed secondary skill category has a weight above 10%, rather than computing summary statistics on a smaller dataset that excludes the categories weighted below 10%. Correct?

I will first implement these as I understand them. Let me know if I have misunderstood anything and I will correct it.

Best, Honghao

c-forrest commented 5 years ago

@hodsonjames

One more question: only two skill2 categories have a weight above 10%. Would you still like to set such a high threshold? Please let me know.

| Secondary Skills | Percentage |
| --- | --- |
| Banking and Finance | 26.39 |
| Business Development | 13.37 |
| Administration | 9.82 |
| Middle Management | 8.25 |
| Legal | 5.22 |
| Operations Management | 4.87 |
| Industrial Management | 4.81 |
| Sales Management | 3.48 |
| Human Resources (Junior) | 2.86 |
| Insurance | 2.75 |
| Human Resources (Senior) | 2.65 |
| Product Management | 1.90 |
| Data Analysis | 1.74 |
| Manufacturing and Process Management | 1.45 |
| Recruiting | 1.38 |
| Technical Product Management | 1.27 |
| CRM and Sales Management | 0.97 |
| Logistics | 0.75 |
| Real Estate | 0.62 |
| Energy, Oil, and Gas | 0.58 |
| Public Policy | 0.55 |
| Non-Profit and Community | 0.55 |
| Military | 0.45 |
| IT Management and Support | 0.31 |
| Construction Management | 0.29 |
| Retail and Fashion | 0.29 |
| Sales | 0.28 |
| Education | 0.22 |
| Social Media and Communications | 0.22 |
| Hospitality | 0.22 |
| Personal Coaching | 0.18 |
| Pharmaceutical | 0.18 |
| Healthcare | 0.18 |
| Musical Production | 0.16 |
| Digital Marketing | 0.16 |
| Mobile Telecommunications | 0.12 |
| Web Development | 0.11 |
| Video and Film Production | 0.10 |
| Graphic Design | 0.07 |
| Visual Design | 0.06 |
| Web Design | 0.06 |
| Software Engineering | 0.05 |
| Electrical Engineering | 0.04 |

hodsonjames commented 5 years ago

Sorry Honghao, I wasn't clear. I meant, only add an instance of skill2 if that person has a higher weight than 10% on their skill2. So, it's not a measure of the proportion of each skill as a skill2, but rather when the person's weight for their skill2 exceeds the threshold. Does that make more sense?
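In other words, the threshold applies to each person's own skill2 weight, not to the aggregate share of a category. A minimal sketch follows, assuming hypothetical field names `skill2` and `skill2_weight`, with the weight stored as a fraction between 0 and 1 (the actual column names and scale are not given in this thread).

```python
from collections import Counter


def count_secondary_skills(profiles, min_weight=0.10):
    """Count skill2 occurrences, keeping a profile only if that person's
    own skill2 weight is strictly above `min_weight`."""
    counts = Counter()
    for profile in profiles:
        skill = profile["skill2"]
        weight = profile["skill2_weight"]
        if skill is None or weight is None:
            continue  # no secondary skill recorded
        if weight > min_weight:
            counts[skill] += 1
    return counts


# Hypothetical profiles; weights are fractions, so 0.10 means 10%.
profiles = [
    {"skill2": "Business Development", "skill2_weight": 0.25},
    {"skill2": "Business Development", "skill2_weight": 0.05},
    {"skill2": "Legal", "skill2_weight": 0.12},
    {"skill2": "Legal", "skill2_weight": 0.10},
    {"skill2": None, "skill2_weight": None},
]
skill_counts = count_secondary_skills(profiles)
```

A category that is rare overall can still appear in the charts under this rule, as long as the individuals who carry it have it at more than 10% weight.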

c-forrest commented 5 years ago

@hodsonjames

Hi James,

Thanks for your reply. I have made the adjustments and will upload them later. I am now working on the small banks' data and have encountered a problem associated with bank mergers and acquisitions. Here are some examples:

In my view, all employment history at a bank and at all banks it has acquired should be recorded. Otherwise, the records before and after an acquisition will be inconsistent. If so, could you please provide a list of the tickers that should be included in our analysis for each of the five banks' files? However, if our data does not include an acquired bank's records from before the acquisition, we may need to consider other ways to deal with this.

Looking forward to your reply.

Thanks, Honghao

hodsonjames commented 5 years ago

Hi Honghao,

Thanks for the questions!

Actually, it looks like my processing of these files did not work as anticipated. I just uploaded new versions. Please take the new banks_small.zip instead. You should now be able to ignore the company name--the ticker should be properly compiled...

For the Umpqua question. You can ignore this, and treat the change of ticker (if there is one) as the ground truth.

Best, James

hodsonjames commented 5 years ago

I am going to upload replacement files for all verticals (banks_big, tech), so please re-download these when you get to them :)

c-forrest commented 5 years ago

@hodsonjames

Hi James,

Thanks for your quick responses and the updated datasets. Here is the code for, respectively, accountants, other positions in the Deloitte and PwC data, and all positions in the small banks data.

Here are some noteworthy points when dealing with the small banks' data:

Let me know if you have any advice on the plots. In the meantime, I am working on large banks and other datasets available. They will be done soon.

Best, Honghao

hodsonjames commented 5 years ago

Thanks Honghao,

Could you update the text in the "all positions" file? It looks like it is still talking about "all accountants", so might end up being confusing later when we re-read it.

I made a mistake when processing the data for the banks and tech. I was off by one on the ticker assignment. I just uploaded the data again. Please update one last time (hopefully)!

Best, James

c-forrest commented 5 years ago

@hodsonjames Hi James,

Here is the list of links to the reports:

Let me know if you have any other requirements on these reports.

You may also want to see the employment changes exclusively for some primary skill group in the banks and tech companies. This is possible for those big banks and tech companies. If so, please let me know the details.

Best, Honghao

hodsonjames commented 5 years ago

Hi Honghao,

Thanks for this!

Some next steps to keep us moving forward:

Lastly, I am about to open a separate issue for a new task, related to these, but this thread is getting long, so would be good to keep the conversations separate.

Thanks! James

c-forrest commented 5 years ago

@hodsonjames

Hi James,

I am working on the small modifications you listed. Here is the progress so far, with my comments and questions:

  • [x] For BOH, the employment chart looks strange. In the data I sent you, I am able to count several hundred employees in the latter years, but the chart shows around 100 only. Could you check you are using the latest files and they are being parsed correctly? Also good to check for big banks and tech.

I spent a good deal of time on this. It turned out that I had added a filter when reading the data: I dropped all entries with no primary skill available (i.e., "-1"). This accounted for most of the difference between my counts of around 100 and your counts of over 500.

However, while investigating, I found another problematic case: an employment record may have no valid entry date but a valid exit date. Since I computed employment by subtracting cumulative leaving (firing included) from cumulative hiring, such records led me to underestimate employment. In the end this had only a small effect on the level, but fixing it did eliminate the unreasonable fluctuations in the BOH curve, whose sample is not large.

To fix this, I added a new block of code that counts employment independently of the recorded employment changes. It excludes observations without a valid start date, and those without a valid end date that are not identified as the current job. The employment figures should now be more accurate.

  • [x] The month composition plots look very crowded--difficult to read which plot refers to which bank--can you split them up?

Yes, I have done this.

  • [x] Please focus on the primary skill "Banking and Finance" for the banks, and "Software Engineering" for the tech companies, so we can see the trends for these types of key roles.

OK, but I am still hesitant to do this for the small banks, since it would drop more than half of the observations. Please let me know if you want it anyway.

  • [ ] Can we filter out interns from the analyses? For our questions, interns are a source of noise. Alternatively, if you can plot them separately, we may see interesting trends in the "human capital of interest" to these firms.

I will keep trying. However, a preliminary test shows that removing only the entries whose job titles contain "intern", "internship", "trainee", or similar does not change the results much: the removed entries amount to about 10k person-years, while the original sample has 770k person-years.

I am wondering whether you have a better filter for this. A crude alternative would be to drop very short spells, but that seems too crude.

  • [x] There seems to be a typo in one of the departments, which leads "Technology" and "Tehnology" to be separate--could you group these?

Sure, I have done this.

  • [ ] For the job titles, are you using just the first comma-separated entry, or all of them? I see some odd instances like "Senior". We should only be using the first entry of those lists.

Sorry if I misunderstood you, but as I understand it, you mean using "Special Assets Officer" from "Special Assets Officer, officer", rather than "officer", when counting frequencies. If so, I am concerned that the wording may be too varied to aggregate: since each job role then has only one or two observations, the top 10 job roles become meaningless. That is why I used the whole list instead, weighting each string in a list equally so that the weights sum to one.

If we insist on using the first entry of the job title list, we might need to spend some time cleaning the strings (parentheses, abbreviations, foreign languages) to make them more comparable. Please let me know if you think this is necessary.
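The independent employment count described under the first item above could look roughly like this. It is a minimal sketch, not the notebook's actual code: the field names follow the variable list quoted earlier in the thread (start_date, f_valid_start, end_date, f_valid_end, f_current), while the function names, the sample records, and the treatment of a current job as open-ended are assumptions.

```python
from datetime import date


def is_usable(spell):
    """Keep a spell only if it has a valid start date and either a valid
    end date or the current-job flag."""
    if not spell["f_valid_start"]:
        return False
    return spell["f_valid_end"] or spell["f_current"]


def headcount_on(spells, day):
    """Count usable spells that are active on `day`."""
    total = 0
    for spell in spells:
        if not is_usable(spell):
            continue
        if spell["f_valid_end"]:
            active = spell["start_date"] <= day < spell["end_date"]
        else:
            # Current job without an end date: treated as open-ended (assumption).
            active = spell["start_date"] <= day
        if active:
            total += 1
    return total


def make_spell(start, end, current):
    """Build a record, deriving the validity flags from missing dates."""
    return {
        "start_date": start, "f_valid_start": start is not None,
        "end_date": end, "f_valid_end": end is not None,
        "f_current": current,
    }


spells = [
    make_spell(date(2014, 6, 1), date(2014, 8, 1), False),
    make_spell(date(2011, 9, 1), None, True),   # current job, no end date
    make_spell(None, None, True),               # no valid start: excluded
    make_spell(None, date(2010, 5, 1), False),  # exit date only: excluded
    make_spell(date(2009, 1, 1), None, False),  # no end, not current: excluded
]
mid_2014 = headcount_on(spells, date(2014, 7, 1))
early_2015 = headcount_on(spells, date(2015, 1, 1))
```

Counting active spells directly avoids the problem with the cumulative approach, where a record with an exit date but no entry date is subtracted without ever having been added.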


Looking forward to your ideas. I expect this can be done by this weekend and we can quickly turn to the next task.

Best, Honghao

hodsonjames commented 5 years ago

Hi Honghao,

Apologies for not seeing this earlier. Thanks for the email reminder!

Ok, I agree with you on the small banks being too small to get good results after segmentation/filtering, so let's leave that as is for now. It also makes sense what you said about interns if they make up only a small portion of the observations--just use the simple filter as you mentioned and we can live with some noise remaining in the data.
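A simple title filter of the kind mentioned here might be sketched as follows. The keyword list comes from the earlier comment; the word-boundary matching is an added assumption, chosen so that titles such as "Internal Auditor" are not caught by a plain substring test for "intern". The names `INTERN_PATTERN` and `is_intern` and the sample titles are hypothetical.

```python
import re

# Match whole words only, so "Internal" and "International" are not caught.
INTERN_PATTERN = re.compile(r"\b(?:interns?|internships?|trainees?)\b", re.IGNORECASE)


def is_intern(title):
    """Return True if the job title looks like an intern or trainee role."""
    return bool(title) and INTERN_PATTERN.search(title) is not None


titles = ["Summer Intern", "Internal Auditor", "Audit Internship",
          "International Tax Manager", "Graduate Trainee"]
kept = [t for t in titles if not is_intern(t)]
```

Some noise will remain either way, for example interns whose titles do not mention the role at all.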

On the job titles, let's use the first entry alone without further cleaning, and see what the results look like. For these big companies I do expect that the top 10 job titles will still be reasonable, but you are right that we'll have a long tail of 50-60% of the sample with only a couple of observations each. If it looks like we need to dig into these to get reasonable results, we can look to get some quick wins.
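The two counting schemes discussed in this exchange can be compared side by side. This is only a sketch under stated assumptions: the function names are hypothetical, "Special Assets Officer, officer" is the example from the thread, and the other two sample titles are invented for illustration.

```python
from collections import Counter, defaultdict


def split_titles(field):
    """Split a comma-separated job title field into stripped entries."""
    return [part.strip() for part in field.split(",") if part.strip()]


def first_entry_counts(fields):
    """Count only the first entry of each list (the normalised full title)."""
    counts = Counter()
    for field in fields:
        entries = split_titles(field)
        if entries:
            counts[entries[0]] += 1
    return counts


def equal_weight_counts(fields):
    """Use every entry, weighting the entries of one list equally so that
    each record contributes a total weight of one."""
    totals = defaultdict(float)
    for field in fields:
        entries = split_titles(field)
        for entry in entries:
            totals[entry] += 1.0 / len(entries)
    return dict(totals)


sample = ["Special Assets Officer, officer", "Loan Officer, officer", "Teller"]
top_titles = first_entry_counts(sample).most_common(10)
weighted = equal_weight_counts(sample)
```

The first scheme keeps full titles but spreads observations over many distinct strings; the second pools generic entities such as "officer", which is why odd entries like "Senior" can surface in a top-10 list.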

Were you able to start on the other task in the separate issue in the meantime?

Thanks! James

c-forrest commented 5 years ago

@hodsonjames Hi James,

Please ignore my last comment this morning. I just found a mistake in counting the job title occurrences and corrected it.

Here are the links to the newly updated files:

Please let me know if you have any questions on them.

I have only just begun working on the other issue, because I initially thought the job titles might need more cleaning. It turned out that the problem was caused by errors in my code. I apologize for this and will make up for the lost time. It will be done soon.

Best, Honghao

AnastassiaFedyk commented 5 years ago

Hi Honghao,

Could you make the following adjustments to the summary statistics files?

  1. Small banks: please combine the data for all firms in this category. Some of the firms have few employees, which makes the charts look quite noisy. So for these smaller firms, we want to see how the data looks when it's aggregated together.
  2. For both small banks and big banks: please add a separation based on primary skillsets for all employees (in addition to the current look at secondary skillsets for those with "Banking & Finance" as their primary skillset).
  3. It looks like there is a decrease in hiring of 20-24-year-olds by the big banks, which could potentially be interesting to dig a bit deeper into. Could you please take a look at what kinds of jobs the 20-24-year-olds were being hired into previously that are "disappearing" now?

There is one other task I would like for you to take a look, but it's a bit more general, so I will open a separate issue for that one.

Thank you! Anastassia

c-forrest commented 5 years ago

@AnastassiaFedyk Hi Professor Fedyk,

Here are the links to the files; some of them have been changed according to your requests:

For your Point 3, please refer to this link for a detailed presentation, where I show the changes for the top positions that are most common among 20-24-year-olds.

It turns out that this change mainly happened over the last 10 years. I also decompose the change into two components: the change in the age composition of newcomers within a position, and the change in that position's share of all newcomers at the time.

In my opinion, a decrease in both components suggests that the position may be vanishing, while a decrease in the first component combined with an increase in the second implies that the position now favours more senior hires.

Note that the change may also be attributable to poorer coverage of newcomers in recent years, to more accurate age computation for recent years, or to similar factors.

Let me know if you have any questions or comments on this.
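The decomposition can be written out explicitly. The sketch below is an interpretation of the description above, not the notebook's code: the function names and all hire counts are hypothetical. The first component is the share of 20-24-year-olds among newcomers to a position; the second is that position's share of all newcomers. Their product is the share of all newcomers who are 20-24-year-olds hired into that position.

```python
def decompose(young_in_position, all_in_position, all_newcomers):
    """Return (age_share, position_share) for one position and period."""
    age_share = young_in_position / all_in_position
    position_share = all_in_position / all_newcomers
    return age_share, position_share


def interpret(before, after):
    """Apply the reading suggested in the comment above to two periods."""
    age_down = after[0] < before[0]
    if age_down and after[1] < before[1]:
        return "position may be vanishing"
    if age_down and after[1] > before[1]:
        return "position now favours more senior hires"
    return "no clear reading"


# Hypothetical counts: (20-24 hires in position, all hires in position, all hires).
earlier = decompose(40, 100, 1000)
later_shrinking = decompose(10, 50, 1000)
later_growing = decompose(30, 150, 1000)
reading_shrinking = interpret(earlier, later_shrinking)
reading_growing = interpret(earlier, later_growing)
```

As noted in the comment, either pattern could also reflect changes in data coverage rather than in hiring, so the labels are suggestive only.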

Best, Honghao

hodsonjames commented 5 years ago

Thanks Honghao, apologies for the slow response!

The plots for the 20-24 year olds look very interesting--thanks for putting those together. I also see the updates for the banks.

Would you be able to go through and make sure each ipynb file makes it very clear at the top and throughout which dataset it is referring to? Otherwise we rely on the text of the link you provide to remember which file is which.

Thanks!

c-forrest commented 5 years ago

@hodsonjames Hi James, I just added a floating title to each file by modifying their HTML sources. Please find them using the links above. I hope this solves the problem. Best, Honghao

hodsonjames commented 5 years ago

Thanks!