Closed by jmbejara 6 years ago
UHRSWORKLY, even though we are told to use it later. We should not drop UHRSWORKLY.
UHRSWORKLY and for a histogram of annual_hours
This is how I am subsetting based on GQ. I don't include GQ = 0 because the codebook says something about 0 being NIU. My comment in the HW should reflect this.
# GQ = 0 for vacant units, 1 for Households, 2 for group quarters
df = df[df.GQ == 1]
In Q15, do you mean compute three correlations? So the correlation between ave_wages and median_wages, ave_wages and employment, and median_wages and employment?
There will be a matrix of correlations. The entries of the matrix will have the correlations between each combination of pairs. There is a single command for this.
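As a minimal sketch of what such a matrix looks like, assuming the single command meant here is pandas' DataFrame.corr() (the thread does not name it) and using made-up numbers in a hypothetical summary frame:

```python
import pandas as pd

# Toy yearly summary; the numbers are invented for illustration only.
summary = pd.DataFrame({
    "ave_wages": [20.0, 21.0, 23.0, 22.0],
    "median_wages": [15.0, 16.0, 17.0, 18.0],
    "employment": [0.80, 0.78, 0.82, 0.79],
})

# DataFrame.corr() returns the pairwise Pearson correlations as a square matrix.
corr_matrix = summary.corr()
print(corr_matrix)
```

The three correlations asked about in Q15 are the off-diagonal entries; the diagonal is each variable's correlation with itself.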
In Q20, are we looking to space the bins like [25, 30, 35, 40, 45, 50, 55]? Also, is educ_bins supposed to correspond to the codebook values? So educ_bins should be a list of 5 elements?
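For reference, here is a small sketch of how a list of bin edges like this behaves with pd.cut. The ages are made up, and whether the assignment wants left-closed bins (right=False) is an assumption on my part:

```python
import pandas as pd

# Hypothetical ages, only to show how the edges are applied.
ages = pd.Series([25, 29, 30, 34, 47, 54])
age_bins = [25, 30, 35, 40, 45, 50, 55]

# right=False makes each bin closed on the left: [25, 30), [30, 35), ...
age_group = pd.cut(ages, bins=age_bins, right=False)
print(age_group.value_counts(sort=False))
```

Note that 7 edges produce 6 bins, so a list of n edges always gives n - 1 categories. The same counting applies to educ_bins.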
Is the graph from Q16 supposed to look something like this?
Also, for Q22, how do you remove the average_wage above the Bachelors_Degree, so that in the heatmap in the following question, it doesn't look like this:
This is what mine looks like:
This might help. Here I have run df.describe() at various points.
At Q7:
Before Q11:
With respect to multiindexing, you can do this:
To change the order of the columns, I am doing it manually like this:
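One possible sketch of both steps, not necessarily the code used in this thread. It assumes the table comes from pivot_table with values passed as a list (which is what produces the extra average_wage level), and the frame name tidy, the age_bin and educ columns, and the High_School label are all hypothetical:

```python
import pandas as pd

# Hypothetical long-format data; labels other than Bachelors_Degree are invented.
tidy = pd.DataFrame({
    "age_bin": ["25-29", "25-29", "30-34", "30-34"],
    "educ": ["Bachelors_Degree", "High_School", "Bachelors_Degree", "High_School"],
    "average_wage": [30.0, 20.0, 35.0, 22.0],
})

# Passing values as a list yields two-level columns:
# ('average_wage', 'Bachelors_Degree'), ('average_wage', 'High_School')
wide = tidy.pivot_table(index="age_bin", columns="educ", values=["average_wage"])

# Selecting the top-level key drops that level, leaving only the education labels.
flat = wide["average_wage"]

# Reorder the columns manually by listing them in the desired order.
flat = flat[["High_School", "Bachelors_Degree"]]
print(flat)
```

Passing flat rather than wide to the heatmap should keep the average_wage label off the axis.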
Yeah, our summary statistics are diverging very mildly at Q7. I will go back and double-check what's going on. This is what I have right now from Q4 and Q5, respectively:
Everything here looks good to me. Hmm, I don't know. What are you getting for df.describe() at the point at which it diverges?
Q7:
I think you need to rerun your code from the beginning. My Q7 real_wage max is much larger. It looks like you have already dropped the observations described in Q10 at this point.
What is the employment variable supposed to measure? My assumption was LABFORCE, but we dropped that variable earlier on in the code.
employment was created from the variable in_labor_force. This is because LABFORCE was a variable equal to 0, 1, or 2. The variable we created was True or False (1 or 0).
Okay. Do you want it to be the average (fraction employed) or the sum (total employed)?
@jmbejara Hmm, I don't know what's going on. At which question(s) did you drop the missing values? I dropped them (df.dropna(axis=0, how='any')) either at the end of my code for Q4 or at the beginning of my code for Q5.
@Jacob-Bishop I was looking for the fraction. Also, be sure to take a weighted average.
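A minimal sketch of a weighted fraction by year, on made-up data. The weight column name PERWT and the grouping by YEAR are assumptions here, so substitute whatever the assignment actually uses:

```python
import numpy as np
import pandas as pd

# Toy data; PERWT is assumed as the person-weight column name.
df = pd.DataFrame({
    "YEAR": [2000, 2000, 2000, 2001, 2001],
    "in_labor_force": [True, False, True, True, False],
    "PERWT": [1.0, 3.0, 1.0, 2.0, 2.0],
})

def weighted_fraction(group):
    # The weighted mean of a True/False column is the weighted share of True values.
    return np.average(group["in_labor_force"].to_numpy(),
                      weights=group["PERWT"].to_numpy())

employment = pd.Series({
    year: weighted_fraction(group) for year, group in df.groupby("YEAR")
})

# Unweighted fraction, for comparison only.
unweighted = df.groupby("YEAR")["in_labor_force"].mean()
print(employment)
print(unweighted)
```

In this toy data the weighted and unweighted fractions differ for 2000 because the person who is not in the labor force carries the largest weight.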
@afgong Sorry for this confusion. I have updated my code so that it only drops missing values at the specific points where I say to drop them in the problem descriptions. This changes my Q7 describe to the following:
At this point, I only drop rows at the end of Q6, calling df = df.dropna().
Sorry about this. If your answers look reasonably close, I wouldn't worry too much about this. I've instructed Philip to be generous with the grading in this regard. (Also, it's been interesting to me how little things like this can make replication so challenging.)
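To illustrate why the point at which missing values are dropped matters for df.describe(), here is a small sketch on invented numbers (the column values are hypothetical, not the homework data):

```python
import numpy as np
import pandas as pd

# Made-up data: the row with the largest real_wage has a missing annual_hours.
df = pd.DataFrame({
    "real_wage": [10.0, 500.0, 12.0, 15.0],
    "annual_hours": [2000.0, np.nan, 1800.0, 2080.0],
})

# dropna() defaults to axis=0, how='any', so these two calls are equivalent.
assert df.dropna().equals(df.dropna(axis=0, how="any"))

max_before = df["real_wage"].max()          # computed before dropping rows
max_after = df.dropna()["real_wage"].max()  # the row with a missing value is gone
print(max_before, max_after)
```

Any statistic computed before the drop still includes rows that have a missing value in some other column, which is enough to make two otherwise identical scripts disagree.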
Thank you so much!!!
Here is a list of known typos and other improvements to make in HW 4. I haven't corrected these particular ones yet. If anyone notices any other typos, please let me know here.