vdquadros / immigration_enclave

Data checks #5

Closed vdquadros closed 5 years ago

vdquadros commented 5 years ago

First attempt at replicating Table 1

| | Working age population | Share of US population | Pct Immigrant | Pct Hispanic | Pct Minority | Pct Dropout | Pct High school | Pct Some college | Pct college or more | Mean wage2 |
|---|---|---|---|---|---|---|---|---|---|---|
| All US | 206238 | 100.1 | 12 | 11 | 25 | 15 | 42 | 22 | 22 | 66.73 |
| Larger czones (top 100) | 115544 | 56.1 | 17 | 14 | 32 | 14 | 38 | 22 | 26 | 66.355 |
| Rest of country | 90694 | 44 | 4.8 | 6.9 | 16 | 15 | 46 | 22 | 17 | 67.209 |
| 1st largest czone | 8820 | 4.3 | 41 | 39 | 58 | 22 | 33 | 22 | 23 | 78.18 |
| 2nd largest czone | 6544 | 3.2 | 37 | 22 | 51 | 16 | 36 | 19 | 30 | 85.555 |
| 3rd largest czone | 4510 | 2.2 | 23 | 17 | 41 | 14 | 35 | 21 | 30 | 66.851 |
| 4th largest czone | 3245 | 1.6 | 29 | 17 | 39 | 11 | 37 | 18 | 34 | 69.28 |
| 5th largest czone | 3021 | 1.5 | 8.8 | 5.9 | 29 | 11 | 42 | 19 | 28 | 67.813 |
| 6th largest czone | 2989 | 1.5 | 8.2 | 2.9 | 26 | 11 | 41 | 24 | 24 | 67.921 |
| 7th largest czone | 2906 | 1.4 | 16 | 6.4 | 16 | 9.1 | 34 | 20 | 37 | 62.88 |
| 8th largest czone | 2821 | 1.4 | 22 | 9.3 | 42 | 10 | 30 | 19 | 42 | 58.961 |
| 9th largest czone | 2657 | 1.3 | 33 | 17 | 46 | 11 | 28 | 24 | 37 | 69.416 |
| 10th largest czone | 2643 | 1.3 | 25 | 27 | 49 | 20 | 36 | 20 | 24 | 71.92 |
| 11th largest czone | 2527 | 1.2 | 13 | 6.7 | 39 | 13 | 36 | 21 | 30 | 61.567 |
| 12th largest czone | 2238 | 1.1 | 13 | 5 | 17 | 8.1 | 35 | 27 | 30 | 62.136 |

vdquadros commented 5 years ago

Second attempt at replicating Table 1

Card's MSAs that are in our dataset but are not in our top 124

Card's MSA Position in our dataset Name of the place in our dataset PMSA FIPS Code here MSA/CMSA FIPS code Do we have it under a different MSA code
9240 158 worcester, ma Worcester, MA-CT PMSA
5480 136 new haven-meriden, ct New Haven-Meriden, CT PMSA
4280 169 lexington-fayette, ky Lexington, KY MSA
3660 148 johnson city-kingsport-bristol, tn/va Johnson City-Kingsport-Bristol, TN-VA MSA
3600 253 jacksonville, nc Jacksonville, FL MSA yes
2640 195 flint, mi Flint, MI PMSA
2320 203 elkhart-goshen, in El Paso, TX MSA yes
1160 142 bridgeport, ct Bridgeport, CT PMSA

Card's MSAs that are not in our dataset at all

| Card's MSA | Name of the place here | This location is a subset of | Do we have it under a different MSA code |
|---|---|---|---|
| 875 | Bergen-Passaic, NJ PMSA | New York-Northern New Jersey-Long Island, NY-NJ-CT-PA CMSA | yes |
| 2800 | Fort Worth-Arlington, TX PMSA | Dallas-Fort Worth, TX CMSA | yes |
| 2960 | Gary, IN PMSA | Chicago-Gary-Kenosha, IL-IN-WI CMSA | yes |
| 3640 | Jersey City, NJ PMSA | New York-Northern New Jersey-Long Island, NY-NJ-CT-PA CMSA | yes |
| 5015 | Middlesex-Somerset-Hunterdon, NJ PMSA | New York-Northern New Jersey-Long Island, NY-NJ-CT-PA CMSA | yes |
| 5380 | Nassau-Suffolk, NY PMSA | New York-Northern New Jersey-Long Island, NY-NJ-CT-PA CMSA | yes |
| 5640 | Newark, NJ PMSA | New York-Northern New Jersey-Long Island, NY-NJ-CT-PA CMSA | yes |
| 5775 | Oakland, CA PMSA | San Francisco-Oakland-San Jose, CA CMSA | yes |
| 5945 | Orange County, CA PMSA | Los Angeles-Riverside-Orange County, CA CMSA | yes |
| 7440 | San Juan-Bayamon, PR PMSA | San Juan-Caguas-Arecibo, PR CMSA | |
| 8720 | Vallejo-Fairfield-Napa, CA PMSA | San Francisco-Oakland-San Jose, CA CMSA | yes |
| 8735 | Ventura, CA PMSA | Los Angeles-Riverside-Orange County, CA CMSA | yes |

In our top 124 but not in Card's

Our MSA Name of the place in our dataset PMSA FIPS Code in here MSA/CMSA FIPS Code here Card's equivalent is
1320 Canton, OH 1320
1602 Gary-Hammond-East Chicago, IN 2960 2960
1921 Fort Worth-Arlington, TX 2800 2800
2310 El Paso, TX 2320 2320
3590 Jacksonville, FL 3600 3600
4482 Orange County, CA 5945 5945
5601 Nassau Co., NY 5380 5380
5602 Bergen-Passaic, NJ 875 875
5603 Jersey City, NJ 3640 3640
5604 Middlesex-Somerset-Hunterdon, NJ 5015 5015
5605 Newark, NJ 5640 5640
6520 Provo-Orem, UT 6520
6680 Reading, PA 6680
6960 Saginaw-Bay City-Midland, MI 6960
7361 Oakland, CA 2160 5775
7362 Vallejo-Fairfield-Napa, CA 8720 8720
7680 Shreveport, LA 7680
8730 Ventura-Oxnard-Simi Valley, CA 8735 8735
8780 Visalia-Tulare-Porterville 8780
9280 York, PA 9280
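
The three lists above boil down to set comparisons between Card's MSA codes and ours. Here is a minimal Python sketch of that logic, not the code actually used for the tables. The code sets are tiny illustrative subsets drawn from rows above, 4480 (Los Angeles) is assumed to appear in both lists, and the variable names and the `crosswalk` dictionary are hypothetical.

```python
# Illustrative subsets only; the real lists are much longer.
card_codes = {4480, 9240, 875}          # codes used in Card's files
our_all_codes = {4480, 9240, 1320, 5602}  # every code present in our dataset
our_top_codes = {4480, 1320, 5602}        # our largest cities (top 124 in the thread)

# Card's MSAs that are in our dataset but not in our top list
in_ours_not_top = (card_codes & our_all_codes) - our_top_codes

# Card's MSAs that are not in our dataset at all
not_in_ours = card_codes - our_all_codes

# In our top list but not in Card's
ours_not_card = our_top_codes - card_codes

# Hypothetical crosswalk from our code to Card's equivalent (5602 -> 875,
# Bergen-Passaic, as in the table above). Recoding before comparing removes
# mismatches that are only due to different coding schemes.
crosswalk = {5602: 875}
our_top_recoded = {crosswalk.get(code, code) for code in our_top_codes}
still_unmatched = our_top_recoded - card_codes

print(sorted(in_ours_not_top))
print(sorted(not_in_ours))
print(sorted(ours_not_card))
print(sorted(still_unmatched))
```

After the recode, only codes with no Card equivalent (1320 in this toy example) remain unmatched, which is the pattern the last table shows for places like Canton, OH.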

Results

| City | rmsa | Working-age population (thousands) | Share of US population | Percent immigrants | Less than high school | College or more | wage2 | cwagesal |
|---|---|---|---|---|---|---|---|---|
| All US | | 175,959 | 100 | 16 | 12 | 28 | 68.5 | 39,761 |
| Larger cities (124) | | 115,000 | 65.4 | 21 | 12 | 31 | 68.5 | 42,991 |
| Rest of country | | 60,959 | 34.6 | 8 | 13 | 21 | 68.6 | 33,500 |
| Los Angeles | 4480 | 5,859 | 3.3 | 48 | 20 | 27 | 76.4 | 41,290 |
| New York | 5600 | 5,712 | 3.2 | 44 | 14 | 35 | 82.8 | 49,704 |
| Chicago | 1600 | 5,144 | 2.9 | 25 | 12 | 34 | 69.1 | 46,513 |
| Washington, DC | 8840 | 3,379 | 1.9 | 25 | 8 | 45 | 63.1 | 55,930 |
| Atlanta | 520 | 3,070 | 1.7 | 17 | 11 | 34 | 64.5 | 44,154 |
| Philadelphia | 6160 | 3,033 | 1.7 | 11 | 8 | 33 | 69.8 | 46,070 |
| Houston | 3360 | 2,919 | 1.7 | 31 | 19 | 27 | 70.1 | 41,581 |
| Detroit | 2160 | 2,648 | 1.5 | 11 | 10 | 27 | 75.2 | 43,831 |
| Dallas | 1920 | 2,532 | 1.4 | 26 | 18 | 30 | 65.5 | 42,606 |
| Phoenix | 6200 | 2,362 | 1.3 | 22 | 14 | 26 | 69.7 | 40,876 |
| Riverside | 6780 | 2,278 | 1.3 | 31 | 19 | 17 | 76.6 | 37,312 |
| Boston | 1120 | 2,067 | 1.2 | 22 | 7 | 46 | 66 | 52,807 |

About wages: Remember that our values for wage were way off Card's values? It turns out that if we look at another wage variable ("cwagesal" instead of "wage2"), the results are quite close.

Before, I was reporting "wage2" since that is the variable Card seems to report in his code for Table 1. I think it's possible that Card changed the variable he reports without changing his code.

Note: the cwagesal column is the mean annual income of those who have positive annual income:

gen cwagesal = incwage 
replace cwagesal = . if incwage <= 0

where incwage is annual wage and salary income (a variable present in the raw dataset).

The wage2 column, by contrast, is the mean hourly wage of those who are not self-employed:

gen wage = . 
replace wage = incwage/annhrs if (annhrs > 0 & incwage > 0 & incwage != 999999)

gen wage2 = wage
replace wage2 = . if selfemp==1 

Note 2: I add the condition "incwage != 999999" because that is the maximum value incwage can take in the dataset, but the condition is not binding once we keep only observations older than 18.

vdquadros commented 5 years ago

Table 2

We see that the total number of people is lower in my table than in Card's, but the other values are pretty close.

Not sure why we would have fewer people, since the only cuts that both Card and I make to the data are the age and experience restrictions.

| Status | Working age population (thousands) | Share of all immigrants (percent) | After 1980 | After 1990 | Mean years completed | Dropouts | 12-15 years | College or more |
|---|---|---|---|---|---|---|---|---|
| Natives | 130,104 | | | | 13.4 | 14.6 | 60.8 | 24.6 |
| Immigrants | 21,824 | 100 | 71.4 | 41.1 | 11.6 | 38.1 | 38.6 | 23.3 |
| By country of origin | | | | | | | | |
| Mexico | 6,898 | 31.6 | 75.6 | 44.5 | 8.1 | 69.9 | 26.4 | 3.6 |
| Philippines | 1,032 | 4.7 | 66.7 | 31.9 | 14.2 | 9.3 | 44.2 | 46.5 |
| India | 768 | 3.5 | 80 | 53.9 | 16.4 | 9.6 | 20 | 70.5 |
| Vietnam | 738 | 3.4 | 76 | 41 | 11.7 | 34.4 | 45.7 | 19.9 |
| China | 659 | 3 | 83.2 | 52 | 14.2 | 24 | 28.7 | 47.4 |
| El Salvador | 652 | 3 | 85.6 | 37.9 | 8.5 | 65.1 | 30.6 | 4.3 |
| Cuba | 527 | 2.4 | 53.2 | 30.2 | 12.5 | 30.2 | 48.5 | 21.3 |
| Dominican Republic | 504 | 2.3 | 74.4 | 38.5 | 10.6 | 48.8 | 42.1 | 9 |
| Canada | 461 | 2.1 | 48.6 | 33.1 | 14.7 | 8.7 | 50.1 | 41.2 |
| Korea | 458 | 2.1 | 68.6 | 37.6 | 14.4 | 10.8 | 45.2 | 44 |
| Germany | 410 | 1.9 | 33.4 | 21.8 | 14.3 | 8.5 | 59.9 | 31.6 |
| Jamaica | 406 | 1.9 | 66.9 | 27.6 | 12.8 | 23.6 | 58 | 18.5 |
| Guatemala | 373 | 1.7 | 84.4 | 46.8 | 8.4 | 64.5 | 30.5 | 5 |
| Colombia | 364 | 1.7 | 72.2 | 41.4 | 12.6 | 24.9 | 53.1 | 22 |
| Haiti | 321 | 1.5 | 75.6 | 35 | 11.8 | 35.6 | 51.2 | 13.2 |
| Poland | 275 | 1.3 | 74.6 | 43.1 | 13.6 | 16.6 | 58.3 | 25 |

econisaac commented 5 years ago

This is pretty good.

Hmm...can you paste in the Card code that has sample restrictions?

(To be honest, the important thing is the shares).

vdquadros commented 5 years ago

These restrictions can be found in his script allnp2 under the 2000 folder, here. The output of this script, supp2000, is then imported in the script table2.

if age>=18;

if (1<=exp<=45); /*sample cut for exp*/

vdquadros commented 5 years ago

First attempt at Table 3

We can see that education, experience and employment rate are pretty close. Mean wage, however, is far off. We can also see that women have higher wages than men, which is wrong. We are also missing an inflation adjustment, but that would only make wages higher for the older years.

Note that the employment rate here is only defined as the percentage of people who report having positive annual hours worked in the past year.

annhrs=weeks*hrswkly;
emp=(annhrs>0);

I have not reported the variance of residual wages because it's going to be wrong given our mean wage.

| | Year | Education | Experience | Employment rate (%) | Mean wage |
|---|---|---|---|---|---|
| Native men | 1980 | 12.4 | 18.2 | 89.4 | 23.69 |
| | 1990 | 13.1 | 18.4 | 88.3 | 32.59 |
| | 2000 | 13.4 | 19.9 | 85.7 | 54.51 |
| | 2005/6 | 13.7 | 21.4 | 86.2 | 59.03 |
| Native women | 1980 | 12.2 | 19.6 | 64.4 | 56.44 |
| | 1990 | 13 | 19.3 | 73.7 | 52.21 |
| | 2000 | 13.5 | 20.6 | 75.9 | 74.11 |
| | 2005/6 | 13.8 | 21.8 | 76.7 | 77.02 |
| Immigrant men | 1980 | 11.5 | 18.6 | 86.7 | 27.24 |
| | 1990 | 11.5 | 17.6 | 85.9 | 35.21 |
| | 2000 | 11.5 | 18.3 | 85.3 | 53.70 |
| | 2005/6 | 12.1 | 19.8 | 90.6 | 44.36 |
| Immigrant women | 1980 | 11 | 20.5 | 58.9 | 64.06 |
| | 1990 | 11.3 | 19.8 | 63.4 | 68.55 |
| | 2000 | 11.7 | 19.8 | 62.9 | 105.63 |
| | 2005/6 | 12.3 | 20.9 | 67.2 | 101.40 |

econisaac commented 5 years ago

I copied the relevant Card code below. I have a conjecture for what is going wrong: Stata treats "." as a very large value, so "if wage>257.5 then wage=257.5;", when translated literally to Stata, also picks up all the missing values.

See here (https://stats.idre.ucla.edu/stata/modules/missing-values/):

As you can see in the output, missing values are listed after the highest value 2.1. This is because Stata treats a missing value as the largest possible value (e.g., positive infinity) and that value is greater than 2.1, so then the values for newvar1 become 0.

(I guess that in life this is a good lesson that you should browse the data after you write each line of code to make sure that the code is doing what you think it is doing....).

annhrs=weeks*hrswkly;
if annhrs>0 and wagesal>0 then wage=wagesal/annhrs;
else wage=.;

chours=annhrs;
if annhrs=0 then chours=.;

owage=wage;
if (0<wage<3.8625) then wage=3.8625;
if wage>257.5 then wage=257.5;
logwage=log(wage);

if abs(selfinc) > 0 then selfemp=1;
else selfemp=0;

wage2=wage;
if selfemp=1 then wage2=.;
logwage2=log(wage2);

ft=(annhrs>1400);
if annhrs=0 then ft=.;
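
To make the conjecture concrete, here is a small Python sketch (not our Stata code) that mimics Stata's ordering rule by treating a missing value as larger than any number. SAS, by contrast, sorts missing values below every number, which would explain why Card's original code does not have this problem. The helper names (`stata_gt`, `topcode_naive`, `topcode_guarded`) and the sample wages are made up for illustration.

```python
import math

MISSING = None  # stands in for Stata's "."


def stata_gt(x, threshold):
    """Compare the way Stata does: a missing value sorts above every number."""
    value = math.inf if x is MISSING else x
    return value > threshold


def topcode_naive(wage, cap=257.5):
    # Literal translation of `if wage>257.5 then wage=257.5`.
    # Under Stata's ordering a missing wage also passes the test.
    return cap if stata_gt(wage, cap) else wage


def topcode_guarded(wage, cap=257.5):
    # Same rule with an explicit missing check, in the spirit of
    # adding `& !missing(wage)` to the Stata condition.
    if wage is MISSING:
        return MISSING
    return cap if wage > cap else wage


wages = [12.0, 300.0, MISSING]  # illustrative values only
naive = [topcode_naive(w) for w in wages]
guarded = [topcode_guarded(w) for w in wages]

print(naive)    # [12.0, 257.5, 257.5]
print(guarded)  # [12.0, 257.5, None]
```

In the naive version everyone without a wage is assigned the top-coded value of 257.5, which would inflate mean wages most for groups with low employment rates. That would be consistent with women showing higher wages than men in the first attempt.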

vdquadros commented 5 years ago

Second attempt at Table 3

Hi Isaac, this is much closer.

I am going to the RDC right now, so I don't have time to work more on the table, but I guess we can still improve a little. Remember that Card had created a variable called "homey" based on MSA and place of birth? That variable is used as an interaction term in the regressions that produce the residual wages, but I still have not created the equivalent variable for our data. For now, I am just using the "nonmover" variable in the regressions instead of "homey" (since for most observations they should be the same). Once I create that variable, I guess we can get even closer to his numbers.

| | Year | Education | Experience | Employment rate (%) | Mean wage | Overall | Residual |
|---|---|---|---|---|---|---|---|
| Native men | 1980 | 12.5 | 18.8 | 90.1 | 24.93143 | 0.380 | 0.285 |
| | 1990 | 13.2 | 18.9 | 89.3 | 23.64488 | 0.445 | 0.316 |
| | 2000 | 13.4 | 20.4 | 86.8 | 25.36714 | 0.472 | 0.330 |
| | 2005/6 | 13.7 | 21.4 | 86.2 | 24.77181 | 0.503 | 0.347 |
| Native women | 1980 | 12.2 | 19.7 | 65.3 | 16.78322 | 0.316 | 0.269 |
| | 1990 | 13 | 19.4 | 74.9 | 17.03596 | 0.380 | 0.294 |
| | 2000 | 13.5 | 20.7 | 77 | 19.45559 | 0.406 | 0.315 |
| | 2005/6 | 13.8 | 21.8 | 76.7 | 19.52567 | 0.447 | 0.327 |
| Immigrant men | 1980 | 11.6 | 19.1 | 87.7 | 24.23933 | 0.434 | 0.327 |
| | 1990 | 11.6 | 18.1 | 87.1 | 21.45033 | 0.500 | 0.357 |
| | 2000 | 11.6 | 18.8 | 86.5 | 22.85249 | 0.545 | 0.415 |
| | 2005/6 | 12.1 | 19.8 | 90.6 | 21.02426 | 0.527 | 0.358 |
| Immigrant women | 1980 | 11 | 20.6 | 60 | 17.17161 | 0.342 | 0.296 |
| | 1990 | 11.3 | 19.9 | 65.1 | 16.90722 | 0.411 | 0.328 |
| | 2000 | 11.7 | 20 | 64.8 | 19.16396 | 0.480 | 0.398 |
| | 2005/6 | 12.3 | 20.9 | 67.2 | 18.31826 | 0.504 | 0.365 |

vdquadros commented 5 years ago

Hi Isaac,

I am going to close this issue as we are done with the Card replication for now.

To recap: In this issue we were still using data downloaded from IPUMS instead of using the Census data directly from ICPSR. As we discovered, IPUMS and ICPSR have different MSAs for the 1980/1990/2000 censuses, so the mapping of MSAs and PUMAs that Card wrote for his paper (since he used ICPSR data) was not working for the IPUMS dataset we had when doing the exercises above.

After switching from IPUMS to ICPSR, we could match Tables 2-3 exactly and get closer with Table 6, as per issue #9.