Data cleaning - Githubissues

vdquadros commented 5 years ago

1. Defining who is immigrant

Card:
Defines as immigrants people who are naturalized citizens or who are not citizens.

citizen='0=us born 1=nat 2=not cit 3=born abroad us parents'
imm=(citizen in (1,2))

Victoria: Card's definition (naturalized or not a citizen, which are codes 2 and 3 here) plus codes 4 and 5

 /* CITIZEN:
           0 n/a
           1 born abroad of american parents
           2 naturalized citizen
           3 not a citizen
           4 not a citizen, but has received first papers
           5 foreign born, citizenship status not reported
*/
* imm = 1 for codes 2-5 and 0 otherwise, matching Card's 0/1 coding
gen imm = inlist(citizen, 2, 3, 4, 5)

2. Hours worked last year

Card: His data seem to have the exact number of weeks people worked last year. His code is the following:

annhrs=weeks*hrswkly;

That is, total annual hours = weeks * weekly hours

Victoria: The Bartik data has multiple bins for number of weeks worked last year.

WKSWORK2:
           0 n/a
           1 1-13 weeks
           2 14-26 weeks
           3 27-39 weeks
           4 40-47 weeks
           5 48-49 weeks
           6 50-52 weeks

Thus, I currently assign each bin its midpoint.

gen weeks = .
replace weeks = 0 if wkswork2 == 0
replace weeks = 7 if wkswork2 == 1
replace weeks = 20 if wkswork2 == 2
replace weeks = 33 if wkswork2 == 3
replace weeks = 43.5 if wkswork2 == 4
replace weeks = 48.5 if wkswork2 == 5
replace weeks = 51 if wkswork2 == 6
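
As a cross-check on the bin values, the midpoints can be computed directly from the WKSWORK2 bin edges; for the 50-52 bin the midpoint works out to 51. This is a minimal Python sketch with hypothetical names, and it treats n/a (code 0) as zero weeks as the Stata code above does.

```python
# WKSWORK2 bin edges, copied from the codebook above.
WKSWORK2_BINS = {
    1: (1, 13),
    2: (14, 26),
    3: (27, 39),
    4: (40, 47),
    5: (48, 49),
    6: (50, 52),
}


def weeks_midpoint(code: int) -> float:
    """Midpoint of the weeks-worked bin; n/a (code 0) is treated as 0 weeks."""
    if code == 0:
        return 0.0
    low, high = WKSWORK2_BINS[code]
    return (low + high) / 2


def annual_hours(code: int, hrswkly: float) -> float:
    """Card's annhrs = weeks * hrswkly, with weeks approximated by the bin midpoint."""
    return weeks_midpoint(code) * hrswkly
```

For example, `annual_hours(4, 40)` uses 43.5 weeks for the 40-47 bin.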

3. Education labels

Card: Census data has exactly one category for each grade and it also has information on whether the person completed the grade.

GRADE             2     40                                                 
                            Highest Year of School                              
                              Attended                                          
                  00        Never attended school or N/A (under 3               
                              years of age)                                     
                  01        Nursery school                                      
                  02        Kindergarten                                        
                            Elementary:                                         
                  03          First grade                                       
                  04          Second grade                                      
                  05          Third grade                                       
                  06          Fourth grade                                      
                  07          Fifth grade                                       
                  08          Sixth grade                                       
                  09          Seventh grade                                     
                  10          Eighth grade                                      
                            High school:                                        
                  11          Ninth grade                                       
                  12          Tenth grade                                       
                  13          Eleventh grade                                    
                  14          Twelfth grade                                     
                            College:                                            
                  15          First year                                        
                  16          Second year                                       
                  17          Third year                                        
                  18          Fourth year                                       
                  19          Fifth year                                        
                  20          Sixth year                                        
                  21          Seventh year                                      
                  22          Eighth year or more

Victoria: Bartik data has too many categories and the numbers don't really add up:

EDUCD:
           0 n/a or no schooling
           1 n/a
           2 no schooling completed
          10 nursery school to grade 4
          11 nursery school, preschool
          12 kindergarten
          13 grade 1, 2, 3, or 4
          14 grade 1
          15 grade 2
          16 grade 3
          17 grade 4
          20 grade 5, 6, 7, or 8
          21 grade 5 or 6
          22 grade 5
          23 grade 6
          24 grade 7 or 8
          25 grade 7
          26 grade 8
          30 grade 9
          40 grade 10
          50 grade 11
          60 grade 12
          61 12th grade, no diploma
          62 high school graduate or ged
          63 regular high school diploma
          64 ged or alternative credential
          65 some college, but less than 1 year
          70 1 year of college
          71 1 or more years of college credit, no degree
          80 2 years of college
          81 associate's degree, type not specified
          82 associate's degree, occupational program
          83 associate's degree, academic program
          90 3 years of college
         100 4 years of college
         101 bachelor's degree
         110 5+ years of college
         111 6 years of college (6+ in 1960-1970)
         112 7 years of college
         113 8+ years of college
         114 master's degree
         115 professional degree beyond a bachelor's degree
         116 doctoral degree
         999 missing

To see what I mean by "they don't really add up", consider:

educational attainment [detailed version]      Freq.   Percent      Cum.
grade 1, 2, 3, or 4                           59,078     35.54     35.54
grade 1                                        9,674      5.82     41.36
grade 2                                       21,079     12.68     54.04
grade 3                                       37,034     22.28     76.32
grade 4                                       39,371     23.68    100.00
Total                                        166,236    100.00
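
The percentages in this tabulation can be reproduced from the frequencies. The short Python sketch below (variable names are hypothetical) makes explicit that the aggregate category "grade 1, 2, 3, or 4" is tabulated as its own row next to the single-grade rows, rather than as their sum.

```python
# Frequencies copied from the tabulation above.
freq = {
    "grade 1, 2, 3, or 4": 59078,
    "grade 1": 9674,
    "grade 2": 21079,
    "grade 3": 37034,
    "grade 4": 39371,
}

total = sum(freq.values())

# Share of each row in the total, in percent, rounded as in the Stata output.
percent = {label: round(100 * n / total, 2) for label, n in freq.items()}

# Sum of the four single-grade rows, for comparison with the aggregate row.
single_grades = sum(n for label, n in freq.items() if label != "grade 1, 2, 3, or 4")
```

Here `single_grades` is 107,158 while the aggregate row has 59,078 observations, so the aggregate code cannot be a roll-up of the detailed codes.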

4. Income measures

Card:

wagesal: Wage or Salary Income (INCOME1 in the 1980 Census variable dictionary)
selfinc: Nonfarm Self-Employment Income (INCOME2 in the 1980 Census variable dictionary)
farminc: Farm Self-Employment Income (INCOME3 in the 1980 Census variable dictionary)
income: Income From All Sources (INCOME8 in the 1980 Census variable dictionary)

Then he defines self-employed as anyone who has a positive (selfinc + farminc)

Victoria:

inctot: total personal income
ftotinc: total family income
incwage: wage and salary income
incbus00: business and farm income, 2000
incearn: total personal earned income

Variable          Obs       Mean   Std. Dev.      Min        Max
inctot     21,864,217   23201.98    34290.62   -20000    1471000
ftotinc    21,864,217   249646.6     1404037   -30000    9999999
incwage    21,864,217   19101.37    19101.37        0     641000
incbus00   10,986,023   2195.484    15365.94   -10000     573000
incearn    16,810,374   24123.92    35430.46   -19996    1146000

The Bartik dataset has no measure of self-employment earnings comparable to Card's, so I will use the class-of-worker variable below to define self-employment:

CLASSWKRD:
           0 n/a
          10 self-employed
          11 employer
          12 working on own account
          13 self-employed, not incorporated
          14 self-employed, incorporated
          20 works for wages
          21 works on salary (1920)
          22 wage/salary, private
          23 wage/salary at non-profit
          24 wage/salary, government
          25 federal govt employee
          26 armed forces
          27 state govt employee
          28 local govt employee
          29 unpaid family worker
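
The two self-employment definitions can be sketched as follows. This is illustrative Python with hypothetical names; it assumes that CLASSWKRD codes 10 through 14 are the ones to count as self-employed, which is a reading of the codebook above rather than something confirmed in Card's code.

```python
# CLASSWKRD codes in the self-employed block of the codebook above (assumption).
SELF_EMPLOYED_CODES = {10, 11, 12, 13, 14}


def selfemp_card(selfinc: float, farminc: float) -> int:
    """Card's rule: self-employed if (selfinc + farminc) is positive."""
    return int(selfinc + farminc > 0)


def selfemp_classwkrd(classwkrd: int) -> int:
    """Class-of-worker rule: self-employed if the code is in the 10-14 block."""
    return int(classwkrd in SELF_EMPLOYED_CODES)
```

The two rules need not agree: Card's rule depends on reported income, while the class-of-worker rule depends on the reported job type.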

https://www.dropbox.com/s/8jiij8ntdq1lcau/Screenshot%202019-01-16%2014.56.49.png?dl=0

5. Country codes - grouping into 38 groups

Card: The country codes used by Card can be found in Appendix F of the Codebook for the 1980 5% extracts, available from ICPSR.

He groups countries into 38 groups:

mexico
phillip
india
vietnam
el salvador
china
cuba
dominican rep. 
korea
jamaica
canada
colombia
guatemala
germany
haiti 
poland
taiwan
england
italy
ecuador
japan
iran
honduras
peru
russia
nicaragua
guyana
pakistan
hong kong
trinidad-tobago
west europe+isreal+cyprus+auss+nz
east europe incl romania ukraine yugoslav
middle east turkey bulgaria and the stans
asia and oceana
s america + north am nec
africa
caribbean + central am
else

Somewhat unrelated note: Later on, Card creates even broader categories of countries (e.g., european, high asia, mid asia, mexico). He includes Canada in the european group, and Pakistan and Iran in the high asia group.

Victoria: Issue: The Bartik dataset lacks 15 of the 38 groups used by Card:

el salvador
dominican rep. 
jamaica 
colombia 
guatemala 
haiti
taiwan
ecuador
honduras 
peru
nicaragua
guyana
pakistan
hong kong
trinidad-tobago

For these groups, instead of using a person's place of birth, I combine whether the person is an immigrant with her primary ancestry (using the variable ancestr1). So if a person is an immigrant and her first response for ancestry is "salvadoran", I count her as having been born in El Salvador. This is of course not perfect, since some immigrants were born in a country other than the one their ancestry points to.
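
The fallback rule can be sketched as below. This is a hedged Python illustration: the names are hypothetical, the ancestry values are written as text labels for readability (ancestr1 itself is a coded variable), and only the "salvadoran" example from the text is filled in.

```python
from typing import Optional

# Ancestry label -> Card country group. Only the example from the text is
# included; the other 14 groups would need their own entries.
ANCESTRY_TO_GROUP = {"salvadoran": "el salvador"}


def country_group(imm: int, birthplace_group: Optional[str], ancestry: str) -> Optional[str]:
    """Use the birthplace group when available; otherwise fall back to
    ancestry, but only for immigrants."""
    if birthplace_group is not None:
        return birthplace_group
    if imm == 1:
        return ANCESTRY_TO_GROUP.get(ancestry)
    return None
```

A native with Salvadoran ancestry gets no group from the fallback, which keeps second-generation respondents out of the country-of-birth counts.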

6. Years in the US

Card: Census data has the immigration year. So the 1980 Census, for example, has a variable that looks like

     IMMIGR            1     26                                                 
                            Year of Immigration                                 
                   0        N/A (born in the United States or                   
                              outlying areas or born abroad of                  
                              American parents)                                 
                   1        1975 to 1980                                        
                   2        1970 to 1974                                        
                   3        1965 to 1969                                        
                   4        1960 to 1964                                        
                   5        1950 to 1959                                        
                   6        Before 1950

So he approximates how many years the person has been in the U.S. using that variable. This allows him to distinguish between people who have been in the U.S. for 20+ years vs. 40+ years.

if immyr=1 then yrsinus=2.5;
else if immyr=2 then yrsinus=7.5;
else if immyr=3 then yrsinus=12.5;
else if immyr=4 then yrsinus=17.5;
else if immyr=5 then yrsinus=25.5;
else if immyr=6 then yrsinus=40;
else yrsinus=.;

Victoria: The Bartik data, on the other hand, only says whether the person has been in the U.S. for 21+ years, so we don't have the same level of granularity and I am not sure how many years to assign. Note: For each observation, we have the person's date of birth, so maybe we can use that to approximate the number of years in the U.S.

Right now, I am using 30 years for anyone who has been in the U.S. for 21+ years.

YRSUSA2:
           0 n/a
           1 0-5 years
           2 6-10 years
           3 11-15 years
           4 16-20 years
           5 21+ years
           9 missing
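
The two recodes can be compared with a small Python sketch. Card's values are copied from his code above; for YRSUSA2, the value of 30 for the 21+ bin comes from the text, while using bin midpoints for the other bins is an assumption made here for illustration. All names are hypothetical.

```python
from typing import Optional

# Card's mapping from the 1980 IMMIGR code to approximate years in the U.S.
CARD_YRSINUS = {1: 2.5, 2: 7.5, 3: 12.5, 4: 17.5, 5: 25.5, 6: 40.0}

# YRSUSA2 closed bins from the codebook above; the 21+ bin is open-ended.
YRSUSA2_BINS = {1: (0, 5), 2: (6, 10), 3: (11, 15), 4: (16, 20)}
TOP_BIN_YEARS = 30.0  # current choice for 21+ years


def yrsinus_card(immyr: int) -> Optional[float]:
    """Card's recode; anything outside codes 1-6 is missing."""
    return CARD_YRSINUS.get(immyr)


def yrsinus_bartik(yrsusa2: int) -> Optional[float]:
    """Midpoint for closed bins (assumption), 30 for 21+, missing otherwise."""
    if yrsusa2 in YRSUSA2_BINS:
        low, high = YRSUSA2_BINS[yrsusa2]
        return (low + high) / 2
    if yrsusa2 == 5:
        return TOP_BIN_YEARS
    return None  # n/a (0) or missing (9)
```

The single 21+ bin covers what Card splits into two values (25.5 and 40), which is where the loss of granularity shows up.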

vdquadros commented 5 years ago

I had a meeting with Isaac on Thursday, Jan. 17th, and we went over those issues.

It looks like when Isaac and Paul downloaded the data from IPUMS, they didn't download exactly the same variables, so it's just a matter of re-downloading the appropriate variables and merging them with the Bartik dataset.

I downloaded the variables:

BPLD (detailed) - Birthplace [detailed version]
YRIMMIG - Year of immigration
WKSWORK1 - Weeks worked last year

vdquadros commented 5 years ago

What's the appropriate weight to use in stata?

During data cleaning, Card randomly drops half of the observations for natives but keeps all of the observations for immigrants.

He then creates a "weight" variable that =2 if the person is a native and =1 if the person is an immigrant.

He then uses it in PROC MEANS and PROC GLM, such as:

proc means;
where (imm=0);
weight wt;

And also:

proc glm data=nm;
class eclass xclass homey;
model logwage2=exp exp2 exp3 educ eclass*xclass inschool advanced 
      ft lowhrs 
   hisp_ed hisp_coll black_ed black_coll  asian_ed asian_coll 
   homey*eclass rmsa0 rmsa1 / solution;
output out=nm2 predicted=pred residual=res;
weight wt;

where the latter is like an anova with "by" in Stata.

When I read about how "weight" works in the PROC GLM, I find that

If you use a WEIGHT statement, PROC GLM computes weighted means and estimates their variance as inversely proportional to the corresponding sum of weights (see the section Weighted Means). However, note that the statistical interpretation of multiple comparison tests for weighted means is not well understood here

which seems to be what aweights are for in stata:

aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; that is, the variance of the jth observation is assumed to be sigma^2/w_j, where w_j are the weights. Typically, the observations represent averages and the weights are the number of elements that gave rise to the average. For most Stata commands, the recorded scale of aweights is irrelevant; Stata internally rescales them to sum to N, the number of observations in your data, when it uses them.

while pweights in stata are:

pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included because of the sampling design.

which also seems to describe Card's setup, since natives are kept with probability one half.

So what should I do? Note: Stata's anova only accepts aweights and fweights as options.

Should I use pweight to collapse the data (to calculate population over CZs, let's say) and use aweight to do anova?

econisaac commented 5 years ago

Hi Victoria,

Here's my reconstruction of Card's logic, and then I'll talk about how to implement this. The reason that Card is giving double the weight to the native observations is that he has dropped half of the observations and so he is then doubling the weight on the natives to get back to the population.

Therefore, if you want to use "pweights" you should create a variable weight=2 for the native observation (so that the inverse of this is 0.5, which is the probability that the observation is included). You don't want to use aweights. In the anova command, "fweight" will do what you want. The stata help file says:

Frequency fweights indicate replicated data. The weight tells the command how many observations each observation really represents. fweights allow data to be stored more parsimoniously. The weighting variable contains positive integers. The result of the command is the same as if you duplicated each observation however many times and then ran the command unweighted.

So here each native observation really represents 2 natives.
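
The equivalence described in the help file can be checked with a toy example. The Python sketch below uses made-up wage values purely for illustration (they are not from the data) and hypothetical function names: a mean computed with weights 2 for natives and 1 for immigrants equals the unweighted mean after duplicating each native row.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Mean with integer frequency weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def expand(values, weights):
    """Duplicate each observation as many times as its weight says."""
    expanded = []
    for v, w in zip(values, weights):
        expanded.extend([v] * w)
    return expanded


# Toy data: two natives (weight 2 each) and one immigrant (weight 1).
wages = [10.0, 20.0, 30.0]
wt = [2, 2, 1]

weighted = weighted_mean(wages, wt)
duplicated = mean(expand(wages, wt))
```

Both `weighted` and `duplicated` come out to the same number, which is the sense in which each native observation "really represents 2 natives".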

Hope that helps.

Isaac


vdquadros commented 5 years ago

Card:

Adjusts wages in 1980 but not in 1990:

In 1980, we have a statement like the one below. We don't have an equivalent one for 1990.

replace incwage=75000*1.43 if incwage == 75000 /*pareto fix*/
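
The 1980 adjustment amounts to scaling wages that sit exactly at the top code. A minimal Python sketch, using the 75000 threshold and the 1.43 factor from the statement above (the constant and function names are hypothetical):

```python
TOPCODE_1980 = 75000    # top-coded wage value in 1980
PARETO_FACTOR = 1.43    # multiplier applied at the top code


def pareto_fix(incwage: float) -> float:
    """Scale wages at the top code; leave all other values unchanged."""
    if incwage == TOPCODE_1980:
        return incwage * PARETO_FACTOR
    return incwage
```

Because no such step exists for 1990, top-coded wages in the two years are not treated symmetrically.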

Considers Mexicans as hispanic in 1980, but not in 1990.

The codebook for 1980 says:

     SPANISH           1     14                                                 
                            Spanish Origin                                      
                   0        N/A (not of Spanish origin)                         
                   1        Mexican                                             
                   2        Puerto Rican                                        
                   3        Cuban                                               
                   4        Other Spanish

And Card's code calls that variable "hispanic" and goes: if hispanic>=1 then hispanic=1.

However, in 1990, Card uses a more detailed variable to define hispanic (since this variable didn't exist in 1980). The codebook is:

HISPAND:
           0 not hispanic
         100 mexican
         102 mexican american
         103 mexicano/mexicana
         104 chicano/chicana
         105 la raza
         106 mexican american indian
         107 mexico
         200 puerto rican
         300 cuban
         [etc]

And Card's code is:

if dhisp=0 or (6<=dhisp<=199) then hispanic=0;
          else hispanic=1;

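Under the HISPAND coding above, every Mexican code (100 through 107, including "mexico" at 107) falls in the 6-199 range, so this rule would not count Mexicans as hispanic in 1990.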
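
The two rules can be written out to see how specific codes are classified. This is a Python transcription of Card's two statements with hypothetical function names; it takes the SAS condition at face value and applies it to the codes from the codebooks above.

```python
def hispanic_1980(spanish: int) -> int:
    """1980 rule: any Spanish-origin code of 1 or more counts as hispanic."""
    return int(spanish >= 1)


def hispanic_1990(dhisp: int) -> int:
    """1990 rule: code 0 and codes 6 through 199 are set to not hispanic."""
    if dhisp == 0 or 6 <= dhisp <= 199:
        return 0
    return 1


# HISPAND codes for the Mexican categories listed above.
mexican_codes = [100, 102, 103, 104, 105, 106, 107]
mexican_flags = [hispanic_1990(code) for code in mexican_codes]
```

All entries of `mexican_flags` are 0, while Puerto Rican (200) and Cuban (300) map to 1. Note that codes 1 through 5 also map to 1 under this rule, so it may be worth confirming that dhisp uses the same coding as HISPAND.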

vdquadros / immigration_enclave

Data cleaning #1
