vdquadros / immigration_enclave

2 stars 3 forks source link

CZs not a unique identifier #6

Closed vdquadros closed 5 years ago

vdquadros commented 5 years ago

Today in the meeting we saw that I should be merging files with only one observation per CZ. However, some files seem to have by construction more than one obs per CZ.

Just to document a simple example.

The cell1 script in the 2000 folder collapses the data by the variables: rczone, native, male, eclass and xclass2. The resulting dataset is called bigcells.dta

In the table6 script, we load the bigcells dataset and for each value that eclass takes, we keep only the male and native observations. However, rczone is not yet a unique identifier since there are 4 values for xclass2 for each obs that is male, native, and in a specific eclass.

xclass2 is an experience variable and eclass is an education variable. Each of them has 4 categories.

Variable definitions in Card's code:

if educ<12 then eclass=1;
else if educ=12 then eclass=2;
else if educ<16 then eclass=3;
else eclass=4;
if exp<=10 then xclass2=1;
else if exp<=20 then xclass2=2;
else if exp<=30 then xclass2=3;
else xclass2=4;
c=1;

Below are snippets of the code that saves bigcells and also the ones that use bigcells

Script cell1 saves bigcells

SAS code

proc summary;
class rmsa native male eclass xclass2;
var logwage2 lw2sq res ressq pred predsq respred imm female educ exp c
         dropout hs somecoll collplus college advanced    ;
output out=here.bigcells
mean=
sum(c)=count;
weight wt;

Script table6 uses bigcells

SAS code

*this macro gets native wages by eclass;
%macro nwage(ed);
%let edg=&ed;
data nw&edg;
set here.bigcells;
if native=1 and male=1 and eclass=&edg and xclass2=. ;
nwage&edg=logwage2;
nres&edg=res;
npred&edg=pred;
ncountw&edg=count / 1000;
keep rmsa nwage&edg npred&edg nres&edg ncountw&edg;
proc sort; by rmsa;
%mend;

Stata code

use data/2000/bigcells.dta, clear
local education_groups 1 2 3 4 
forv i=1/4{
    local edg : word `i' of `education_groups'
    preserve
    keep if native == 1 & male == 1 & eclass == `edg' 
    gen nwage`edg' = logwage2
    gen nres`edg' = res
    gen npred`edg' = pred
    gen ncountw`edg' = count/1000
    keep rczone nwage`edg' npred`edg' nres`edg' ncountw`edg' 
    sort rczone
    save data/2000/nw`edg'.dta, replace
    restore
}
           |                   eclass
   xclass2 |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         1 |   175,066    464,765    295,094    292,924 | 1,227,849 
         2 |   155,468    529,215    302,775    315,280 | 1,302,738 
         3 |   152,129    526,231    283,651    309,079 | 1,271,090 
         4 |   153,342    442,467    169,002    166,318 |   931,129 
-----------+--------------------------------------------+----------
     Total |   636,005  1,962,678  1,050,522  1,083,601 | 4,732,806 
vdquadros commented 5 years ago

The files that Card creates make sense if the SAS statement

if native=1 and male=1 and eclass=&edg and xclass2=.

effectively collapses the variables across all categories of xclass2 when we say "xclass2 = ."


Also in table6, we again use bigcells and only "filter" by native and male. This results in 16 obs for each rczone (the 4 of the eclass + 4 of xclass2). The resulting file (nw80) is one of the merged by rczone by Card.

SAS code Creating nw80

data nw80;
set c80.bigcells;  
if native=1 and male=1 and eclass=. and xclass2=.;
nres80=res;
label nres80='residual wage native men 1980';
keep rmsa nres80;
proc sort; by rmsa;

Merging

data big;
merge m1 m80 m90 m2 m280 m290 
     nw1 nw2 nw3 nw4 iw1 iw2 iw3 iw4
     nw901 nw902 nw903 nw904 iw901 iw902 iw903 iw904
     nw801 nw802 nw803 nw804 iw801 iw802 iw803 iw804
     inflow nw nw80 nw90 iw iw80 iw90 mfg80 mfg90;
by rmsa;
econisaac commented 5 years ago

Hi,

I think that the SAS code "xclass=." means xclass is missing, but you should check by creating a dummy dataset in SAS and seeing what happens when you run that code on it. (It seems like your interpretation is correct...since the way the variable is created there are no missings...this is not quite my understanding of how SAS works....but I believe that you have SAS on your machine (I paid for it!) so its worth seeing what happens in a small dataset where none of the variable observations are missing).

The SAS syntax in the second block seems to be taking observations where eclass and xclass are missing. I wonder if elsewhere in the code this was an inelegant way of taking means? (That is, the SAS code is somehow going to down to only one of the exclass x xlcass2 cells; I think it is going down to the cell where these are both missing...but I might be wrong).

isaac

vdquadros commented 5 years ago

Hi Issac,

I think it might be an inelegant way of taking means/collapsing, but I can't identify in Card's code how he would have "engineered" the dataset to work this way.

In our case, the bigcells dataset doesnt have any missing values for eclass and xclass2

What I think is a reasonable idea is to create new bigcells datasets. Instead of collapsing the dataset by (rczone native male eclass xclass2), I create two other datasets: one of them collapses by (rczone native male eclass) and another one collapses by (rczone native male).

But I will see what the . does.

econisaac commented 5 years ago

Yeah...this is not a SAS programming trick I've ever come across. But I tend to prefer long, explicit and inelegant solutions....

(It might be that because there are no non-missing values of the variable, that it somehow takes a mean).

econisaac commented 5 years ago

And, yes, I think the best thing to do is to create a dummy dataset and just check, rather than speculate.

econisaac commented 5 years ago

Interesting. So this does seem to be taking some kind of average as you had thought.

vdquadros commented 5 years ago

Suppose we have this dataset, which was saved in excel as "test2.xlsx".

state eclass xclass2 male native age c
1 1 2 1 1 20 1
1 1 1 1 1 30 1

We then run the following script in SAS

/** Import an XLSX file.  **/

PROC IMPORT DATAFILE="/folders/myfolders/test2.xlsx"
            OUT=WORK.test2
            DBMS=XLSX
            REPLACE;
RUN;

/** Print the results. **/
PROC PRINT DATA=WORK.test2; RUN;

/** Pick best format for variable age. **/
data new;
    set WORK.test2(rename=(age = age_char)) ;
    age = input(age_char,best.);
run;

/** Create the bigcells dataset. **/
proc summary;
class state native male eclass xclass2;
var age c;
output out=WORK.bigcells
mean=
sum(c)=count;

PROC PRINT DATA=WORK.bigcells; RUN;

data nw1;
set WORK.bigcells;
if native=1 and male=1 and eclass=1 and xclass2=. ;
nage1=age;
ncount1=count;
keep state nage1 ncount1;
proc sort; by state;

PROC PRINT DATA=WORK.nw1; RUN;

The output bigcells is

which is much bigger than I was expecting.

Obs state native male eclass xclass2 TYPE FREQ age c count
1 . . . . . 0 2 25 1 2
2 . . . . 1 1 1 30 1 1
3 . . . . 2 1 1 20 1 1
4 . . . 1 . 2 2 25 1 2
5 . . . 1 1 3 1 30 1 1
6 . . . 1 2 3 1 20 1 1
7 . . 1 . . 4 2 25 1 2
8 . . 1 . 1 5 1 30 1 1
9 . . 1 . 2 5 1 20 1 1
10 . . 1 1 . 6 2 25 1 2
11 . . 1 1 1 7 1 30 1 1
12 . . 1 1 2 7 1 20 1 1
13 . 1 . . . 8 2 25 1 2
14 . 1 . . 1 9 1 30 1 1
15 . 1 . . 2 9 1 20 1 1
16 . 1 . 1 . 10 2 25 1 2
17 . 1 . 1 1 11 1 30 1 1
18 . 1 . 1 2 11 1 20 1 1
19 . 1 1 . . 12 2 25 1 2
20 . 1 1 . 1 13 1 30 1 1
21 . 1 1 . 2 13 1 20 1 1
22 . 1 1 1 . 14 2 25 1 2
23 . 1 1 1 1 15 1 30 1 1
24 . 1 1 1 2 15 1 20 1 1
25 1 . . . . 16 2 25 1 2
26 1 . . . 1 17 1 30 1 1
27 1 . . . 2 17 1 20 1 1
28 1 . . 1 . 18 2 25 1 2
29 1 . . 1 1 19 1 30 1 1
30 1 . . 1 2 19 1 20 1 1
31 1 . 1 . . 20 2 25 1 2
32 1 . 1 . 1 21 1 30 1 1
33 1 . 1 . 2 21 1 20 1 1
34 1 . 1 1 . 22 2 25 1 2
35 1 . 1 1 1 23 1 30 1 1
36 1 . 1 1 2 23 1 20 1 1
37 1 1 . . . 24 2 25 1 2
38 1 1 . . 1 25 1 30 1 1
39 1 1 . . 2 25 1 20 1 1
40 1 1 . 1 . 26 2 25 1 2
41 1 1 . 1 1 27 1 30 1 1
42 1 1 . 1 2 27 1 20 1 1
43 1 1 1 . . 28 2 25 1 2
44 1 1 1 . 1 29 1 30 1 1
45 1 1 1 . 2 29 1 20 1 1
46 1 1 1 1 . 30 2 25 1 2
47 1 1 1 1 1 31 1 30 1 1
48 1 1 1 1 2 31 1 20 1 1

The output nw1 is

Obs state nage1 ncount1
1 . 25 2
2 1 25 2

Which means that SAS ignores the different xclass2 cells and reports a new average of the variable age instead of reporting 2 observations, one for each value of xclass2.


So the equivalent in Stata is to collapse without xclass2. This is what I had conjectured in my comment above.

If we collapsed including the variable xclass2 we would have two obs for the same state.

The following Stata code results in the correct output

import excel using test2, firstrow clear

collapse (sum) count = c (mean) age, by(state native male eclass)

keep if native == 1 & male == 1 & eclass == 1
gen nage1 = age
gen ncount1 = count
keep state nage1 ncount1
sort state ncount1
save nw1.dta, replace

use nw1.dta, clear 
state nage1 ncount1
1 25 2

The following Stata code results in the wrong output

import excel using test2, firstrow clear

collapse (sum) count = c (mean) age, by(state native male eclass xclass2)

keep if native == 1 & male == 1 & eclass == 1
gen nage1 = age
gen ncount1 = count
keep state nage1 ncount1
sort state ncount1
save nw1.dta, replace

use nw1.dta, clear 
state nage1 ncount1
1 30 1
1 20 1

Hi Issac,

I think it might be an inelegant way of taking means/collapsing, but I can't identify in Card's code how he would have "engineered" the dataset to work this way.

In our case, the bigcells dataset doesnt have any missing values for eclass and xclass2

What I think is a reasonable idea is to create new bigcells datasets. Instead of collapsing the dataset by (rczone native male eclass xclass2), I create two other datasets: one of them collapses by (rczone native male eclass) and another one collapses by (rczone native male).

But I will see what the . does.

econisaac commented 5 years ago

Great! I guess the lesson is that for the ambiguous SAS code its worth creating dummy datasets and seeing what it does...