jessicalum opened this issue 8 years ago
Hey guys,
Individuals are sampled in a 4-8-4 rotation group pattern: "A rotation group is interviewed for 4 consecutive months, temporarily leaves the sample for 8 months, and then returns for 4 more consecutive months before retiring permanently from the CPS (after a total of eight interviews)." I've created a rotation group identifier which takes on the values 1-9. For example, an individual in group 1 was surveyed from January to April 2014 and from January to April 2015. This will be useful down the line when we want to do panel data analysis, which requires us to follow the same individuals over time. The do-file I've posted (post1.do) narrows the sample down to individuals who answered all 8 survey rounds. Moving forward, we can merge the group identifier, using merge 1:1 cpsidp year month using ..., with a dataset that includes other variables, and then keep the observations for which _merge==3.
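A hedged sketch of that merge; post1data and otherdata are hypothetical file names standing in for the post1.do output and the dataset with the other variables:
* post1data and otherdata are placeholders, not actual file names
use post1data, clear
merge 1:1 cpsidp year month using otherdata
keep if _merge==3   // keep only observations present in both datasets
drop _merge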
Hey all,
I posted a 2nd do-file creating some measures of work schedule irregularity/job quality. It creates the following variables (a hedged sketch of how the hours measures might be built follows the list):
underemployed
hvaryall, hvarymain, hvaryother
jobquality
meanhrsmain
sdhrsmain
OTHER: sample
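As a hedged sketch only (the actual definitions are in the do-file), the hours measures might be built per person like this, assuming hrsworkmain holds hours worked at the main job:
* a sketch, not the do-file's exact code
bysort cpsidp: egen meanhrsmain = mean(hrsworkmain)   // person-level mean hours, main job
bysort cpsidp: egen sdhrsmain = sd(hrsworkmain)       // person-level SD, a measure of hours irregularity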
Note: once you xtset your data (command: xtset cpsidp seqdate, where the variable seqdate was created in post1.do), you can analyze it using xtdescribe, xtsum, xttab, etc. For more information, type help xt in Stata. Note also that although I sometimes type reg y x1 x2, since we are using panel data, analysis will not be done with reg unless you choose to do cross-sectional analysis, in which case you would use one month of data only and drop all the other months. Again, look at the help files for more information on estimation.
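A minimal example of that workflow:
xtset cpsidp seqdate   // declare the panel id and time variable (seqdate from post1.do)
xtdescribe             // participation patterns across survey months
xtsum hrsworkmain      // overall, between, and within summary statistics
xttab month            // tabulation with between and within breakdowns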
This do-file contains general descriptive statistics using commands meant to analyze panel data, as well as a way to take the output from multiple xtsum commands and build your own table with the matrix command. Stata has several commands that are useful for creating tables of descriptive statistics, such as tabout, tabstat, svret, putexcel, matrix, and others. Some of these are user-written, so you may not already have them; in that case, type ssc install command. For example, ssc install tabout.
Important: before using the commands that begin with xt-, Stata requires you to run the xtset command, which tells Stata what the panel and time variables are. xtsum is the panel data analog of the sum command for cross-sectional data; xttab is the panel data analog of the tab command.
At the top of the do-file, I've listed further documentation on the commands being used: xttab and xtsum. I've provided the interpretation of the statistics within the do-file, as comments below each corresponding command, so you can view the interpretation in the Results window directly below the output.
Note on creating a table with the matrix command: Stata classifies the more common commands as r-class (general commands) or e-class (estimation commands). Examples of r-class commands are sum, xtsum, tab, xttab, describe, etc. Examples of e-class commands are reg, ivreg, probit, logit, etc. After each command, you may view the stored results using return list (after r-class commands) or ereturn list (after e-class commands). Stata does not keep these results for future use; each new command overwrites them. For example, in our dataset, if you type sum hrsworkmain and then return list, you will see a list of scalars temporarily storing the results in r(mean), r(sd), etc. If you then type tab month and then return list, you can see that the stored results from the previous sum command have been replaced by those from the tab command. Therefore, within my do-file, we create a matrix, run xtsum and store its results in the matrix to fill in the whole first row, run xtsum again to fill in the second row, and run it a third time to fill in the third row.
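A hedged sketch of this approach; var2 and var3 are placeholders for whichever variables you summarize:
matrix T = J(3, 3, .)                       // empty 3x3 matrix
matrix colnames T = mean sd_between sd_within
matrix rownames T = hrsworkmain var2 var3   // var2/var3 are placeholders
local row = 1
foreach v in hrsworkmain var2 var3 {
    quietly xtsum `v'
    matrix T[`row', 1] = r(mean)            // xtsum leaves these in r()
    matrix T[`row', 2] = r(sd_b)
    matrix T[`row', 3] = r(sd_w)
    local ++row
}
matrix list T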
This is the most general way to store results and create a table if you don't know how to use tabout, tabstat, or other commands to create descriptive statistics tables. In the future, when you analyze your data using a regression or any other estimation method, you may use outreg2, regsave, estout/esttab, or you can just create your own matrix. For example:
sysuse auto
regress price mpg trunk
ereturn list
matrix b = e(b)
matrix list b
This do-file uses the spmap command which will allow us to map statistics we create onto a map of the United States.
Note: In the do-file, I mention that you should read this. The tutorial contains a few errors. Step 2 tells you to download a file called s_06se12.zip, which no longer exists. Instead, download the updated version from the linked page by clicking Download Compressed Shapefile s_11au16.zip. Save this file in the same folder as your CPS data files and do-files. Using the shp2dta command, two data files are created in Stata format: a database file named usdb and a file containing coordinates named uscoord. An id variable is created within the database file (usdb.dta) using the genid( ) option of the shp2dta command. Now, when we use usdb and open the data browser, some columns appear in red, meaning those variables are strings. We create a numeric value identifying each state by its FIPS code by typing gen statefip = real(FIPS). The variable is named statefip because that is the name of the variable I later want to merge on from the CPS data; the function real( ) converts a string into a numeric variable. Then duplicates drop statefip, force drops any duplicates, in this case the observation for Maryland that had no longitude and latitude information. We need this because Stata will return an error and refuse to merge usdb.dta with our CPS data on statefip unless statefip uniquely identifies the observations, that is, unless there is exactly one record per state. Saving this dataset, we next work on creating the statistics to be graphed onto the map. Using your CPS data, we create 4 variables:
Note: meanhrsstate, sdhrsstate, and avghrspp are constant within each state; these are the variables we will be able to create maps with.
Next, we drop the duplicate observations of state and merge our CPS data file with the map file (usdb.dta) on the variable statefip. We keep only the observations that merged, since our CPS data has no information for the ones that did not.
Finally, spmap sdhrsstate using uscoord if id !=1 & id!=4 & id!=54 & id!=55, id(id) fcolor(Blues) legend(symy(*2) symx(*2) size(*2))
creates the map. We map our variable sdhrsstate using the dataset with the coordinates, uscoord.dta, restricting to observations where id is not 1, 4, 54, or 55. This excludes Alaska, American Samoa, Guam, and the Northern Mariana Islands from our map; not excluding them would make the map of the mainland states too small. Our unique identifier is the id variable we created in the usdb dataset earlier. We shade the map in blues, but other colors are available. The last option, legend( ), enlarges the legend. Within the graph, you can use the Graph Editor to make further changes.
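Putting the steps above together, a hedged sketch of the whole pipeline (cpsdata is a placeholder for your CPS file name):
ssc install shp2dta
ssc install spmap
shp2dta using s_11au16, database(usdb) coordinates(uscoord) genid(id)
use usdb, clear
gen statefip = real(FIPS)          // numeric state id from the string FIPS variable
duplicates drop statefip, force    // e.g., the Maryland row without coordinates
save usdb, replace
use cpsdata, clear                 // placeholder for your CPS file
duplicates drop statefip, force    // one observation per state (the mapped variables are state-constant)
merge 1:1 statefip using usdb
keep if _merge==3
spmap sdhrsstate using uscoord if id!=1 & id!=4 & id!=54 & id!=55, id(id) fcolor(Blues) legend(symy(*2) symx(*2) size(*2))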
These variables are just a few examples of what can be mapped. Keep in mind that when you create variables to be mapped in a panel data context, you want the value to be constant within state (if using a U.S. map). Another possibility is to create monthly averages or monthly standard deviations and make a map for each of the 24 months of the survey.
This do-file will teach you how to create monthly maps (24 in total) of the standard deviation of work hours across individuals in each state, which you can then combine into a gif.
We start by creating variables for the average work hours per state and the standard deviation of work hours per state, for each month from January 2014 to December 2015.
The nested loop works like this:
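A hedged reconstruction of that loop (the exact code is in the do-file):
sort statefip
forvalues j = 2014/2015 {
    forvalues i = 1/12 {
        by statefip: egen meanhrsstatem`i'`j' = mean(hrsworkmain) if month==`i' & year==`j'
        label var meanhrsstatem`i'`j' "Mean hours worked by state, m`i'y`j'"
        by statefip: egen sdhrsstatem`i'`j' = sd(hrsworkmain) if month==`i' & year==`j'
        label var sdhrsstatem`i'`j' "SD of hours worked by state, m`i'y`j'"
    }
}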
The first time around the commands fill in the macros like this:
by statefip: egen meanhrsstatem12014 = mean(hrsworkmain) if month==1 & year==2014
label var meanhrsstatem12014 "Mean hours worked by state, m1y2014"
by statefip: egen sdhrsstatem12014 = sd(hrsworkmain) if month==1 & year==2014
label var sdhrsstatem12014 "SD of hours worked by state, m1y2014"
It continues filling in i with 2-12, keeping j = 2014. Once this is over, it then moves on to the next j value, 2015:
by statefip: egen meanhrsstatem12015 = mean(hrsworkmain) if month==1 & year==2015
label var meanhrsstatem12015 "Mean hours worked by state, m1y2015"
by statefip: egen sdhrsstatem12015 = sd(hrsworkmain) if month==1 & year==2015
label var sdhrsstatem12015 "SD of hours worked by state, m1y2015"
This continues until all i values have been used in execution of the commands.
The result will be the meanhrsstatem* and sdhrsstatem* variables with suffixes 12014, 22014, 32014, ..., 122014, 12015, 22015, ..., 122015.
These variables are not the within estimators, so they don't supply as much information on how individuals' work hours vary; to get that, we would execute lines 17-25. We use the less informative between estimators here because they let us create gifs showing changes over the 24 survey months (24 graphs) instead of a gif of two months.
As an example, I create one map of the variable sdhrsstatem52015, which shows the distribution of the standard deviation of work hours across states in the survey month of May 2015 (lines 26-33). This makes it easy to see which states have a higher standard deviation among those who are employed there. First, we preserve the existing dataset so that when the do-file finishes, the preserved dataset is restored. This is useful because creating the map requires dropping duplicate observations, but I want the complete dataset back after the map is created so that I can continue working with it. Sorting by state, year, and month, I drop duplicate observations of the variable sdhrsstatem52015. Next, I drop the first observation, because there are still two observations for Alabama and the first observation in the dataset is an Alabama record with missing information for sdhrsstatem52015. I then merge in the basemap data file (see post4.do for more info) to create an empty map of the states. Dropping states that did not merge (they have missing information for sdhrsstatem52015), we make the map with the spmap command and the coordinate file (post4.do), omitting states that are not on the mainland, since including them would shrink the map to accommodate the large distance between the contiguous states and a place like American Samoa.
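A hedged sketch of those steps, not the do-file's exact code:
preserve
sort statefip year month
duplicates drop sdhrsstatem52015, force   // one observation per distinct value
drop in 1                                 // the leftover Alabama row with a missing value
merge 1:1 statefip using usdb             // basemap file from post4.do
keep if _merge==3
spmap sdhrsstatem52015 using uscoord if id!=1 & id!=4 & id!=54 & id!=55, id(id) fcolor(Blues)
restore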
Now that we've created the variables, we will create the maps needed for the gif.
We start by defining a local "x" as 0.
We implement the loop to create a map for each month in 2014-2015. local ++x increments the counter: when the loop is executed the first time, for i=2014 and j=1, local ++x gives the macro a value of 1. graph export uses the macro x to name the graphs in ascending order, so the graphs are named map1-map24. To keep the do-file short, I use one loop to create all the graphs, so you must go into your folder and rename the files so that the graphs are titled map001-map024. This is because I reference the graph names as `GraphPath'map%03d.png (line 74), which expects a three-digit, zero-padded number after the word map.
Next, we create the gif. To make it, you must download ffmpeg. In my do-file, I use the shell command to create the gif because I am on a Mac. If you use Windows or another operating system, use winexec in place of shell.
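A hedged sketch of the export loop and the gif step; the ffmpeg options here are assumptions, not the do-file's exact flags, and the merged map dataset (post4.do) is assumed to be in memory:
local x = 0
forvalues i = 2014/2015 {
    forvalues j = 1/12 {
        local ++x
        spmap sdhrsstatem`j'`i' using uscoord if id!=1 & id!=4 & id!=54 & id!=55, id(id) fcolor(Blues)
        graph export "map`x'.png", replace
    }
}
* rename map1.png-map9.png to map001.png-map009.png, etc., then:
shell ffmpeg -framerate 2 -i map%03d.png map.gif   // use winexec instead of shell on Windows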
You should now have a file named map.gif in your working directory. Dragging the file over a web browser will open a tab with the gif.
The last lines illustrate the same concept by creating kernel density plots of the distribution of the standard deviation of work hours across states for each month. Again, you must manually rename some of the graphs before creating the gif.
Creating gifs of maps from georeferenced data is a great way to illustrate the distribution of statistics across locations over time, and it makes looking for specific patterns much easier than something like an xtline, overlay graph, especially when there are so many values of statefip that the graph becomes cluttered.
This do-file teaches you how to make spells/runs using tsspell. A spell is a continuous length of an occurrence; a spell of unemployment, for instance, is a continuous period of weeks/months/years of being unemployed. Note that in our dataset, people were surveyed in 4-8-4 rotation groups: surveyed for 4 months, not surveyed for the next 8 months, then surveyed for another 4 months. We have already dealt with identifying rotation groups in the group variable created in post1.do.
We start by creating a variable parttime to identify those who report being "usually" part-time, regardless of whether they are part-time due to economic reasons or non-economic reasons.
Ordering the variables is always useful when working with a new command; it makes it easier to visualize, in the data browser, what happens to the variables after each command.
Although not shown, I've already xtset the data using xtset cpsidp seqdate. We've done this multiple times, so your data should already be set if you've created your dataset using the first few do-files. We install the user-written command and then run tsspell on the variable underemployed:
tsspell underemployed, cond(underemployed==1) seq(underempseq) spell(underempspell) end(underempend)
Here, we identify spells with the condition that someone is underemployed during that period. Three variables are created: underempseq gives the sequence of time periods within a continuous spell, underempspell tells you which spell an observation belongs to, and underempend is an indicator marking the end of a spell.
list cpsidp underemployed underempseq underempspell underempend year month in 16105/16112, sep(0)
The output shows that the individual with cpsidp = 20140101534902 has had 2 spells of underemployment: the first started in March 2014 and ended in April 2014, the second started in February 2015 and ended in April 2015. underempseq shows the sequence of time periods within each spell, restarting at 1 each time a new spell starts. underempspell tells you which spell those time periods belong to. underempend marks the period in which each spell ended with a value of 1.
Next, I generate a variable underempmaxspell identifying the number of spells per individual, keeping missing values as missing. underemplongspell is created to identify the longest spell of underemployment an individual has had.
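A hedged sketch of how these might be built (not the do-file's exact code, which also handles missing values explicitly):
bysort cpsidp: egen underempmaxspell = max(underempspell)   // highest spell number = number of spells
bysort cpsidp: egen underemplongspell = max(underempseq)    // longest within-spell sequence = longest spell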
xttab underemplongspell if underempmaxspell==1
shows us that 33 people were underemployed throughout the 8 months they were surveyed.
We continue by creating variables identifying spells of reporting varied work hours, of usual part-time employment, and of being part-time for economic reasons (parttime2).
Note: onespell is a user-written command by Christopher F. Baum. Similar to tsspell, it creates a new dataset containing each panel's longest run of continuous non-missing observations over time.
ssc install onespell
help onespell
webuse grunfeld, clear
tab company
*each company has 20 observations
*make it so some companies have shorter spells of continuous non-missing information
replace invest = . in 28
replace mvalue = . in 55
replace kstock = . in 87
replace kstock = . in 94
onespell invest mvalue kstock, saving(grun1)
use grun1, clear
tab company
*companies 2, 3, and 5 now have fewer observations because only their longest spell of continuous non-missing observations was kept.
In this do-file, we will create a gif using triplot graphs, and some graphs using xtline. A triplot is a triangular plot showing the composition of the dataset based on 3 variables whose values sum to 1 (or very nearly 1). For my example, I create 3 variables, pctpart, pctfull, and pctneither, representing the percent of individuals in each state, per month, who report being usually part-time, full-time, or neither.
list statefip pctneith pctfull pctpart mnthyr test in 1
This tells us that in January 2014, our Alabama sample consisted of approximately 56.1% of people who were neither part-time nor full-time, 33.6% who were full-time, and 10.3% who were usually part-time. The test variable was created as the sum of the three variables, and here its value represents 100% of the sample at this time.
We check that our pct* variables do not exceed 1 using assert; the do-file terminates at that point if an assertion is false. When we assert that test==1, Stata tells us our assertion is false. Upon further inspection with tab test, we see that some observations have a test value of .9999999, which is fine (floating-point rounding).
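A hedged sketch of these checks; the tolerance is an assumption:
assert pctpart <= 1 & pctfull <= 1 & pctneither <= 1
assert abs(test - 1) < 1e-6   // assert test==1 fails on values like .9999999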
We install the user-written command triplot and create a triplot of the composition of these employment categories for each month, with each colored dot representing a state. In creating the triplot, I leave out the legend because it covers the graph; in practice you should keep the legend and instead use an option to reposition it.
We create the triplot graphs for each value of mnthyr, creating triplot_001.png-triplot_024.png in our directory. Once again, we use the shell command on a Mac and winexec on Windows. Your directory should then contain a file named triplot.gif; dragging this file onto your web browser icon should open a tab with the gif.
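A hedged sketch of that loop, assuming mnthyr is coded 1-24 and that legend(off) suppresses the legend; the ffmpeg flags are also assumptions:
ssc install triplot
forvalues m = 1/24 {
    local f : display %03.0f `m'                               // zero-padded counter
    triplot pctpart pctfull pctneither if mnthyr==`m', legend(off)
    graph export "triplot_`f'.png", replace
}
shell ffmpeg -framerate 2 -i triplot_%03d.png triplot.gif      // winexec on Windows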
Another graph useful for showing changes over time is one created with xtline, which plots a continuous variable by some category (here, by region) over time. To do so, we create variables that are unique by region and time period, drop the duplicate observations of region and time period, and then run the xtline command. Omitting the overlay option draws a separate graph for each region side by side instead of one graph for all regions.
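A hedged sketch, assuming a numeric region variable and the monthly index mnthyr:
bysort region mnthyr: egen meanhrsregion = mean(hrsworkmain)   // constant within region-month
keep region mnthyr meanhrsregion
duplicates drop region mnthyr, force
xtset region mnthyr
xtline meanhrsregion            // one panel per region
xtline meanhrsregion, overlay   // all regions on a single graph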
Descriptions of all do-files and other files posted will be here. Each comment refers to a do-file.