Getting ADH subdirectory to run

larsvilhuber commented 4 years ago

cty_census.dta was created using a number of different inputs: QCEW, census extracts, and population counts from SEER.

andrewfoote commented 4 years ago

My past self thought it would be a good idea to make this file on the M Drive, rather than actually in the line of processing. Is there a tag for "shockingly bad ideas"? I will clean this up and put together the files.

andrewfoote commented 4 years ago

@larsvilhuber Not sure if you want more details than these files, but this is a start - I can also re-create from scratch if needed.

larsvilhuber commented 4 years ago

Hm. Also cw_cty_czone.dta:

https://github.com/larsvilhuber/MobZ/blob/dc9340e841f32abd7cc70a72ab4b5b2853a770cb/programs/07_adh/00.01.IPW_creation.do#L6

I'm going to integrate the earlier files into the ADH data creation flow directly. QCEW should probably come from the same file we use in the QCEW folder.

andrewfoote commented 4 years ago

@larsvilhuber I finally figured out my naming convention. cw_cty_czone is "crosswalk from county to commuting zone"

andrewfoote commented 4 years ago

And I figured out where it came from: https://www.ddorn.net/data/cw_cty_czone.zip

I will update the read-me

larsvilhuber commented 4 years ago

@andrewfoote Next part: Need the source of the NHGIS data (old "census_together.do")

https://github.com/larsvilhuber/MobZ/blob/c23950695f8343ec7dd9b509c9e6507a407772d3/programs/07_adh/00_01_census_creation.do#L1

https://github.com/larsvilhuber/MobZ/blob/c23950695f8343ec7dd9b509c9e6507a407772d3/programs/07_adh/00_01_census_creation.do#L20

https://github.com/larsvilhuber/MobZ/blob/c23950695f8343ec7dd9b509c9e6507a407772d3/programs/07_adh/00_01_census_creation.do#L38

For this one, I believe that the NHGIS terms of use allow redistribution of the extracted file, subject to citation. You should also describe how you extracted the file.

andrewfoote commented 4 years ago

@larsvilhuber How should I describe that? Should I just say "extracted from NHGIS" with the following citation:

Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 14.0 [Database]. Minneapolis, MN: IPUMS. 2019. http://doi.org/10.18128/D050.V14.0

larsvilhuber commented 4 years ago

@andrewfoote Yes. How big are the files?

larsvilhuber commented 4 years ago

@andrewfoote Well, strictly speaking, you need to cite the version as it was when you downloaded those files.

andrewfoote commented 4 years ago

> @andrewfoote Yes. How big are the files?

@larsvilhuber The raw data files are about 1-4 MB each.

~~Oh and I need to figure out which version it was in...2015?~~

Apparently the citation is:

Minnesota Population Center. National Historical Geographic Information System: Version 11.0 [dataset]. Minneapolis, MN: University of Minnesota, 2016. https://doi.org/10.18128/D050.V11.0

larsvilhuber commented 4 years ago

@andrewfoote OK, can you see if you can attach them as ZIP files to this ticket, and I'll put them into the right location.

andrewfoote commented 4 years ago

@larsvilhuber Attaching here.

If they are all kept in the same folder, they should run. The 00_01_census_creation.do file will need a bit of re-jiggering, but nothing major. Let me know if you want me to do those edits.

githubfolder.zip

larsvilhuber commented 4 years ago

@andrewfoote When was this Stata code downloaded, and was there a "version" line in it? It bombs on various lines:

https://github.com/larsvilhuber/MobZ/blob/ea820aabf49110f6ba191e7a4b74a62a0bbf17ed/raw/nhgis/nhgis0008_ds95_1970_county.do#L650

unknown egen function rowsum()
r(133);

https://github.com/larsvilhuber/MobZ/blob/ea820aabf49110f6ba191e7a4b74a62a0bbf17ed/raw/nhgis/nhgis0008_ds95_1970_county.do#L652

. gen fips = statea||countya 
statea| invalid name
r(198);

Can you find out what the maximum Stata "version" is that needs to be set? Tried it with Stata 14 and 16. rowsum was replaced years ago with rowtotal... so this is old.
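A minimal sketch of the two fixes implied by the errors above, assuming `statea` and `countya` are stored as strings. The component variables (`men_emp`, `women_emp`), the toy values, and the string widths are hypothetical stand-ins, not names taken from the NHGIS extract.

```stata
* Toy data standing in for an NHGIS read-in (hypothetical values and widths)
clear
input str2 statea str3 countya men_emp women_emp
"01" "001" 100 120
"01" "003" 200 .
end

* rowsum() is no longer an egen function; rowtotal() is its replacement.
* By default rowtotal() treats missing values as zero.
egen total_emp = rowtotal(men_emp women_emp)

* Stata concatenates strings with "+", not "||"
gen str5 fips = statea + countya

list
```

If the state and county codes were numeric rather than string, the concatenation would need `string()` with a zero-padded format instead; that case is not covered here.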

larsvilhuber commented 4 years ago

@andrewfoote There are literally 5 lines added to the end of the read-in, and 4 of them don't work:

. keep fips female_emp 
variable female_emp not found
r(111);

(because it is not created - it is femalepop_16_65 that is created)

Can you let me know how to correct them (what are we after here)? I can do global replace, or we can handle the modifications in the downstream programs and concentrate here on getting readin to work...

andrewfoote commented 4 years ago

@larsvilhuber The parts that are bombing I added to the read-in code. I am going to fix them right now and drop them back into the `nhgis` folder.

andrewfoote commented 4 years ago

@larsvilhuber Should be better now?

larsvilhuber commented 4 years ago

@andrewfoote Almost... fixed. 5cae7c6..d06c0c9

andrewfoote commented 4 years ago

@larsvilhuber Oh geez. Too bad there isn't a facepalm emoji option in the "reactions" list.

larsvilhuber commented 4 years ago

You mean this one ?

andrewfoote commented 4 years ago

@larsvilhuber Is this done? I think we resolved these issues, unless I am missing something.

larsvilhuber commented 4 years ago

No, not quite:

https://github.com/larsvilhuber/MobZ/blob/7af0a5ba7c90e55dbf5a9362181b36d8129264bb/programs/07_adh/00_01_census_creation.do#L74

where do those three "cty_industry" files come from?

(Don't edit the file I'm pointing to, I'm working on it.)

larsvilhuber commented 4 years ago

And another thing @andrewfoote

. keep pop* female* fips manu_emp  total_emp bachelors year
variable bachelors not found
r(111);

in the 1970s data. Because the Stata programs don't define variable names, not sure which one it is. It's defined for the other years.

larsvilhuber commented 4 years ago

@andrewfoote One more thing:

https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L36

larsvilhuber commented 4 years ago

@andrewfoote more data:

https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L65

which one is that (qcew employment or earnings)?

andrewfoote commented 4 years ago

> No, not quite:
>
> https://github.com/larsvilhuber/MobZ/blob/7af0a5ba7c90e55dbf5a9362181b36d8129264bb/programs/07_adh/00_01_census_creation.do#L74
>
> where do those three "cty_industry" files come from?
>
> (Don't edit the file I'm pointing to, I'm working on it.)

The cty_industry files were provided by David Dorn, but I can't find the email because of the awful Outlook search feature. I do have the files, which I can drop somewhere.

andrewfoote commented 4 years ago

> @andrewfoote more data:
>
> https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L65
>
> which one is that (qcew employment or earnings)?

@larsvilhuber This should be the employment, total and manufacturing

andrewfoote commented 4 years ago

> @andrewfoote One more thing:
>
> https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L36

This is a pretty bad one. I think that this is the seer county population estimates, read in and then made into a county-year dataset with two variables: pop_16_65 and totalpop, from 1990 to 2015.

@larsvilhuber I can provide code for that, but I don't know how to live-extract it onto ECCO, although I can give you a source: https://seer.cancer.gov/popdata/yr1990_2018.singleages/us.1990_2018.singleages.adjusted.txt.gz

larsvilhuber commented 4 years ago

> > @andrewfoote more data: https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L65
> >
> > which one is that (qcew employment or earnings)?
>
> @larsvilhuber This should be the employment, total and manufacturing

So "qcew_county.dta" file?

> > @andrewfoote One more thing: https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L36
>
> This is a pretty bad one. I think that this is the seer county population estimates, read in and then made into a county-year dataset with two variables: pop_16_65 and totalpop, from 1990 to 2015.
>
> @larsvilhuber I can provide code for that, but I don't know how to live-extract it onto ECCO, although I can give you a source: https://seer.cancer.gov/popdata/yr1990_2018.singleages/us.1990_2018.singleages.adjusted.txt.gz

Give me the code, I can handle the download part on ECCO.

larsvilhuber commented 4 years ago

> > No, not quite: https://github.com/larsvilhuber/MobZ/blob/7af0a5ba7c90e55dbf5a9362181b36d8129264bb/programs/07_adh/00_01_census_creation.do#L74
> >
> > where do those three "cty_industry" files come from? (Don't edit the file I'm pointing to, I'm working on it.)
>
> The cty_industry files were provided by David Dorn, but I can't find the email because of the awful Outlook search feature. I do have the files, which I can drop somewhere.

David Dorn's website has a bunch of files, but because they are in ZIP, hard to know what's in them. If you can read through https://www.ddorn.net/data.htm and identify the relevant ZIP file, that would probably help.

In Outlook, try "from:dorn"

andrewfoote commented 4 years ago

I literally cannot find the email. However, I think he created the files using these imputation files on his website:

https://www.ddorn.net/data/cbp1980_imputations.zip https://www.ddorn.net/data/cbp1990_imputations.zip https://www.ddorn.net/data/cbp2000_imputations.zip

larsvilhuber commented 4 years ago

@andrewfoote Do you still have the file, and can we get it out from Census? Or do we have to recreate it?

andrewfoote commented 4 years ago

I still have the files - I can drop them into repository.

larsvilhuber commented 4 years ago

> > This is a pretty bad one. I think that this is the seer county population estimates, read in and then made into a county-year dataset with two variables: pop_16_65 and totalpop, from 1990 to 2015. @larsvilhuber I can provide code for that, but I don't know how to live-extract it onto ECCO, although I can give you a source: https://seer.cancer.gov/popdata/yr1990_2018.singleages/us.1990_2018.singleages.adjusted.txt.gz
>
> Give me the code, I can handle the download part on ECCO.

@andrewfoote I have downloaded the SEER file, but don't have the read-in code. Can you upload the popcounts.dta file directly, for now? Loose end that we can handle afterwards (see #33)
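For orientation, a read-in along the lines described above might look like the sketch below. It is not the original read-in code. The column positions (year 1-4, state FIPS 7-8, county FIPS 9-11, age 17-18, population 19-26) are an assumption about the SEER fixed-width layout and should be checked against the SEER data dictionary. The two records are made-up toy data written to a temporary file so the example runs on its own.

```stata
* Write two toy records in the assumed SEER fixed-width layout
tempfile seer
tempname fh
file open `fh' using "`seer'", write text
file write `fh' "1990AL01001991112000000100" _n
file write `fh' "1990AL01001991117000000200" _n
file close `fh'

* Read only the fields needed; column positions are an assumption
infix year 1-4 str fips 7-11 age 17-18 pop 19-26 using "`seer'", clear

* Working-age population is counted only on rows with age 16 to 65
gen pop_16_65 = pop if inrange(age, 16, 65)

* One row per county-year with the two variables named in the thread
collapse (sum) totalpop = pop pop_16_65, by(fips year)
keep if inrange(year, 1990, 2015)
list
```

With the real file, the `.txt.gz` download would have to be decompressed first, and the toy `file write` lines would be replaced by the path to the extracted text file.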

larsvilhuber commented 4 years ago

> > > https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L65
> > >
> > > which one is that (qcew employment or earnings)?
> >
> > @larsvilhuber This should be the employment, total and manufacturing
>
> So "qcew_county.dta" file?

This file has annual_avg_empl (by year fips naics2). Can you suggest a transformation that gives the desired structure? Simply subset by manufacturing?
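One possible transformation, sketched on toy data: flag manufacturing rows, then sum over sectors within county-year. The `naics2` coding (`"31-33"` for manufacturing), its string type, and the toy values are assumptions; `total_emp` and `manu_emp` follow the variable names used elsewhere in this thread.

```stata
* Toy stand-in for qcew_county.dta (hypothetical values; naics2 coding assumed)
clear
input str5 fips year str5 naics2 annual_avg_empl
"01001" 2000 "31-33" 50
"01001" 2000 "44-45" 30
"01003" 2000 "31-33" 10
"01003" 2000 "62"    40
end

* Manufacturing employment is non-missing only on manufacturing rows
gen manu_emp = annual_avg_empl if naics2 == "31-33"

* Sum across sectors to get one row per county-year
collapse (sum) total_emp = annual_avg_empl manu_emp, by(fips year)
list
```

If the file also carries an all-industry total row per county-year, that row would need to be dropped before the `collapse`, or `total_emp` would be double counted.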

larsvilhuber commented 4 years ago

Change 7a50eb2 would be for this ticket.

larsvilhuber commented 4 years ago

@andrewfoote : Looking at https://github.com/larsvilhuber/MobZ/blob/7a50eb257aacd5b4c72718e7a8f8aea52bcc0b39/programs/07_adh/02.01.cutoff_loop.do#L25 and following lines (also in 02.02.overall_loop.do): there are still a few lines in there which call programs that do not exist as such:

02.01.cutoff_loop.do:include "$dodir/replication/iteration/aggregatedata.do"
02.01.cutoff_loop.do:   include "$dodir/county_merge.do";

I'm guessing that

aggregatedata.do -> 00_07_aggregatedata.do (but see the last line, which seemed to write out a file that isn't used anywhere else; ambiguity about "industry_data" vs. "industrydata")
county_merge.do -> 00_05_mergecounty.do

Can you confirm?

andrewfoote commented 4 years ago

> > > > https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L65
> > > >
> > > > which one is that (qcew employment or earnings)?
> > >
> > > @larsvilhuber This should be the employment, total and manufacturing
> >
> > So "qcew_county.dta" file?
>
> This file has annual_avg_empl (by year fips naics2). Can you suggest a transformation that gives the desired structure? Simply subset by manufacturing?

I just checked something in to do this, which is in /07_adh/ and should probably be renamed into the sequence.

andrewfoote commented 4 years ago

> > > This is a pretty bad one. I think that this is the seer county population estimates, read in and then made into a county-year dataset with two variables: pop_16_65 and totalpop, from 1990 to 2015. @larsvilhuber I can provide code for that, but I don't know how to live-extract it onto ECCO, although I can give you a source: https://seer.cancer.gov/popdata/yr1990_2018.singleages/us.1990_2018.singleages.adjusted.txt.gz
> >
> > Give me the code, I can handle the download part on ECCO.
>
> @andrewfoote I have downloaded the SEER file, but don't have the read-in code. Can you upload the popcounts.dta file directly, for now? Loose end that we can handle afterwards (see #33)

Doing this right now.

larsvilhuber commented 4 years ago

@andrewfoote Progress!

next error: https://github.com/larsvilhuber/MobZ/blob/326c44016c76eb86932006195b553d2b7ae05ddb/programs/07_adh/00_07_cz_merge.log#L290

andrewfoote commented 4 years ago

> @andrewfoote : Looking at
>
> https://github.com/larsvilhuber/MobZ/blob/7a50eb257aacd5b4c72718e7a8f8aea52bcc0b39/programs/07_adh/02.01.cutoff_loop.do#L25
>
> and following lines (also in 02.02.overall_loop.do): there are still a few lines in there which call programs that do not exist as such:
>
> 02.01.cutoff_loop.do:include "$dodir/replication/iteration/aggregatedata.do"
> 02.01.cutoff_loop.do:   include "$dodir/county_merge.do";
>
> I'm guessing that
>
> aggregatedata.do -> 00_07_aggregatedata.do (but see the last line, which seemed to write out a file that isn't used anywhere else; ambiguity about "industry_data" vs. "industrydata")
>
> county_merge.do -> 00_05_mergecounty.do
>
> Can you confirm?

I believe aggregatedata.do --> zz_aggregatedata

and county_merge is missing. I am going to add it as zz_ctymerge.do

larsvilhuber commented 4 years ago

@andrewfoote Are you saying that merge_county and ctymerge do not do the same thing??? 🤔🙄🤪

andrewfoote commented 4 years ago

@larsvilhuber I am saying something similar to that. The code...is not great in its original form.

larsvilhuber commented 4 years ago

@andrewfoote : knocking them down as they come...

https://github.com/larsvilhuber/MobZ/blob/0900208607814df07b6c4005b5250392bebd0cc1/programs/07_adh/01_table3.log#L333-L337

andrewfoote commented 4 years ago

@larsvilhuber Something is failing in 00_07_cz_merge.do, but I don't know what.

https://github.com/larsvilhuber/MobZ/blob/96e23678ed3550f91c369537fb3c797cf5273d92/programs/07_adh/00_07_cz_merge.do#L60

Can you re-run 00_07? I just added a few QA checks in there to diagnose the problem.

larsvilhuber commented 4 years ago

Currently away from the computer. Will shortly.

-- Lars Vilhuber on mobile device

From: andrewfoote notifications@github.com Sent: Tuesday, August 18, 2020 1:32:37 PM To: larsvilhuber/MobZ MobZ@noreply.github.com Cc: Lars Vilhuber lars.vilhuber@cornell.edu; Mention mention@noreply.github.com Subject: Re: [larsvilhuber/MobZ] Getting ADH subdirectory to run (#13)


larsvilhuber commented 4 years ago

@andrewfoote Just updated the log file. What should it look like? What is it showing on the old runs?

andrewfoote commented 4 years ago

@larsvilhuber It...is showing that Lprime is also either zero or missing for every observation.

I just made one more change - see if that fixes the issue.

Turns out, we were using the wrong name for manufacturing_emp

larsvilhuber commented 4 years ago

@andrewfoote The 02_02 program is now running. Speed is approximately 5 min/10 iterations, so 500 min = 8h...

andrewfoote commented 4 years ago

@larsvilhuber Pretty sure we could run the regressions quietly (as well as everything else)

larsvilhuber commented 4 years ago

@andrewfoote Sure we can. And make it all more efficient... (parallel computing?) But not now. It's running... Have a look at the figures. I hope you don't need the log files... (are you writing out any tables, or saving the results as data?)
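A minimal sketch of running regressions quietly inside a loop while saving the estimates as data rather than relying on log files. The dataset (`sysuse auto`), the regression, and the three iterations are purely illustrative; the actual loop in the 02_02 program is not shown in this thread.

```stata
* Illustrative data; stands in for the analysis dataset
sysuse auto, clear

* Collect one row of results per iteration in a temporary dataset
tempname results
tempfile out
postfile `results' iter b se using "`out'"

forvalues i = 1/3 {
    * quietly suppresses the regression output in the log
    quietly regress price mpg if rep78 != `i'
    post `results' (`i') (_b[mpg]) (_se[mpg])
}
postclose `results'

* The saved estimates can now be used for tables or figures
use "`out'", clear
list
```

Saving to a permanent `.dta` file instead of a tempfile would keep the results available after the run finishes.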
