Closed by larsvilhuber 4 years ago
My past self thought it would be a good idea to make this file on the M Drive, rather than actually in the line of processing. Is there a tag for "shockingly bad ideas"? I will clean this up and put together the files.
@larsvilhuber Not sure if you want more details than these files, but this is a start - I can also re-create from scratch if needed.
Hm. Also cw_cty_czone.dta
I'm going to integrate the earlier files into the ADH data creation flow directly. QCEW should probably come from the same file we use in the QCEW folder.
@larsvilhuber I finally figured out my naming convention. cw_cty_czone is "crosswalk from county to commuting zone"
And I figured out where it came from: https://www.ddorn.net/data/cw_cty_czone.zip
I will update the read-me
@andrewfoote Next part: Need the source of the NHGIS data (old "census_together.do")
For this one, I believe that the NHGIS terms of use allow redistributing the extracted file, subject to citation. You should also describe how you extracted the file.
@larsvilhuber How should I describe that? Should I just say "extracted from NHGIS" with the following citation:
Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 14.0 [Database]. Minneapolis, MN: IPUMS. 2019. http://doi.org/10.18128/D050.V14.0
@andrewfoote Yes. How big are the files?
@andrewfoote Well, strictly speaking, you need to cite the version as it was when you downloaded those files.
@larsvilhuber The raw data files are about 1-4 MB each.
Oh and I need to figure out which version it was in...2015?
Apparently the citation is:
Minnesota Population Center. National Historical Geographic Information System: Version 11.0 [dataset]. Minneapolis, MN: University of Minnesota, 2016. https://doi.org/10.18128/D050.V11.0
@andrewfoote OK, can you see if you can attach them as ZIP files to this ticket, and I'll put them into the right location.
@larsvilhuber Attaching here.
If kept in all the same folder, they should run. Should require a bit of re-jiggering of the 00_01_census_creation.do file, but nothing major. Let me know if you want me to do those edits.
@andrewfoote When was this Stata code downloaded (was there a "version" line in there?) Because it bombs on various lines:
unknown egen function rowsum()
r(133);
. gen fips = statea||countya
statea| invalid name
r(198);
Can you find out what the maximum Stata "version" is that needs to be set? Tried it with Stata 14 and 16. rowsum was replaced years ago with rowtotal... so this is old.
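For reference, both errors above have current-Stata equivalents. This is a minimal sketch, not the actual NHGIS read-in: the toy data and the names pop_a, pop_b and pop_total are made up, and it assumes statea and countya are string variables (a commented alternative covers the numeric case).

```stata
* Toy data standing in for the NHGIS extract (values are made up).
clear
input str2 statea str3 countya pop_a pop_b
"01" "001" 100 20
"01" "003" 250 .
end

* rowtotal() is the current egen function for row sums;
* it treats missing values as zero.
egen pop_total = rowtotal(pop_a pop_b)

* Stata concatenates strings with +, not ||.
gen str5 fips = statea + countya

* If statea and countya were numeric instead (assumption), pad and convert:
* gen str5 fips = string(statea, "%02.0f") + string(countya, "%03.0f")

list
```

Whether the real extract stores the codes as strings, and with how many digits, needs to be checked against the downloaded files.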
@andrewfoote There are literally 5 lines added to the end of the read-in, and 4 of them don't work:
. keep fips female_emp
variable female_emp not found
r(111);
(because it is not created - it is femalepop_16_65 that is created)
Can you let me know how to correct them (what are we after here)? I can do global replace, or we can handle the modifications in the downstream programs and concentrate here on getting readin to work...
@larsvilhuber The parts that are bombing I added to the read-in code. I am going to fix them right now and drop them back into the `nhgis' folder.
@larsvilhuber Should be better now?
@andrewfoote Almost... fixed. 5cae7c6..d06c0c9
@larsvilhuber Oh geez. Too bad there isn't a facepalm emoji option in the "reactions" list.
You mean this one ?
@larsvilhuber Is this done? I think we resolved these issues, unless I am missing something.
No, not quite: https://github.com/larsvilhuber/MobZ/blob/7af0a5ba7c90e55dbf5a9362181b36d8129264bb/programs/07_adh/00_01_census_creation.do#L74
where do those three "cty_industry" files come from? (Don't edit the file I'm pointing to, I'm working on it.)
And another thing @andrewfoote
. keep pop* female* fips manu_emp total_emp bachelors year
variable bachelors not found
r(111);
in the 1970s data. Because the Stata programs don't define variable names, not sure which one it is. It's defined for the other years.
@andrewfoote more data: https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L65
which one is that (qcew employment or earnings)?
The cty_industry files were provided by David Dorn, but I can't find the email because of the awful Outlook search feature. I do have the files, which I can drop somewhere.
@larsvilhuber This should be the employment, total and manufacturing.
@andrewfoote One more thing: https://github.com/larsvilhuber/MobZ/blob/0ae714b87396d2c81452158149432c1658dacd7d/programs/07_adh/00.02.mergecounty.do#L36
This is a pretty bad one. I think that this is the SEER county population estimates, read in and then made into a county-year dataset with two variables: pop_16_65 and totalpop, from 1990 to 2015.
@larsvilhuber I can provide code for that, but I don't know how to live-extract it onto ECCO, although I can give you a source: https://seer.cancer.gov/popdata/yr1990_2018.singleages/us.1990_2018.singleages.adjusted.txt.gz
So "qcew_county.dta" file (for the QCEW employment)?
Give me the code, I can handle the download part on ECCO.
David Dorn's website has a bunch of files, but because they are in ZIP, hard to know what's in them. If you can read through https://www.ddorn.net/data.htm and identify the relevant ZIP file, that would probably help.
In Outlook, try "from:dorn"
I literally cannot find the email. However, I think he created the files using these imputation files on his website:
https://www.ddorn.net/data/cbp1980_imputations.zip https://www.ddorn.net/data/cbp1990_imputations.zip https://www.ddorn.net/data/cbp2000_imputations.zip
@andrewfoote Do you still have the file, and can we get it out from Census? Or do we have to recreate it?
I still have the files - I can drop them into repository.
@andrewfoote I have downloaded the SEER file, but don't have the read-in code. Can you upload the popcounts.dta file directly, for now? Loose end that we can handle afterwards (see #33)
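Until the original read-in code turns up, here is a rough sketch of what a SEER read-in could look like. It is not the code used for popcounts.dta. The column positions are an assumption based on the usual SEER fixed-width population layout and should be checked against the SEER data dictionary, the file is assumed to be already unzipped in the working directory, and "16 to 65 inclusive" is a guess from the variable name pop_16_65.

```stata
* Hypothetical read-in of the SEER single-ages file (assumed layout:
* year 1-4, state FIPS 7-8, county FIPS 9-11, age 17-18, population 19-26).
infix int year 1-4 str stfips 7-8 str ctyfips 9-11 ///
      byte age 17-18 long pop 19-26 ///
      using "us.1990_2018.singleages.adjusted.txt", clear

* Restrict to the years mentioned in the thread.
keep if inrange(year, 1990, 2015)

* Five-character county FIPS code.
gen str5 fips = stfips + ctyfips

* Working-age population (assumed to mean ages 16 through 65 inclusive).
gen long pop_16_65 = pop * inrange(age, 16, 65)

* Sum over age, sex, race and origin cells to one row per county-year.
collapse (sum) totalpop = pop pop_16_65, by(fips year)

save "popcounts.dta", replace
```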
The "qcew_county.dta" file has annual_avg_empl (by year fips naics2). Can you suggest a transformation that gives the desired structure? Simply subset by manufacturing?
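One possible transformation, sketched on toy data. It assumes that naics2 is a string, that "31-33" marks manufacturing and "10" marks the all-industry total (the usual QCEW convention, but not confirmed for this file), and that the target variables are the manu_emp and total_emp used elsewhere in the thread.

```stata
* Toy data in the shape described above (values are made up).
clear
input year str5 fips str5 naics2 annual_avg_empl
2000 "01001" "10"    1000
2000 "01001" "31-33"  200
2000 "01001" "44-45"  150
2001 "01001" "10"    1100
2001 "01001" "31-33"  210
end

* Keep employment only on the rows of interest (assumed codes), zero elsewhere.
gen double manu_emp  = annual_avg_empl * (naics2 == "31-33")
gen double total_emp = annual_avg_empl * (naics2 == "10")

* One row per county-year with both employment measures.
collapse (sum) manu_emp total_emp, by(fips year)

list
```

If the file has no all-industry row, total_emp would instead have to be the sum of annual_avg_empl over all sectors, which is a different assumption.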
Change 7a50eb2 would be for this ticket.
@andrewfoote: Looking at https://github.com/larsvilhuber/MobZ/blob/7a50eb257aacd5b4c72718e7a8f8aea52bcc0b39/programs/07_adh/02.01.cutoff_loop.do#L25 and following lines (also in 02.02.overall_loop.do), there are still a few lines in there which call programs that do not exist as such:
02.01.cutoff_loop.do:include "$dodir/replication/iteration/aggregatedata.do"
02.01.cutoff_loop.do: include "$dodir/county_merge.do";
I'm guessing that
- aggregatedata.do -> 00_07_aggregatedata.do (but see the last line, which seemed to write out a file that isn't used anywhere else; ambiguity about "industry_data" vs. "industrydata")
- county_merge.do -> 00_05_mergecounty.do
Can you confirm?
I just checked something in to do this (the QCEW transformation), which is in /07_adh/ and should probably be renamed into the sequence.
Doing this right now (uploading popcounts.dta).
@andrewfoote Progress!
I believe aggregatedata.do --> zz_aggregatedata and county_merge is missing. I am going to add it as zz_ctymerge.do
@andrewfoote Are you saying that merge_county and ctymerge do not do the same thing??? 🤔🙄🤪
@larsvilhuber I am saying something similar to that. The code...is not great in its original form.
@andrewfoote : knocking them down as they come...
@larsvilhuber Something is failing in 00_07_cz_merge.do, but I don't know what.
Can you re-run 00_07? I just added a few QA checks in there to diagnose the problem.
Currently away from the computer. Will shortly.
-- Lars Vilhuber on mobile device, replying by email to andrewfoote's comment of Tuesday, August 18, 2020 on "[larsvilhuber/MobZ] Getting ADH subdirectory to run (#13)"
@andrewfoote Just updated the log file. What should it look like? What is it showing on the old runs?
@larsvilhuber It...is showing that Lprime is also either zero or missing for every observation.
I just made one more change - see if that fixes the issue.
Turns out, we were using the wrong name for manufacturing_emp
@andrewfoote The 02_02 program is now running. Speed is approximately 5 min per 10 iterations, so 500 min ≈ 8 h...
@larsvilhuber Pretty sure we could run the regressions quietly (as well as everything else)
@andrewfoote Sure we can. And make it all more efficient... (parallel computing?) But not now. It's running... Have a look at the figures. I hope you don't need the log files... (are you writing out any tables, or saving the results as data?)
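A sketch of how the loop could run quietly and still keep its results as data rather than in the log. It is illustrative only: the auto dataset, the regression, the bootstrap resampling and the file name iteration_results.dta are placeholders, not the project's actual model or outputs.

```stata
* Placeholder data and model; the real loop would use the project's own.
sysuse auto, clear
set seed 12345

* Open a results file with one row per iteration.
tempname results
postfile `results' int iter double b double se ///
    using "iteration_results.dta", replace

forvalues i = 1/10 {
    preserve
    * Resample with replacement so each iteration differs.
    quietly bsample
    * quietly suppresses the regression table in the log.
    quietly regress price weight
    post `results' (`i') (_b[weight]) (_se[weight])
    restore
}

postclose `results'

* The coefficients are now available as a dataset.
use "iteration_results.dta", clear
summarize b se
```

Saving the estimates this way means the log files are no longer the only record of a long run.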
cty_census.dta was created using a number of different inputs: QCEW, census extracts, and population counts from SEER.