Closed ittshuang closed 3 weeks ago
Lisa uploaded a new firm-site-year level data in data\hh_dataset\publicfirm_site.dta
, which contains relevant columns for matching (Seems no industry info):
siteid
: HH site idyear
: 2001 to 2009emple
: employeecorpid
: HH firm IDreven
avg_reven
gvkey
: compustat identifierSo the file already contains corpid-gvkey link. There are 3521 gvkeys (firms).
Relevant columns in data\firm_info\compustats_segment_00_21.dta
:
gvkey
fyear
: 2001 to 2021emps
: employeeseg_sale
: segment-level sales[ISSUE] Okey, after filtering fyear to 2001-2009 and only contains the 3521 gvkeys, it seems the number of firm-site (472745) >> the number of segments (33902)...
It seems the year is not SOOOO important? So I first did average to get just firm-site or firm-segment level average employee and average revenue.
But after that, the it's still not obvious how should we do the matching by revenue and employee, due to:
gvkey=001004
>>> comp_seg[:10]
gvkey fyear total_sales sid emps seg_sale sic
0 001004 2001.0 638.721 10.0 NaN 99.558 5080
1 001004 2001.0 638.721 12.0 NaN 258.067 5080
2 001004 2001.0 638.721 13.0 NaN 216.727 5080
3 001004 2001.0 638.721 14.0 NaN 64.369 5080
4 001004 2002.0 606.337 18.0 NaN 93.415 5080
5 001004 2002.0 606.337 17.0 NaN 17.387 5080
6 001004 2002.0 606.337 16.0 NaN 130.628 5080
7 001004 2002.0 606.337 15.0 NaN 358.412 5080
8 001004 2003.0 651.958 16.0 NaN 163.557 5080
9 001004 2003.0 651.958 15.0 NaN 349.527 5080
>>> hh_site[:10]
gvkey year corpid siteid emple reven avg_reven
0 001004 2001.0 5689 110023187 99 14 0.141414
1 001004 2001.0 5689 107007880 104 18 0.173077
2 001004 2001.0 5689 114001316 400 344 0.860000
3 001004 2001.0 5689 123000762 150 70 0.466667
4 001004 2001.0 5689 110018783 290 60 0.206897
5 001004 2002.0 5689 114001316 400 344 0.860000
6 001004 2002.0 5689 123000762 90 42 0.466667
7 001004 2002.0 5689 110023187 130 18 0.138462
8 001004 2002.0 5689 107007880 60 11 0.183333
9 001004 2002.0 5689 110018783 200 42 0.210000
Can see the scale of the reven
and seg_sale
are pretty different...
It turns out the stype in compustat segment data is different for each firm (I thought firms will have both BUSSEG and GEOSEG). I re-downloaded the segment level data after removing the stype filter in sas code. But the results are still not super manageable... Even only with the geographical categories, it is not so feasible to match:
I also add industry code to compare (by adding the SICCODE and SICGRP in 2004 raw data to the processed hh_site).
gvkey=001004
:
Compustats segment data:
Can see there are 4 segs in Compustats segment data: (there is no geoseg for this company) while 9 sites in _HHsite data. After the aggregate sales by industry, we can see the scale is still not SOOO same:
Even we believe the site and segment in the same industry group should be matched, the best we can do is to get siteid 16 to the 4 manufacture sites; 2 wholesale retail segments to the last one site.
Closed with no further add-on
Public_hh_dataset_kept.dta
is already at firm-year level with HH variables likecount of site
, and other IS variables. 🤔SITEDESC
andCORP
contains basic information about site and firm, along with the linking inSITEDESC
.SITEDESC
: SICCODE (SIC code (4-Digit)), EMPLE (Site Employees), REVEN (estimated revenue), TURNOVER-LOCAL (Estimated Annual Turnover (Revenue) in Local Currency) - Match with corresponding variables incompustats_segment_00_21.dta
Originally posted by @ittshuang in https://github.com/ittshuang/phil-lisa-work/issues/1#issuecomment-2103125495