ittshuang / phil-lisa-work

Repo for recording RP work for Phil and Lisa
0 stars 0 forks source link

### Check matching between the segment data and sites in `Public_hh_dataset_kept.dta` #2

Closed ittshuang closed 3 weeks ago

ittshuang commented 1 month ago
          ### Check matching between the segment data and sites in `Public_hh_dataset_kept.dta`

Originally posted by @ittshuang in https://github.com/ittshuang/phil-lisa-work/issues/1#issuecomment-2103125495

ittshuang commented 1 month ago

HH site-level data

Lisa uploaded a new firm-site-year level data in data\hh_dataset\publicfirm_site.dta, which contains relevant columns for matching (Seems no industry info):

So the file already contains corpid-gvkey link. There are 3521 gvkeys (firms).

Compustats segment data

Relevant columns in data\firm_info\compustats_segment_00_21.dta:

[ISSUE] Okey, after filtering fyear to 2001-2009 and only contains the 3521 gvkeys, it seems the number of firm-site (472745) >> the number of segments (33902)...

Get to the matching

It seems the year is not SOOOO important? So I first did average to get just firm-site or firm-segment level average employee and average revenue.

But after that, the it's still not obvious how should we do the matching by revenue and employee, due to:

>>> comp_seg[:10] 
    gvkey   fyear  total_sales   sid  emps  seg_sale   sic
0  001004  2001.0      638.721  10.0   NaN    99.558  5080
1  001004  2001.0      638.721  12.0   NaN   258.067  5080
2  001004  2001.0      638.721  13.0   NaN   216.727  5080
3  001004  2001.0      638.721  14.0   NaN    64.369  5080
4  001004  2002.0      606.337  18.0   NaN    93.415  5080
5  001004  2002.0      606.337  17.0   NaN    17.387  5080
6  001004  2002.0      606.337  16.0   NaN   130.628  5080
7  001004  2002.0      606.337  15.0   NaN   358.412  5080
8  001004  2003.0      651.958  16.0   NaN   163.557  5080
9  001004  2003.0      651.958  15.0   NaN   349.527  5080

>>> hh_site[:10] 
    gvkey    year  corpid     siteid  emple  reven  avg_reven
0  001004  2001.0    5689  110023187     99     14   0.141414
1  001004  2001.0    5689  107007880    104     18   0.173077
2  001004  2001.0    5689  114001316    400    344   0.860000
3  001004  2001.0    5689  123000762    150     70   0.466667
4  001004  2001.0    5689  110018783    290     60   0.206897
5  001004  2002.0    5689  114001316    400    344   0.860000
6  001004  2002.0    5689  123000762     90     42   0.466667
7  001004  2002.0    5689  110023187    130     18   0.138462
8  001004  2002.0    5689  107007880     60     11   0.183333
9  001004  2002.0    5689  110018783    200     42   0.210000

Can see the scale of the reven and seg_sale are pretty different...

ittshuang commented 1 month ago

Check geographical segment matching

It turns out the stype in compustat segment data is different for each firm (I thought firms will have both BUSSEG and GEOSEG). I re-downloaded the segment level data after removing the stype filter in sas code. But the results are still not super manageable... Even only with the geographical categories, it is not so feasible to match:

I also add industry code to compare (by adding the SICCODE and SICGRP in 2004 raw data to the processed hh_site).

Can see there are 4 segs in Compustats segment data: (there is no geoseg for this company) while 9 sites in _HHsite data. After the aggregate sales by industry, we can see the scale is still not SOOO same:

image

Even we believe the site and segment in the same industry group should be matched, the best we can do is to get siteid 16 to the 4 manufacture sites; 2 wholesale retail segments to the last one site.

ittshuang commented 3 weeks ago

Closed with no further add-on