Relationship between Geocoder.py and Shapify.py (cal_labs.shp)

kbuzard commented 2 years ago

Let's centralize the conversation about the pieces we still don't understand of what Antonio did for the May 2021 draft.

EDIT: I'll try to make one ISSUE for each separate topic.

kbuzard commented 2 years ago

Both GeoCoder.py and Shapify.py create the same file, "cal_labs.shp", in the same folder.

I think GeoCoder.py was an earlier, perhaps more comprehensive script. Antonio may have had problems with it and, being time constrained, either did the geocoding elsewhere or used geocoded_facilities.csv and dropped all but the California labs (manually, or programmatically elsewhere) when he realized he wouldn't have time to do the whole country. If so, we can probably confirm some of these details from other programs (e.g., does anything output geocoded_facilities_cal.csv?).

My evidence is mostly the timing of when files were last modified in the backup folder:

Geocoder.py (last edited: March 22, 2021) takes in matched_data.csv (last edited: March 15, 2021), geocodes it, and writes it as geocoded_facilities.csv (last edited: March 15, 2021). It then merges in cattLabs97.csv (last edited: March 5, 2021), does some operations I don't quite understand yet, and writes the result as LabData\cal_labs.shp (last edited: April 12, 2021), both with all states (lines 64-68) and after dropping everything except California (lines 71-81). It's possible that the file was only ever supposed to have California in it, but there were some problems and lines 71-81 fix that. As we discussed, everything after that is a robustness check.
shapify.py (last edited: April 12, 2021) takes in PngData\geocoded_facilities_cal.csv (last edited: April 12, 2021) and creates LabData\cal_labs.shp (last edited: April 12, 2021).

This timing makes it unlikely that both programs were used to produce the draft we see from Antonio. Geocoder.py may still be something we'd want to use to expand the analysis to the whole country.
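One way to double-check the timing argument is to list the last-modified times directly instead of reading them off the file browser. This is only a minimal sketch: the helper name `modified_times` and the `backup_dir` placeholder are assumptions for illustration and do not come from Antonio's scripts.

```python
import os
from datetime import datetime


def modified_times(folder, names):
    """Return (name, last-modified datetime) pairs, oldest first.

    Files that do not exist in `folder` are skipped instead of raising.
    """
    found = []
    for name in names:
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            stamp = datetime.fromtimestamp(os.path.getmtime(path))
            found.append((name, stamp))
    return sorted(found, key=lambda pair: pair[1])


if __name__ == "__main__":
    backup_dir = "."  # placeholder: point this at the backup folder
    for name, stamp in modified_times(backup_dir, ["GeoCoder.py", "shapify.py"]):
        print(f"{stamp:%Y-%m-%d %H:%M}  {name}")
```

Keep in mind that copying or restoring a backup can reset modification times, so the output is evidence of ordering, not proof.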

kbuzard commented 2 years ago

Which '3stage_local' program was Antonio running, and which files did he use? The scripts are in "G: ...\Admin\kfunctions\Three Stage K-functions"; I moved a copy into Admin\ramosRivera\T-Burk\Python Scripts\k-functions

K-function Local_3Stage (5/8/21)
- Calls only the PLAS field lab data
K-function Local_3Stage_JIT (10/19/20):
- Points was defined on line 38 as `points = r"G:\MAX-Filer\Collab\Labs-kbuzard-S18\Admin\Block Level Analysis\CA_Lab_Data.shp"`
- Jorge modified it to `blocks = r"G:\MAX-Filer\Collab\Labs-kbuzard-S18\Admin\Block Level Analysis\Block Data\CA_Block_New.shp"` because that's where "CA_Block_New.shp" was located.
  - This file holds the subset of California labs that we used in the JUE/RSUE (and I don't think it was the newest one; the one at Admin\Block Level Analysis\CA_Lab_Data.shp was created during Antonio's time, and I don't know why he had to create a new one).
  - I think Antonio used this to replicate the results from the papers, but for some reason switched back to using (1), i.e., K-function Local_3Stage.

We need to determine which of two things happened: 1) Antonio didn't run the analysis on the expanded set of labs (that is, all the labs from the Cattell directory instead of the subset of the top-performing labs from the paper), or 2) the files that he is using in pngwork and onward are not the same files as listed in the 3-stage program. Please look at what the 3-stage program is creating as output, and see if it matches up with what is used in Antonio's scripts. @JorgeValde Please investigate to determine which thing was going on.

We will need to run this analysis on Antonio's full set of California labs. @JorgeValde Please investigate and let me know which file you think does that.

The script "K-function Local_3Stage.py" creates the shapefiles ending in "_Points_cal0.shp" and "_Bufferscal0.shp", but I do not see where it is saving the results. Line 175 says `resultPoints = os.path.join(os.path.dirname(points), fieldZ +""+ fieldB + "_Points_cal0.shp")`; the path is the one specified in line 22.

The output goes to the directory from line 34: os.path.dirname(points) resolves to the directory that the "points" shapefile is read from, so the result files are created there.
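To make the output location concrete, here is a small sketch of how that `os.path.join(os.path.dirname(points), ...)` expression resolves. The input path and the two field values are hypothetical stand-ins, and `ntpath` (the Windows flavor of `os.path`) is used so the result is the same whatever machine runs the sketch.

```python
import ntpath  # Windows path rules, independent of the OS running this sketch


def result_path(points, field_z, field_b, suffix="_Points_cal0.shp"):
    """Build the output path in the same folder as the input shapefile.

    Mirrors the join expression quoted from line 175 of the script.
    """
    return ntpath.join(ntpath.dirname(points), field_z + "" + field_b + suffix)


# Hypothetical input path and field values, for illustration only.
points = r"G:\example\LabData\cal_labs.shp"
print(result_path(points, "Z", "B"))
```

So changing the `points` path changes both the input and the folder where the results land.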

Both @JorgeValde and @Kirs10-Riley : While you're trying to get things to work, change the number of simulations from 999 to 5--this is just a program that takes a long time to run; you can test it out on a smaller number of simulations. It's line 63 in the non-JIT version.
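If editing line 63 by hand each time gets tedious, one option is to read the simulation count from an environment variable with 999 as the default. This is a sketch only: the function name and the variable name `KFUNC_NSIM` are made up for illustration and are not in the existing script.

```python
import os


def simulation_count(default=999, env_var="KFUNC_NSIM"):
    """Return the number of simulations to run.

    Uses `default` unless the (hypothetical) environment variable is set,
    in which case its value must be a positive integer.
    """
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    value = int(raw)
    if value < 1:
        raise ValueError(f"{env_var} must be a positive integer, got {raw!r}")
    return value


n_simulations = simulation_count()
print(f"Running {n_simulations} simulations")
```

Setting `KFUNC_NSIM=5` before launching would then give the quick test run, and leaving it unset keeps the full 999.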

kbuzard commented 2 years ago

I just found the line of code

calLabs97 = cattLabs97[cattLabs97['state_code'] == 'CA']

on Line 75 of field_org.py. Please check, but I'm pretty sure this is where calLabs97 gets created; although I'm not sure how it gets saved.
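For reference, this is roughly what a filter-then-save step looks like. It is a sketch, not what field_org.py does: it uses the standard-library `csv` module instead of pandas, and the function name and file paths are placeholders.

```python
import csv


def save_state_subset(src_csv, dst_csv, state="CA", column="state_code"):
    """Copy the rows whose `column` equals `state` from src_csv to dst_csv.

    Returns the number of data rows written (the header is not counted).
    """
    with open(src_csv, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        rows = [row for row in reader if row[column] == state]
        fieldnames = reader.fieldnames or []
    with open(dst_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

In pandas, the equivalent save would be a call such as `calLabs97.to_csv(path, index=False)`. Searching the scripts for `to_csv` or `to_file` near that line may show whether and where the subset is written out.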

JorgeValde commented 2 years ago

> the files that he is using in pngwork and onward are not the same files as listed in the 3-stage program. Please look at what the 3-stage program is creating as output, and see if it matches up with what is used in Antonio's scripts. @JorgeValde Please investigate to determine which thing was going on.

Regarding the question from the issue "https://github.com/kbuzard/labs/issues/51" about which 3stage script Antonio was using: I am sure he was using "K-function Local_3Stage.py".

K-function Local_3Stage.py is creating all the files in the following folder: "G:\MAX-Filer\Collab\Labs-kbuzard-S18\Admin\ramosRivera\T-Burk\LabData\cal_lab_fields\PLAS"

All of these files were created by me yesterday, and they cover only the plastics industry. We will have to determine what to do with line 34 of K-function Local_3Stage.py, that is, which shapefile will work. The JIT script is using CA_Lab_Data.shp. I think this is what you are asking about in the quote "We will need to run this analysis on Antonio's full set of California labs. @JorgeValde Please investigate and let me know which file you think does that".

I can try and change the directory from line 34 in K-function Local_3Stage.py, to the one in line 34 of the JIT script to run the shapefile "CA_Lab_Data.shp"

kbuzard commented 2 years ago

> I can try and change the directory from line 34 in K-function Local_3Stage.py, to the one in line 34 of the JIT script to run the shapefile "CA_Lab_Data.shp"

Check out CA_Lab_Data.shp and see how many labs it has. If it's <700, then this is the subset of the labs we used in the JUE/RSUE. In that case, we need to figure out which file has all the Cattell labs that Antonio geocoded. THAT's the one we want to run this on.
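A quick way to get the count without opening the file in GIS software is to read it from the shapefile's .dbf sidecar, which holds one attribute record per feature. This is a minimal sketch with a made-up function name; it relies on the dBASE header layout, where bytes 4-7 store the record count as a little-endian 32-bit integer.

```python
import struct


def dbf_record_count(dbf_path):
    """Read the record count from a dBASE (.dbf) file header.

    Bytes 4-7 of the header hold the number of records as a
    little-endian unsigned 32-bit integer.
    """
    with open(dbf_path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8:
        raise ValueError(f"{dbf_path} is too short to be a .dbf file")
    return struct.unpack("<I", header[4:8])[0]
```

For example, `dbf_record_count(r"...\CA_Lab_Data.dbf")` with the real folder filled in. If geopandas is installed, `len(geopandas.read_file(path_to_shp))` should give the same number.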

JorgeValde commented 2 years ago

CA_Lab_Data.shp has 645 labs.

kbuzard commented 2 years ago

> CA_Lab_Data.shp has 645 labs.

That means it's the one from the JUE/RSUE, NOT the new one Antonio was supposed to run for all of California (adding in the labs we were missing from Cattell).

JorgeValde commented 2 years ago

This is just an assumption, but maybe he ran all of the California labs first and then started to run each field separately in the same script. That would explain why we have PLAS as the last input. Looking at the files, cal_labs.shp located in "G:\MAX-Filer\Collab\Labs-kbuzard-S18\Admin\ramosRivera\T-Burk\LabData" has 1,745 labs. How many labs should we be hoping for in California for this to be the correct shapefile to run?

JorgeValde commented 2 years ago

Now, cal_labs.shp is coming from: pngwork.py --> cattlabs97.csv --> GeoCoder.py --> cal_labs.shp --> (Should) K-function Local_3Stage.py

kbuzard commented 2 years ago

> but maybe he run the entire California labs and then started to run each field separately in the same script. This is why we have PLAS as the last input.

I think this is exactly right!

> The file cal_labs.shp located in "G:\MAX-Filer\Collab\Labs-kbuzard-S18\Admin\ramosRivera\T-Burk\LabData" has 1,745 labs. How many labs should be hoping for in California to be the correct shapefile that we want to run?

This sounds like about the right number. But why don't you look through Antonio's draft paper--he probably says in there how many there were.

> pngwork.py --> cattlabs97.csv --> GeoCoder.py --> cal_labs.shp ---> (Should) K-function Local_3Stage.py

This makes sense to me!

JorgeValde commented 2 years ago

In Antonio's draft paper, there is no clear mention of the number of labs. In one paragraph he said "over 1000 more labs than in Buzard et all". And then in Table 1 he has 1166 labs for the 5 mile radius and 1224 labs for 10 miles.

It is not clear, but I think it makes sense. There must be some labs that were dropped in the cleaning process; he was doing something with just single-lab firms.

kbuzard commented 2 years ago

> In Antonio's draft paper, there is not a clear mention on the number of labs. In one paragraph he said "over 1000 more labs than in Buzard et all". An then in Table 1 he has 1166 labs for 5 mile radius and 1224 labs for 10 miles.

Given this, the 1,745 labs makes sense because not all labs will end up in a cluster.

Relationship between Geocoder.py and Shapify.py (cal_labs.shp) #50