kbuzard opened 2 years ago
Thanks @Kirs10-Riley ! Keep a list of the files that won't open and check to see if Jorge knows what they are. If they mostly have the same couple of extensions (e.g., .py), it likely is an issue with not having a program you need on the computer you're working on. I will prioritize getting you access to a virtual machine that has the right software.
@Kirs10-Riley : My notes say that you were granted access to MAX-KBLABS-03 (the virtual machine that has OmniPage on it) last summer. Will you check to see if you can access it?
To do so, you need to get logged in at rds.syr.edu; then open "Remote Desktop Connection" and put in "MAX-KBLABS-03.ad.syr.edu" for the computer name. On the next popup window, put in your usual NetID password. Please let me know whether it works or not.
> @Kirs10-Riley : My notes say that you were granted access to MAX-KBLABS-03 (the virtual machine that has OmniPage on it) last summer. Will you check to see if you can access it?
> To do so, you need to get logged in at rds.syr.edu; then open "Remote Desktop Connection" and put in "MAX-KBLABS-03.ad.syr.edu" for the computer name. On the next popup window, put in your usual NetID password. Please let me know whether it works or not.
It works!
Reflection continued: I refreshed the page and Spyder is installed. Now my problem is getting Spyder to stop forcing me to either fill out an internal problem report or close Spyder.
> Reflection continued: I refreshed the page and Spyder is installed. Now my problem is getting Spyder to stop forcing me to either fill out an internal problem report or close Spyder.
You can contact ictresearch@syr.edu for help with this kind of troubleshooting. Whenever you need admin approval, they're the ones who have to grant it. They're very responsive. Just carefully explain the issue and make sure to tell them which virtual machine you're working on. @JorgeValde This goes for you too.
Hours Worked Today: 4 Total Hours Worked: 88 Hours Worked this week: 13 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, learn how to run scripts in Python, figure out what files are used/not used and add a list of file extensions, figure out what was left of Dylan and Kelly's work Reflection: Ran through 9 Python scripts. I am keeping a list of which ones work and which ones come back with an error to compare with Jorge.
Hours Worked Today: 4 Total Hours Worked: 92 Hours Worked this week: 13 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, learn how to run scripts in Python, figure out what files are used/not used and add a list of file extensions, figure out what was left of Dylan and Kelly's work Reflection: Created a new environment with another package and started to rerun the Python scripts.
Hours Worked Today: 4
Total Hours Worked: 96
Hours Worked this week: 17
Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge.
Reflection: I've been fighting a lot with Spyder but I think I got the program to work! I finished running the Python scripts and sent my notes on each file to Jorge for him to look over. 10 of the 20 scripts I was able to open in Spyder ran successfully, and our outcomes were almost identical as well. My notes are attached below
Anywhere you have a "module not found" error, it means you need to install the relevant module in your environment.
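For example, a quick way to see which modules are missing before running a script (the module names here are only examples):

```python
# Check whether the modules a script imports are available in the
# current environment; anything reported missing needs to be installed
# (e.g., with conda or pip) before the script will run.
import importlib.util

needed = ["pandas", "geopandas", "shapely"]   # example module names
missing = [m for m in needed if importlib.util.find_spec(m) is None]
print("missing modules:", missing or "none")
```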
Most of the other errors look like what I would call sequencing problems--because the scripts aren't run in the right order (so the output from one script hasn't yet been created, and it's needed as an input). This sequencing is what @JorgeValde needs to figure out and document.
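A minimal sketch of what that could look like once the order is known (the script and output names below are placeholders, not the documented sequence):

```python
# Run the pipeline in a fixed order, stopping early if a step fails to
# produce the file the next step needs as input.
import os
import runpy

steps = [
    ("Address_ID.py", "Address_ID89.csv"),
    ("GeoCoder.py",   "matched_data.csv"),
    ("stat_calc.py",  "field_lab_counts2.csv"),
]

for script, output in steps:
    print(f"running {script} ...")
    runpy.run_path(script)
    if not os.path.exists(output):
        raise RuntimeError(f"{script} did not produce {output}; "
                           f"later scripts would fail without it")
```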
And I should have said! Congrats! This is a big step!
Hours Worked Today: 2 Total Hours Worked: 98 Hours Worked this week: 2 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge. Reflection: Completed corrections on the G-Drive catalog. I made the links to the duplicated titles. The files marked duplicate have no original outside of Dylan and Kelly's folders, so I made the "originals" the ones in Dylan's folder because his came first. The duplicates are not involved in any of the programs. But Kelly's corr_cattLabs97_Wgeocode_Line 6200 to Line 12765 document was last edited in August, whereas all of the other corr_cattLabs97_Wgeocode files were edited before that. This could mean that the "master copy" in the pngData folder may not be completely corrected.
Hours Worked Today: 6 Total Hours Worked: 104 Hours Worked this week: 8 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge. Reflection: I read the notes Jorge sent me and re-ran the error scripts with the improved environment. Submitted a pull request of the corrected rmd and pdf for the G-drive catalog; had to go back and make some last-minute edits before re-uploading to my repository. Had a team meeting where we figured out/fixed most of the scripts that did not run the first two times. I submitted a second pull request with the changes to the G-Drive catalog that were made in the meeting.
Thanks @Kirs10-Riley ! The G drive document looks good--I just merged it!
Wednesday, July 13 Hours Worked Today: 1 Total Hours Worked: 105 Hours Worked this week: 9 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge. Reflection: Updated the discussion on "How to run a Python Script" with up-to-date instructions on how to set up the environment for the ramosRivera Python files. I realized that the pdf I submitted a pull request for yesterday was not the correct one with yesterday's changes, so I uploaded and submitted the correct pdf.
Wednesday, July 13 Hours Worked Today: 3 Total Hours Worked: 108 Hours Worked this week: 12 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation Reflection: Started re-running the scripts; so far my outcomes are identical to the notes Jorge sent me yesterday. I will finish re-running all of them before our next meeting and update my notes. I also met with Prof. Buzard and Jorge to troubleshoot the three Python scripts that came back with errors. I was also assigned a new task: find out as much as I can about the digitization process of the 1989 data. I have not started that task yet.
Hours Worked Today: 6 Total Hours Worked: 114 Hours Worked this week: 6 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation Reflection: I looked into the process of the 1989 data and put my initial thoughts and information into a discussion board. I also finished re-running the scripts and here are my notes ->
Hours Worked Today: 2 Total Hours Worked: 116 Hours Worked this week: 8 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation, find cal_labs97.csv Reflection: Meeting, and I looked at Dylan's notes. There was a lot on the OmniPage process but not much insight into what Antonio's process was.
Hours Worked Today: 4 Total Hours Worked: 116 Hours Worked this week: 12 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation, find cal_labs97.csv Reflection:
The closest csv file that I could find to calLabs97 was CattLabs97.csv. It has all the same variables except for unnamed 0_x, unnamed 0.1, and address. There are 12,779 labs and the file is 1,335 KB.
@JorgeValde also wanted me to look into how four files were made, these were my notes:
- matched_data.csv, which is used in GeoCoder.py
- field_lab_counts2, used in stat_calc.py
- cite_same.csv, used in statcalc.py
- cal_labs97.csv
If they are confusing I can add more details.
Thanks for all of this @Kirs10-Riley . I integrated the parts that I knew were relevant to answering @JorgeValde's questions; I'm sure he'll get back to you on the others.
Hours Worked Today: 3 Total Hours Worked: 119 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: I looked for cal_labs97.csv and did not find anything on how it was made. I met with Prof. Buzard and Jorge to discuss what we have been focusing on and the next steps. I am going to start finishing the digitization of the 1989 data that Kelly and Antonio started.
Hours Worked Today: 3 Total Hours Worked: 122 Hours Worked this week: 6 Tasks that I am assigned: Finish digitizing 1989 Reflection: I created a folder in the Python Script folder called 1989 where all of the copied scripts are, but I am thinking of moving this folder somewhere else, most likely under Kelly's work. I need to find/make a lot of files for the data to be digitized and it might be better to have it more centralized. Started looking at the Address_ID script; I only got to line 22 before I got stuck, and my notes are attached below. I will look into this more tomorrow and hopefully have an epiphany.
Hours Worked Today: 1 Total Hours Worked: 123 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: I added the new notes into my documentation and moved the folder from the Python folder into the ramosRivera folder. I also added Kelly's work and the files I found on Ivan Png's website that pertain to 1989.
Hours Worked Today: 5 Total Hours Worked: 128 Hours Worked this week: 5 Tasks that I am assigned: Finish digitizing 1989 Reflection: Notes are attached below; I will create a .md file tomorrow. I was able to create cattLabs89 but the contents are empty. I figured out why, tried to figure out a way to create '89 versions of cattell-all and field, and documented those trials under pngwork.py outcome.
Hours Worked Today: 4 Total Hours Worked: 132 Hours Worked this week: 9 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Prof. Buzard and created cattlabs97.csv. Emailed ICT research and they told me the most likely scenario is that my environment is corrupted; they recommended I delete that environment and create a new one. I successfully deleted both Kirsten_Envi and Kirsten_Envi1 and started creating Envi_Kirsten. I left off in the middle of reinstalling GeoPandas.
Hours Worked Today: 4 Total Hours Worked: 136 Hours Worked this week: 13 Tasks that I am assigned: Finish digitizing 1989 Reflection: Finished creating the environment and launched it, tried opening Spyder, and the same thing occurred. I emailed them again; until then I will just work in the default Spyder environment, since I do not think Address_ID requires any of those packages. The documentation on the wiki has been updated to include up to line 34 of Address_ID.
Hours Worked Today: .5 Total Hours Worked: 136.5 Hours Worked this week: 13.5 Tasks that I am assigned: Finish digitizing 1989 Reflection: Emailed back and forth with ICT; they don't know what's going on either but have a few ideas. I am meeting with them tomorrow morning.
Hours Worked Today: 2.5 Total Hours Worked: 139 Hours Worked this week: 16 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with ICT... twice. After our second meeting they told me to log out so they could go in after lunch and do a couple of things. So I'll be Spyder-less until then.
Hours Worked Today: 5 Total Hours Worked: 144 Hours Worked this week: 5 Tasks that I am assigned: Finish digitizing 1989 Reflection: I heard back from ICT; they said everything should work. My environments are gone, so I started creating a new one but worked from the default environment while that was loading. I got to line 117 before I hit an error. I'll go over my findings tomorrow with Jorge during our meeting.
Hours Worked Today: 2 Total Hours Worked: 146 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Jorge and he figured out the error I received on line 117. I made a note next to the newly added line.
Hours Worked Today: 2 Total Hours Worked: 148 Hours Worked this week: 9 Tasks that I am assigned: Finish digitizing 1989 Reflection: Finished up to line 130, Notes have been added to the wiki.
Hours Worked Today: 5 Total Hours Worked: 153 Hours Worked this week: 14 Tasks that I am assigned: Finish digitizing 1989 Reflection: Completed Address_ID_1989 🥳 and wrote the final two lines that save the data to a csv file called Address_ID89.csv in the 1989 folder. The wiki has been updated.
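For reference, the save step at the end presumably looks something like this (the DataFrame name is a stand-in, not the script's actual variable):

```python
import pandas as pd

# Stand-in frame; in the real script this is the cleaned 1989 data.
address_id89 = pd.DataFrame({"ID": ["A1"], "address": ["12 MAIN ST"]})

# Write to the 1989 folder without the pandas index column.
address_id89.to_csv("1989/Address_ID89.csv", index=False)
```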
> Completed Address_ID_1989 🥳 and wrote the final two lines that save the data to a csv file called Address_ID89.csv in the 1989 folder
It looks GOOD @Kirs10-Riley !!!
I noticed a couple of things that might create problems when geocoding (sometimes there's no comma between the company name and the address; some entries have an asterisk just before the address and some do not), but I think we should try running this through the geocoder and see how it goes.
I think that means that we should be figuring out GeoCoder.py next (@JorgeValde Does this make sense to you?)
Hours Worked Today: 3 Total Hours Worked: 156 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: Meeting, and started looking at the C.py script, marking things that might help the process in address_ID.
Hours Worked Today: 4 Total Hours Worked: 159 Hours Worked this week: 4 Tasks that I am assigned: Finish digitizing 1989 Reflection: I ran into some problems creating matched_data and will probably need some assistance to figure out the next step. I updated my notes with a brief summary of what I did.
> a brief summary of what I did
My best guess is that you're absolutely right that it has something to do with headquarters. I suggest you find a few examples of the locations that are missing from the smaller dataset, and then look them up in the PDF of the directory. If they don't have information about R&D, then they were in the directory only to show the company structure and they're not actually a lab.
If this is right, then we have to figure out how to drop the non-lab observations. It might be the ones that have asterisks in them; or there might be a variable that is systematically missing from those observations (like Field of R&D); or Antonio might have connected them to the field data; I would imagine that only the locations with R&D show up in those lists.
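If one of those markers pans out, the drop could be a simple filter; a hedged sketch (the column names are assumptions, not the real ones):

```python
import pandas as pd

labs = pd.read_csv("cattLabs89.csv")   # assumed input file

# Two candidate heuristics from the discussion above: an asterisk in the
# address, or a missing Field of R&D. Inspect a sample before committing.
has_asterisk  = labs["address"].astype(str).str.contains(r"\*", regex=True)
missing_field = labs["field_of_rd"].isna()

labs_only = labs[~(has_asterisk | missing_field)]
```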
Hours Worked Today: 5 Total Hours Worked: 161 Hours Worked this week: 5 Tasks that I am assigned: Finish digitizing 1989 Reflection:
In the map that Antonio created for the whole U.S. in 1998, there are 10,346 points mapped (that's consistent with his telling me that he wasn't able to geocode around 10% of the labs)--so they all got done somehow. If we can figure out how he got to this, it could be very helpful.
Do either you or @JorgeValde have any idea how he made this map? It would help @Kirs10-Riley and me understand the problems we're having with getting all the lab addresses in 1989.
The Python script called "Prep_labs.py", I believe, creates the file USA_labs_2000.shp; I used that one to create the map I made, and I took that file from the backup documentation. The Python script "Prep_ZBTA.py" creates the file USA_ZBTA_2000.shp. His map should come from one of those two.
Hours Worked Today: 3 Total Hours Worked: 164 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: I did not make much progress trying to find out what made matched_data. But the newest curve ball I found was that matched_data has 8,491 observations, whereas 1997 address_ID has 6,649 observations and catfacilities has 11,319. Moreover, Dylan and Kelly's corr_catlabs97 has 12,795 addresses.
I am starting to think matched_data comes from a different script all together.
Hours Worked Today: 3 Total Hours Worked: 167 Hours Worked this week: 6 Tasks that I am assigned: Finish digitizing 1989 Reflection:
I think that there might be a mistake in line 129, but I am not sure. In the attached notepad (regex notes.pdf) I wrote what the line says verbatim versus what I think he wanted it to say, based on the code. Line 130 says to find all the expressions in "text" that match the regex and put them into a list called extracted_data.
I ran both versions of "regex" and they both came back with 6099 observations.
I started looking at the pdf and will try writing a new "regex" tomorrow.
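For anyone following along, line 130 presumably amounts to the following (the pattern shown is a placeholder, not the one under discussion):

```python
import re

text = open("ocr_1989.txt", encoding="utf-8").read()   # hypothetical file name

regex = r"[A-Z][0-9]{1,3}"                 # placeholder pattern
extracted_data = re.findall(regex, text)   # every non-overlapping match, in order
```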
@Kirs10-Riley Let me know if you need my help with this. Sorry we didn't have more time to talk today!
Hours Worked Today: 1 Total Hours Worked: 168 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: I tried running different sections of the code to see if the order played a role in the number of observations, as discussed at the last meeting, and the current sequence creates the most observations. By cutting one of the lines out or replacing it we lose 2,000+ observations. This tells me that the cleaning code that takes out spaces, special characters, etc. is doing a lot of the work. Next I will look at the pdf and see what the pattern is.
Hours Worked Today: 3 Total Hours Worked: 171 Hours Worked this week: 10 Tasks that I am assigned: Finish digitizing 1989 Reflection: I opened up the 1998 directory and looked to see how the entries are organized, to be able to extract the addresses and names.
The pattern I found was: NAME - address - telephone number - fax number (not always there) - email (not always there) - staff info - followed by a description of the company.
This is true for all the baby/sub facilities (could you remind me what we were calling them? I forgot).
I looked back at the code: he told Python to find all telephone numbers, because the telephone number is the only constant that follows directly after the address. Therefore I am unsure where he took out the baby/sub facilities.
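So the extraction likely anchors on the phone number; a sketch of the idea (the phone format and sample entry are assumptions):

```python
import re

# If the phone number is the one field that reliably follows the address,
# everything before the first phone match is the name + address.
phone = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")   # assumed (XXX) XXX-XXXX format

entry = "ACME LABS, 12 MAIN ST, SPRINGFIELD IL 62701 (217) 555-0143 fax ..."
m = phone.search(entry)
if m:
    name_and_address = entry[:m.start()].strip()
```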
Monday, August 29
Hours Worked Today: 1 Total Hours Worked: 172 Hours Worked this week: 1 Tasks that I am assigned: Finish digitizing 1989 Reflection: I continued examining the code to see where the baby facilities were taken out. Having looked through the entire code, I think it has something to do with the checked_HQ function.
checked_HQ is a helper function that determines whether the ID within a string from the extracted data is an HQ. It uses the bool() function, which converts the result to a boolean (data type: 1 = True, 0 = False). In this function it asks Python to search the data for the expression patterns and hold on to a match if one is there.
- `[A-Z]{1}` - matches one capital letter A-Z (the quantifier applies to the class to its left), exactly once
- `[0-9]{1}` - matches one digit 0-9, exactly once
- `[0-9]{0,1}` - matches a digit 0-9, zero or one time

Searched expressions could include A1 or A12.
I am a little stuck.
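From that breakdown, checked_HQ plausibly boils down to something like this (a reconstruction from the notes above, not the actual script):

```python
import re

def checked_HQ(entry):
    """True if the string contains an HQ-style ID: a capital letter
    followed by one digit, then up to two more optional digits."""
    return bool(re.search(r"[A-Z][0-9][0-9]?[0-9]?", entry))

checked_HQ("A12 ACME LABS")   # True
checked_HQ("sub-lab entry")   # False (no capital-letter + digit code)
```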
Thank you for breaking down the meaning of the regular expressions. This leads me to think there is one relatively easy fix: a LOT of the firms have a number above 99. Is the code catching these (easy to check by looking at the output)? If not, maybe instead of [0-9]{1}[0-9]{0,1}[0-9]{0,1}, we need something like `[0-9]{1,3}`. I'm not sure how to interpret those three statements about the numbers in a row...
That is a separate issue from getting the subsidiary labs included in the output (this is the best term I've come up with so far: "non-HQ" is not quite right, because there are HQs that are also labs). Please look carefully at the output in data_2 to see if those subsidiary labs are getting their codes fixed (that is, the ".1"s etc. added on).
Hours Worked Today: 4 Total Hours Worked: 176 Hours Worked this week: 4 Tasks that I am assigned: Finish digitizing 1989 Reflection: FIXED THE SUB PROBLEM!!!
I reorganized the address_ID Python script, discovered that one of the functions that helps find addresses was not being used, and broke down each part of the code by section.
During my meeting with Prof. Buzard, we discovered that the regex code that reads in the address had a period in the wrong place and needed an extra period in another spot. This change brought the observations up to 10,000, which is 1,000 less than the Ivan Png data.
Hours Worked Today: 3 Total Hours Worked: 179 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: Continued working on Address_ID and making matched_data.
Getting rid of the Lab names: I tried using lstrip() but I keep getting errors. The four times I got the script to work, I somehow added a comma after every word, added a comma after every letter, made the list nothing but commas, or it did absolutely nothing.
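One likely culprit: lstrip() strips a *set of characters*, not a prefix string, which would explain the mangled output. A small demonstration (the facility names are made up):

```python
name = "ABBOTT LABORATORIES"

# lstrip() keeps removing characters as long as they are IN the set,
# so it can eat into the address itself.
"ABBOTT LABORATORIES, 100 MAIN ST".lstrip(name)
# -> ', 100 MAIN ST'  (stops only because ',' is not in the set)

"ABBOTT LABORATORIES BOSTON RD".lstrip(name)
# -> 'N RD'  (the address's own letters were in the set too)

# Safer: remove the name as a literal prefix, then tidy up.
"ABBOTT LABORATORIES, 100 MAIN ST".removeprefix(name).lstrip(", ")
# -> '100 MAIN ST'  (str.removeprefix needs Python 3.9+)
```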
Matched_ID89: I created a first run of Matched_ID89. I lost about 1,000 observations in this process and the addresses still have the names attached, but it is a good first run! The code is saved and the Matched_ID89 csv is in the 1989 folder.
Hours Worked Today: 3 Total Hours Worked: 182 Hours Worked this week: 10 Tasks that I am assigned: Finish digitizing 1989 Reflection:
matched_id89: I merged the data frames with a left join; this solved the loss of observations but leaves some labs that are doing research address-less. This is most likely due to the OCR not scanning them. These 1,000 might have to be entered manually.
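For the record, that join in pandas is roughly the following (frame and key names are assumptions):

```python
import pandas as pd

facilities = pd.DataFrame({"ID": ["A1", "A2", "A3"],
                           "facility_name": ["X", "Y", "Z"]})
addresses  = pd.DataFrame({"ID": ["A1", "A3"],
                           "address": ["12 MAIN ST", "9 ELM AVE"]})

# how='left' keeps every facility, so no observations are dropped --
# but facilities whose address the OCR missed come through as NaN.
matched_id89 = facilities.merge(addresses, on="ID", how="left")
```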
Getting rid of the Lab names: Made a lot of progress on what not to do. fac_regex(x) is not very helpful or useful in this task, but I think I am using it wrong.
> Getting rid of the Lab names: Made a lot of progress on what not to do
Like I always quote, "There is no such thing as failure. Only learning." :)
Have you tried something like what's discussed in this thread? I think that, after matching, you can search for what's in the "facility_name" variable and keep everything after that. You'd probably have to go through afterwards to take out the "INC"s and things like it, but you'd be close.
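A sketch of that idea (the names and strings below are made up):

```python
import re

facility_name = "ABBOTT LABORATORIES"   # the matched name for this row
raw = "ABBOTT LABORATORIES, 100 ABBOTT PARK RD, ABBOTT PARK IL"

# Drop the matched name (plus any trailing comma/space) from the front;
# a later pass can strip 'INC' and similar suffixes.
address = re.sub(rf"^{re.escape(facility_name)}[,\s]*", "", raw)
# -> '100 ABBOTT PARK RD, ABBOTT PARK IL'
```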
Hours Worked Today: 4 Total Hours Worked: 186 Hours Worked this week: 4 Tasks that I am assigned: Finish digitizing 1989 Reflection: I tried using re.match, re.split, and re.findall on data_7 but I kept getting errors. I read into it and I think the problem is that data_7 is a list and not a string. I made data_8 the string version of data_7 and made catfaclist the string version of catFacList. I made some progress but ran into type errors that I have not figured out yet. I tried googling them but did not find anything that applied.
For re.split, I was thinking of splitting each entry on commas, which (in theory) could leave me with 3-4 parts; then I would delete the first part and merge the rest back together. I have not found a way to make this work.
For lstrip(), I can run this function without getting errors, but for some reason I keep adding commas. I tried learning more about the function and it seems to be doing the exact opposite of what it is supposed to do. So I think I am taking a break from this function.
My next trial is to write a regex that searches for 10-25 capital letters followed by a comma and replaces them with nothing. The problem with this approach is that the sub-labs are not capitalized.
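That plan, as a sketch (the character class and bounds are guesses to tune; as noted, it will miss sub-labs that are not in all caps):

```python
import re

# Strip a leading run of 10-25 capital letters (allowing spaces, '&',
# and periods inside names) followed by a comma.
pattern = re.compile(r"^[A-Z][A-Z &.]{9,24},\s*")

pattern.sub("", "GENERAL ELECTRIC, 1 RIVER RD, SCHENECTADY NY")
# -> '1 RIVER RD, SCHENECTADY NY'
```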
Hours Worked Today: 2 Total Hours Worked (Fall 2022): 2 Hours Worked this week: 6 Tasks that I am assigned: Finish digitizing 1989 Reflection: Used the assign-x-to-lab-names method; this did not work.
Hours Worked Today: 1 Total Hours Worked (Fall 2022): 3 Hours Worked this week: 1 Tasks that I am assigned: Finish digitizing 1989 Reflection: I looked into the A1 lab to see where it went missing. I started with the Png data for 1989. For this I had to use my old environment; this did not work. There are still a lot of bugs in the environment: the pop-up keeps coming up and the kernel never loads. I had to restart the remote desktop to get it to close. Without running each script again, I went into the Excel sheet and found A1 on row 1436 of Address_ID1989. So we didn't lose it; the addresses are just not in the correct order.
Hours Worked Today: 2 Total Hours Worked (Fall 2022): 5 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Prof. Buzard, called IT, and got my desktop to start working again!
Hours Worked Today: 1.5 Total Hours Worked (Fall 2022): 4.5 Hours Worked this week: 1.5 Tasks that I am assigned: Finish digitizing 1989 Reflection: I tried to assign catFacilites.Facility_names to x and create a list of all the lengths, but I kept running into problems. I read up on it and I think I can import these two Excel files into R and figure this out better.
Hours Worked Today: 0.7 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 5.2 Hours Worked this week: 0.7 Tasks that I am assigned: Finish digitizing 1989 Reflection: Worked on the loops and thought of the idea to use the merged data sets to solve the problem of differing lengths. In this way, if there is no facility name then the length would be 0 and nothing will have been deleted, or if they are not a research lab then their address will be deleted. I also wrote the code for getting rid of "INC" using Antonio's previous code layout.
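The "INC" cleaner probably reduces to a substitution like this (the suffix list is an assumption):

```python
import re

# Remove a trailing corporate suffix, with or without a period.
suffix = re.compile(r"\s*\b(?:INC|CORP|CO|LLC|LTD)\.?\s*$")

suffix.sub("", "ABBOTT LABORATORIES INC")   # -> 'ABBOTT LABORATORIES'
suffix.sub("", "ACME RESEARCH CO.")         # -> 'ACME RESEARCH'
```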
Hours Worked Today: 1.4 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 6.6 Hours Worked this week: 2.1 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Prof. Buzard and started editing the source document for errors that were producing duplicates of the same facility code.
Hours Worked Today: 2.2 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 6.6 Hours Worked this week: 2.2 Tasks that I am assigned: Finish digitizing 1989 Reflection:
I went through 75% of all of the "Fix these" and fixed them. Ran the list again and got 169 more variables.
I noticed that
Having found this out, I went through a lot of the 5s near the Fix these variables.
Hours Worked Today: 2.6 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 9.2 Hours Worked this week: 4.8 Tasks that I am assigned: Finish digitizing 1989 Reflection: Meeting, created a new fix_these, and started looking at how to take out the Canadian addresses.
Hours Worked Today: 1.3 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 10.5 Hours Worked this week: 1.3 Tasks that I am assigned: Finish digitizing 1989 Reflection:
Process: I looked at the different Canadian area codes in the OCR notepad for 1989.
I got this to work!!!!
The number of Canadian addresses is under the variable Num_of_Can; there are 74 Canadian postal codes that Python picked up on. I took these addresses out; the code is on lines 30-39.
Before getting rid of Canada: Fix_These2 = 1,683 observations. After getting rid of Canada: Fix_These2 = 1,626 observations. -> It only reduced it by 57....
How should I proceed? Should I just go ahead and hand-check the rest of the 1,626 observations?
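For comparison, a hedged sketch of a postal-code filter (the variable names are stand-ins; loosening the pattern to tolerate a missing space or OCR case slips might catch more than 74):

```python
import re

# Canadian postal codes have the shape A1A 1A1 (letter-digit-letter,
# space, digit-letter-digit); the space sometimes drops out in OCR text.
can_postal = re.compile(r"\b[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d\b")

addresses = ["150 BLOOR ST W, TORONTO ON M5S 2X9",   # example rows
             "12 MAIN ST, SPRINGFIELD IL 62701"]
num_of_can = sum(bool(can_postal.search(a)) for a in addresses)   # -> 1
```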
Big win! Even if it didn't fix more of them, this was a big step in terms of turning what you've learned about regular expressions into tangible output.
I don't think you'll have to fix all the rest by hand, because there will probably be other systematic issues that you find that you can use code to deal with. But you'll have to start doing it by hand and figure out what those time-saving fixes are as you go along.
One thing that might help refine the code that fixes the Canadian addresses: page 513 (document numbering) of the Png scan for 1989 lists all the facilities in Canada. It looks like there are probably 100 or more, so it seems like something must be preventing your code from catching all of them.
Hours Worked Today: 1.9 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 12.4 Hours Worked this week: 1.9 Tasks that I am assigned: Finish digitizing 1989 Reflection:
My next step is to repeat the Canada code for the rest of the foreign countries, using their postal code formats.
Great progress here. Just keep a list of all the ones that you've checked and you can't figure out. I'm hoping that some of them will be resolved by fixing some other error that comes before it. When you have a list of 20 or so that you can't solve, post them for me and I'll see if I can see anything systematic.
@Kirs10-Riley : please log your work daily here using the following format:
Tuesday, May 31