kbuzard opened 2 years ago
Thanks @Kirs10-Riley ! Keep a list of the files that won't open and check to see if Jorge knows what they are. If they mostly have the same couple of extensions (e.g., .py), it likely is an issue with not having a program you need on the computer you're working on. I will prioritize getting you access to a virtual machine that has the right software.
@Kirs10-Riley : My notes say that you were granted access to MAX-KBLABS-03 (the virtual machine that has OmniPage on it) last summer. Will you check to see if you can access it?
To do so, you need to get logged in at rds.syr.edu; then open "Remote Desktop Connection" and put in "MAX-KBLABS-03.ad.syr.edu" for the computer name. On the next popup window, put in your usual NetID password. Please let me know whether it works or not.
> @Kirs10-Riley : My notes say that you were granted access to MAX-KBLABS-03 (the virtual machine that has OmniPage on it) last summer. Will you check to see if you can access it?
> To do so, you need to get logged in at rds.syr.edu; then open "Remote Desktop Connection" and put in "MAX-KBLABS-03.ad.syr.edu" for the computer name. On the next popup window, put in your usual NetID password. Please let me know whether it works or not.
It works!
Reflection continued: I refreshed the page and Spyder is installed. Now my problem is getting Spyder to stop forcing me to either fill out an internal problem report or close Spyder.
> Reflection continued: I refreshed the page and Spyder is installed. Now my problem is getting Spyder to stop forcing me to either fill out an internal problem report or close Spyder.
You can contact ictresearch@syr.edu for help with this kind of troubleshooting. Whenever you need admin approval, they're the ones who have to grant it. They're very responsive. Just carefully explain the issue and make sure to tell them which virtual machine you're working on. @JorgeValde This goes for you too.
Hours Worked Today: 4 Total Hours Worked: 88 Hours Worked this week: 13 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, learn how to run scripts in Python, figure out what files are used/not used and add a list of file extensions, figure out what was left of Dylan and Kelly's work Reflection: Ran through 9 Python scripts. I am keeping a list of which ones work and which ones come back with an error to compare with Jorge.
Hours Worked Today: 4 Total Hours Worked: 92 Hours Worked this week: 13 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, learn how to run scripts in Python, figure out what files are used/not used and add a list of file extensions, figure out what was left of Dylan and Kelly's work Reflection: Created a new environment with another package and started to rerun the Python scripts.
Hours Worked Today: 4
Total Hours Worked: 96
Hours Worked this week: 17
Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge.
Reflection: I've been fighting a lot with Spyder but I think I got the program to work! I finished running the Python scripts and sent my notes on each file to Jorge for him to look over. 10 of the 20 scripts I was able to open in Spyder ran successfully, and our outcomes were almost identical as well. My notes are attached below
Anywhere you have a "module not found" error, it means you need to install the relevant module in your environment.
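For example, a quick way to see which modules are missing before running a script (the module names here are only examples):

```python
# Check whether the modules a script imports are available in the
# current environment; anything reported missing needs to be installed
# (e.g., with conda or pip) before the script will run.
import importlib.util

needed = ["pandas", "geopandas", "shapely"]   # example module names
missing = [m for m in needed if importlib.util.find_spec(m) is None]
print("missing modules:", missing or "none")
```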
Most of the other errors look like what I would call sequencing problems--because the scripts aren't run in the right order (so the output from one script hasn't yet been created, and it's needed as an input). This sequencing is what @JorgeValde needs to figure out and document.
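A minimal sketch of what that could look like once the order is known (the script and output names below are placeholders, not the documented sequence):

```python
# Run the pipeline in a fixed order, stopping early if a step fails to
# produce the file the next step needs as input.
import os
import runpy

steps = [
    ("Address_ID.py", "Address_ID89.csv"),
    ("GeoCoder.py",   "matched_data.csv"),
    ("stat_calc.py",  "field_lab_counts2.csv"),
]

for script, output in steps:
    print(f"running {script} ...")
    runpy.run_path(script)
    if not os.path.exists(output):
        raise RuntimeError(f"{script} did not produce {output}; "
                           f"later scripts would fail without it")
```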
And I should have said! Congrats! This is a big step!
Hours Worked Today: 2 Total Hours Worked: 98 Hours Worked this week: 2 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge. Reflection: Completed corrections on the G-Drive catalog. I made the links to the duplicated titles. The files marked duplicate have no original outside of Dylan and Kelly's folders, so I made the "originals" the ones in Dylan's folder because his came first. The duplicates are not involved in any of the programs. But Kelly's corr_cattLabs97_Wgeocode_Line 6200 to Line 12765 document was last edited in August, whereas all of the other corr_cattLabs97_Wgeocode files were edited before that. This could mean that the "master copy" in the pngData folder may not be completely corrected.
Hours Worked Today: 6 Total Hours Worked: 104 Hours Worked this week: 8 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge. Reflection: I read the notes Jorge sent me and re-ran the error scripts with the improved environment. Submitted a pull request of the corrected rmd and pdf for the G-drive catalog; had to go back and make some last-minute edits before re-uploading to my repository. Had a team meeting where we figured out/fixed most of the scripts that did not run the first two times. I submitted a second pull request with the changes to the G-Drive catalog that were made in the meeting.
Thanks @Kirs10-Riley ! The G drive document looks good--I just merged it!
Wednesday, July 13 Hours Worked Today: 1 Total Hours Worked: 105 Hours Worked this week: 9 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge. Reflection: Updated the discussion on "How to run a Python Script" with up-to-date instructions on how to set up the environment for the ramosRivera Python files. I realized that the pdf I submitted a pull request for yesterday was not the correct one with yesterday's changes, so I uploaded and submitted the correct pdf.
Wednesday, July 13 Hours Worked Today: 3 Total Hours Worked: 108 Hours Worked this week: 12 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation Reflection: Started re-running the scripts; so far my outcomes are identical to the notes Jorge sent me yesterday. I will finish re-running all of them before our next meeting and update my notes. I also met with Prof. Buzard and Jorge to troubleshoot the three Python scripts that came back with errors. I was also assigned a new task: find out as much as I can about the digitization process of the 1989 data. I have not started that task yet.
Hours Worked Today: 6 Total Hours Worked: 114 Hours Worked this week: 6 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation Reflection: I looked into the process of the 1989 data and put my initial thoughts and information into a discussion board. I also finished re-running the scripts and here are my notes ->
Hours Worked Today: 2 Total Hours Worked: 116 Hours Worked this week: 8 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation, find cal_labs97.csv Reflection: Meeting, and I looked at Dylan's notes. There was a lot on the OmniPage process but not much insight into what Antonio's process was.
Hours Worked Today: 4 Total Hours Worked: 116 Hours Worked this week: 12 Tasks that I am assigned: Catalog contents of ramosRivera folder on G: drive, run the python scripts and compare notes with Jorge, Kelly investigation, find cal_labs97.csv Reflection:
The closest csv file that I could find to calLabs97 was CattLabs97.csv. It has all the same variables except for unnamed 0_x, unnamed 0.1, and address. There are 12,779 labs and the file is 1,335 KB.
@JorgeValde also wanted me to look into how four files were made, these were my notes:
- matched_data.csv, which is used in GeoCoder.py
- field_lab_counts2, used in stat_calc.py
- cite_same.csv, used in statcalc.py
- cal_labs97.csv
If they are confusing I can add more details.
Thanks for all of this @Kirs10-Riley . I integrated the parts that I knew were relevant to answering @JorgeValde's questions; I'm sure he'll get back to you on the others.
Hours Worked Today: 3 Total Hours Worked: 119 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: I looked for cal_labs97.csv and did not find anything on how it was made. I met with Prof. Buzard and Jorge to discuss what we have been focusing on and the next steps. I am going to start finishing the digitization of the 1989 data that Kelly and Antonio started.
Hours Worked Today: 3 Total Hours Worked: 122 Hours Worked this week: 6 Tasks that I am assigned: Finish digitizing 1989 Reflection: I created a folder in the Python Script folder called 1989 where all of the copied scripts are, but I am thinking of moving this folder somewhere else, most likely under Kelly's work. I need to find/make a lot of files for the data to be digitized and it might be better to have it more centralized. Started looking at the Address_ID script; I only got to line 22 before I got stuck, and my notes are attached below. I will look into this more tomorrow and hopefully have an epiphany.
Hours Worked Today: 1 Total Hours Worked: 123 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: I added the new notes into my documentation and moved the folder from the Python folder into the ramosRivera folder. I also added Kelly's work and the files I found on Ivan Png's website that pertain to 1989.
Hours Worked Today: 5 Total Hours Worked: 128 Hours Worked this week: 5 Tasks that I am assigned: Finish digitizing 1989 Reflection: Notes are attached below; I will create a .md file tomorrow. I was able to create cattLabs89 but the contents are empty. I figured out why, tried to figure out a way to create '89 versions of cattell-all and field, and documented those trials under pngwork.py outcome.
Hours Worked Today: 4 Total Hours Worked: 132 Hours Worked this week: 9 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Prof. Buzard and created cattlabs97.csv. Emailed ICT research and they told me the most likely scenario is that my environment is corrupted; they recommended I delete that environment and create a new one. I successfully deleted both Kirsten_Envi and Kirsten_Envi1 and started creating Envi_Kirsten. I left off in the middle of reinstalling GeoPandas.
Hours Worked Today: 4 Total Hours Worked: 136 Hours Worked this week: 13 Tasks that I am assigned: Finish digitizing 1989 Reflection: Finished creating the environment and launched it, tried opening Spyder, and the same thing occurred. I emailed them again; until then I will just work in the default Spyder environment, since I do not think Address_ID requires any of those packages. The documentation on the wiki has been updated to include up to line 34 of Address_ID.
Hours Worked Today: .5 Total Hours Worked: 136.5 Hours Worked this week: 13.5 Tasks that I am assigned: Finish digitizing 1989 Reflection: Emailed back and forth with ICT; they don't know what's going on either but have a few ideas. I am meeting with them tomorrow morning.
Hours Worked Today: 2.5 Total Hours Worked: 139 Hours Worked this week: 16 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with ICT... twice. After our second meeting they told me to log out so they could go in after lunch and do a couple of things. So I'll be Spyder-less until then.
Hours Worked Today: 5 Total Hours Worked: 144 Hours Worked this week: 5 Tasks that I am assigned: Finish digitizing 1989 Reflection: I heard back from ICT; they said everything should work. My environments are gone, so I started creating a new one but worked from the default environment while that was loading. I got to line 117 before I hit an error. I'll go over my findings tomorrow with Jorge during our meeting.
Hours Worked Today: 2 Total Hours Worked: 146 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Jorge and he figured out the error I received on line 117. I made a note next to the newly added line.
Hours Worked Today: 2 Total Hours Worked: 148 Hours Worked this week: 9 Tasks that I am assigned: Finish digitizing 1989 Reflection: Finished up to line 130, Notes have been added to the wiki.
Hours Worked Today: 5 Total Hours Worked: 153 Hours Worked this week: 14 Tasks that I am assigned: Finish digitizing 1989 Reflection: Completed Address_ID_1989 🥳 and wrote the final two lines that save the data to a csv file called Address_ID89.csv in the 1989 folder. The wiki has been updated.
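For reference, the save step at the end presumably looks something like this (the DataFrame name is a stand-in, not the script's actual variable):

```python
import pandas as pd

# Stand-in frame; in the real script this is the cleaned 1989 data.
address_id89 = pd.DataFrame({"ID": ["A1"], "address": ["12 MAIN ST"]})

# Write to the 1989 folder without the pandas index column.
address_id89.to_csv("1989/Address_ID89.csv", index=False)
```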
> Completed Address_ID_1989 🥳 and wrote the final two lines that save the data to a csv file called Address_ID89.csv in the 1989 folder
It looks GOOD @Kirs10-Riley !!!
I noticed a couple of things that might create problems when geocoding (sometimes there's no comma between the company name and the address; some entries have an asterisk just before the address and some do not), but I think we should try running this through the geocoder and see how it goes.
I think that means that we should be figuring out GeoCoder.py next (@JorgeValde Does this make sense to you?)
Hours Worked Today: 3 Total Hours Worked: 156 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: Meeting, and started looking at the C.py script, marking things that might help the process in address_ID.
Hours Worked Today: 4 Total Hours Worked: 159 Hours Worked this week: 4 Tasks that I am assigned: Finish digitizing 1989 Reflection: I ran into some problems creating matched_data and will probably need some assistance to figure out the next step. I updated my notes with a brief summary of what I did.
> a brief summary of what I did
My best guess is that you're absolutely right that it has something to do with headquarters. I suggest you find a few examples of the locations that are missing from the smaller dataset, and then look them up in the PDF of the directory. If they don't have information about R&D, then they were in the directory only to show the company structure and they're not actually a lab.
If this is right, then we have to figure out how to drop the non-lab observations. It might be the ones that have asterisks in them; or there might be a variable that is systematically missing from those observations (like Field of R&D); or Antonio might have connected them to the field data; I would imagine that only the locations with R&D show up in those lists.
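If one of those markers pans out, the drop could be a simple filter; a hedged sketch (the column names are assumptions, not the real ones):

```python
import pandas as pd

labs = pd.read_csv("cattLabs89.csv")   # assumed input file

# Two candidate heuristics from the discussion above: an asterisk in the
# address, or a missing Field of R&D. Inspect a sample before committing.
has_asterisk  = labs["address"].astype(str).str.contains(r"\*", regex=True)
missing_field = labs["field_of_rd"].isna()

labs_only = labs[~(has_asterisk | missing_field)]
```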
Hours Worked Today: 5 Total Hours Worked: 161 Hours Worked this week: 5 Tasks that I am assigned: Finish digitizing 1989 Reflection:
In the map that Antonio created for the whole U.S. in 1998, there are 10,346 points mapped (that's consistent with his telling me that he wasn't able to geocode around 10% of the labs)--so they all got done somehow. If we can figure out how he got to this, it could be very helpful.
Do either you or @JorgeValde have any idea how he made this map? It would help @Kirs10-Riley and me understand the problems we're having with getting all the lab addresses in 1989.
The Python script called "Prep_labs.py", I believe, creates the file USA_labs_2000.shp; I used that one to create the map I made, and I took that file from the backup documentation. The Python script "Prep_ZBTA.py" creates the file USA_ZBTA_2000.shp. His map should come from one of those two.
Hours Worked Today: 3 Total Hours Worked: 164 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: I did not make much progress trying to find out what made matched_data. But the newest curve ball I found was that matched_data has 8,491 observations, whereas 1997 address_ID has 6,649 observations and catfacilities has 11,319. Moreover, Dylan and Kelly's corr_catlabs97 has 12,795 addresses.
I am starting to think matched_data comes from a different script all together.
Hours Worked Today: 3 Total Hours Worked: 167 Hours Worked this week: 6 Tasks that I am assigned: Finish digitizing 1989 Reflection:
I think that there might be a mistake in line 129, but I am not sure. In the attached notepad (regex notes.pdf) I wrote what the line says verbatim versus what I think he wanted it to say, based on the code. Line 130 says to find all the expressions in "text" that match the regex and put them into a list called extracted_data.
I ran both versions of "regex" and they both came back with 6099 observations.
I started looking at the pdf and will try writing a new "regex" tomorrow.
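For anyone following along, line 130 presumably amounts to the following (the pattern shown is a placeholder, not the one under discussion):

```python
import re

text = open("ocr_1989.txt", encoding="utf-8").read()   # hypothetical file name

regex = r"[A-Z][0-9]{1,3}"                 # placeholder pattern
extracted_data = re.findall(regex, text)   # every non-overlapping match, in order
```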
@Kirs10-Riley Let me know if you need my help with this. Sorry we didn't have more time to talk today!
Hours Worked Today: 1 Total Hours Worked: 168 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: I tried running different sections of the code to see if the order played a role in the number of observations, as discussed at the last meeting, and the current sequence creates the most observations. By cutting one of the lines out or replacing it we lose 2,000+ observations. This tells me that the cleaning code that takes out spaces, special characters, etc. is doing a lot of the work. Next I will look at the pdf and see what the pattern is.
Hours Worked Today: 3 Total Hours Worked: 171 Hours Worked this week: 10 Tasks that I am assigned: Finish digitizing 1989 Reflection: I opened up the 1998 directory and looked to see how the entries are organized, to be able to extract the addresses and names.
The pattern I found was: NAME - address - telephone number - fax number (not always there) - email (not always there) - staff info - followed by a description of the company.
This is true for all the baby/sub facilities (could you remind me what we were calling them? I forgot).
I looked back at the code: he told Python to find all telephone numbers, because the telephone number is the only constant that follows directly after the address. Therefore I am unsure where he took out the baby/sub facilities.
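So the extraction likely anchors on the phone number; a sketch of the idea (the phone format and sample entry are assumptions):

```python
import re

# If the phone number is the one field that reliably follows the address,
# everything before the first phone match is the name + address.
phone = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")   # assumed (XXX) XXX-XXXX format

entry = "ACME LABS, 12 MAIN ST, SPRINGFIELD IL 62701 (217) 555-0143 fax ..."
m = phone.search(entry)
if m:
    name_and_address = entry[:m.start()].strip()
```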
Monday, August 29
Hours Worked Today: 1 Total Hours Worked: 172 Hours Worked this week: 1 Tasks that I am assigned: Finish digitizing 1989 Reflection: I continued examining the code to see where the baby facilities were taken out. Having looked through the entire code, I think it has something to do with the checked_HQ function.
checked_HQ is a helper function that determines whether the ID within a string from the extracted data is an HQ. It uses the bool() function, which converts the result to a boolean (data type: 1 = True, 0 = False). In this function it asks Python to search the data for the expression patterns and hold on to a match if one is there.
- `[A-Z]{1}` - matches one capital letter A-Z (the quantifier applies to the class to its left), exactly once
- `[0-9]{1}` - matches one digit 0-9, exactly once
- `[0-9]{0,1}` - matches a digit 0-9, zero or one time

Searched expressions could include A1 or A12.
I am a little stuck.
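From that breakdown, checked_HQ plausibly boils down to something like this (a reconstruction from the notes above, not the actual script):

```python
import re

def checked_HQ(entry):
    """True if the string contains an HQ-style ID: a capital letter
    followed by one digit, then up to two more optional digits."""
    return bool(re.search(r"[A-Z][0-9][0-9]?[0-9]?", entry))

checked_HQ("A12 ACME LABS")   # True
checked_HQ("sub-lab entry")   # False (no capital-letter + digit code)
```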
Thank you for breaking down the meaning of the regular expressions. This leads me to think there is one relatively easy fix: a LOT of the firms have a number above 99. Is the code catching these (easy to check by looking at the output)? If not, maybe instead of [0-9]{1}[0-9]{0,1}[0-9]{0,1}, we need something like `[0-9]{1,3}`. I'm not sure how to interpret those three statements about the numbers in a row...
That is a separate issue from getting the subsidiary labs included in the output (this is the best term I've come up with so far: "non-HQ" is not quite right, because there are HQs that are also labs). Please look carefully at the output in data_2 to see if those subsidiary labs are getting their codes fixed (that is, the ".1"s etc. added on).
Hours Worked Today: 4 Total Hours Worked: 176 Hours Worked this week: 4 Tasks that I am assigned: Finish digitizing 1989 Reflection: FIXED THE SUB PROBLEM!!!
I reorganized the address_ID Python script, discovered that one of the functions that helps find addresses was not being used, and broke down each part of the code by section.
During my meeting with Prof. Buzard, we discovered that the regex code that reads in the address had a period in the wrong place and needed an extra period in another spot. This change brought the observations up to 10,000, which is 1,000 less than the Ivan Png data.
Hours Worked Today: 3 Total Hours Worked: 179 Hours Worked this week: 7 Tasks that I am assigned: Finish digitizing 1989 Reflection: Continued working on Address_ID and making matched_data.
Getting rid of the Lab names: I tried using lstrip() but I keep getting errors. The four times I got the script to work, I somehow added a comma after every word, added a comma after every letter, made the list nothing but commas, or it did absolutely nothing.
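One likely culprit: lstrip() strips a *set of characters*, not a prefix string, which would explain the mangled output. A small demonstration (the facility names are made up):

```python
name = "ABBOTT LABORATORIES"

# lstrip() keeps removing characters as long as they are IN the set,
# so it can eat into the address itself.
"ABBOTT LABORATORIES, 100 MAIN ST".lstrip(name)
# -> ', 100 MAIN ST'  (stops only because ',' is not in the set)

"ABBOTT LABORATORIES BOSTON RD".lstrip(name)
# -> 'N RD'  (the address's own letters were in the set too)

# Safer: remove the name as a literal prefix, then tidy up.
"ABBOTT LABORATORIES, 100 MAIN ST".removeprefix(name).lstrip(", ")
# -> '100 MAIN ST'  (str.removeprefix needs Python 3.9+)
```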
Matched_ID89: I created a first run of Matched_ID89. I lost about 1,000 observations in this process and the addresses still have the names attached, but it is a good first run! The code is saved and the Matched_ID89 csv is in the 1989 folder.
Hours Worked Today: 3 Total Hours Worked: 182 Hours Worked this week: 10 Tasks that I am assigned: Finish digitizing 1989 Reflection:
matched_id89: I merged the data frames with a left join; this solved the loss of observations but leaves some labs that are doing research address-less. This is most likely due to the OCR not scanning them. These 1,000 might have to be entered manually.
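For the record, that join in pandas is roughly the following (frame and key names are assumptions):

```python
import pandas as pd

facilities = pd.DataFrame({"ID": ["A1", "A2", "A3"],
                           "facility_name": ["X", "Y", "Z"]})
addresses  = pd.DataFrame({"ID": ["A1", "A3"],
                           "address": ["12 MAIN ST", "9 ELM AVE"]})

# how='left' keeps every facility, so no observations are dropped --
# but facilities whose address the OCR missed come through as NaN.
matched_id89 = facilities.merge(addresses, on="ID", how="left")
```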
Getting rid of the Lab names: Made a lot of progress on what not to do. fac_regex(x) is not very helpful or useful in this task, but I think I am using it wrong.
> Getting rid of the Lab names: Made a lot of progress on what not to do
Like I always quote, "There is no such thing as failure. Only learning." :)
Have you tried something like what's discussed in this thread? I think that, after matching, you can search for what's in the "facility_name" variable and keep everything after that. You'd probably have to go through afterwards to take out the "INC"s and things like it, but you'd be close.
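A sketch of that idea (the names and strings below are made up):

```python
import re

facility_name = "ABBOTT LABORATORIES"   # the matched name for this row
raw = "ABBOTT LABORATORIES, 100 ABBOTT PARK RD, ABBOTT PARK IL"

# Drop the matched name (plus any trailing comma/space) from the front;
# a later pass can strip 'INC' and similar suffixes.
address = re.sub(rf"^{re.escape(facility_name)}[,\s]*", "", raw)
# -> '100 ABBOTT PARK RD, ABBOTT PARK IL'
```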
Hours Worked Today: 4 Total Hours Worked: 186 Hours Worked this week: 4 Tasks that I am assigned: Finish digitizing 1989 Reflection: I tried using re.match, re.split, and re.findall on data_7 but I kept getting errors. I read into it and I think the problem is that data_7 is a list and not a string. I made data_8 the string version of data_7 and made catfaclist the string version of catFacList. I made some progress but ran into type errors that I have not figured out yet. I tried googling them but did not find anything that applied.
For re.split, I was thinking of splitting each entry on commas, which (in theory) could leave me with 3-4 parts; then I would delete the first part and merge the rest back together. I have not found a way to make this work.
For lstrip(), I can run this function without getting errors, but for some reason I keep adding commas. I tried learning more about the function and it seems to be doing the exact opposite of what it is supposed to do. So I think I am taking a break from this function.
My next trial is to write a regex that searches for 10-25 capital letters followed by a comma and replaces them with nothing. The problem with this approach is that the sub-labs are not capitalized.
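That plan, as a sketch (the character class and bounds are guesses to tune; as noted, it will miss sub-labs that are not in all caps):

```python
import re

# Strip a leading run of 10-25 capital letters (allowing spaces, '&',
# and periods inside names) followed by a comma.
pattern = re.compile(r"^[A-Z][A-Z &.]{9,24},\s*")

pattern.sub("", "GENERAL ELECTRIC, 1 RIVER RD, SCHENECTADY NY")
# -> '1 RIVER RD, SCHENECTADY NY'
```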
Hours Worked Today: 2 Total Hours Worked (Fall 2022): 2 Hours Worked this week: 6 Tasks that I am assigned: Finish digitizing 1989 Reflection: Used the assign-x-to-lab-names method; this did not work.
Hours Worked Today: 1 Total Hours Worked (Fall 2022): 3 Hours Worked this week: 1 Tasks that I am assigned: Finish digitizing 1989 Reflection: I looked into the A1 lab to see where it went missing. I started with the Png data for 1989. For this I had to use my old environment; this did not work. There are still a lot of bugs in the environment: the pop-up keeps coming up and the kernel never loads. I had to restart the remote desktop to get it to close. Without running each script again, I went into the Excel sheet and found A1 on row 1436 of Address_ID1989. So we didn't lose it; the addresses are just not in the correct order.
Hours Worked Today: 2 Total Hours Worked (Fall 2022): 5 Hours Worked this week: 3 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Prof. Buzard, called IT, and got my desktop to start working again!
Hours Worked Today: 1.5 Total Hours Worked (Fall 2022): 4.5 Hours Worked this week: 1.5 Tasks that I am assigned: Finish digitizing 1989 Reflection: I tried to assign catFacilites.Facility_names to x and create a list of all the lengths, but I kept running into problems. I read up on it and I think I can import these two Excel files into R and figure this out better.
Hours Worked Today: 0.7 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 5.2 Hours Worked this week: 0.7 Tasks that I am assigned: Finish digitizing 1989 Reflection: Worked on the loops and thought of the idea to use the merged data sets to solve the problem of differing lengths. In this way, if there is no facility name then the length would be 0 and nothing will have been deleted, or if they are not a research lab then their address will be deleted. I also wrote the code for getting rid of "INC" using Antonio's previous code layout.
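The "INC" cleaner probably reduces to a substitution like this (the suffix list is an assumption):

```python
import re

# Remove a trailing corporate suffix, with or without a period.
suffix = re.compile(r"\s*\b(?:INC|CORP|CO|LLC|LTD)\.?\s*$")

suffix.sub("", "ABBOTT LABORATORIES INC")   # -> 'ABBOTT LABORATORIES'
suffix.sub("", "ACME RESEARCH CO.")         # -> 'ACME RESEARCH'
```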
Hours Worked Today: 1.4 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 6.6 Hours Worked this week: 2.1 Tasks that I am assigned: Finish digitizing 1989 Reflection: Met with Prof. Buzard and started editing the source document for errors that were producing duplicates of the same facility code.
Hours Worked Today: 2.2 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 6.6 Hours Worked this week: 2.2 Tasks that I am assigned: Finish digitizing 1989 Reflection:
I went through 75% of all of the "Fix these" and fixed them. Ran the list again and got 169 more variables.
I noticed that
Having found this out, I went through a lot of the 5s near the Fix these variables.
Hours Worked Today: 2.6 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 9.2 Hours Worked this week: 4.8 Tasks that I am assigned: Finish digitizing 1989 Reflection: Meeting, created a new fix_these, and started looking at how to take out the Canadian addresses.
Hours Worked Today: 1.3 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 10.5 Hours Worked this week: 1.3 Tasks that I am assigned: Finish digitizing 1989 Reflection:
Process: I looked at the different Canadian area codes in the OCR notepad for 1989.
I got this to work!!!!
The number of Canadian addresses is under the variable Num_of_Can; there are 74 Canadian postal codes that Python picked up on. I took these addresses out; the code is on lines 30-39.
Before getting rid of Canada: Fix_These2 = 1,683 observations. After getting rid of Canada: Fix_These2 = 1,626 observations. -> It only reduced it by 57....
How should I proceed? Should I just go ahead and hand-check the rest of the 1,626 observations?
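For comparison, a hedged sketch of a postal-code filter (the variable names are stand-ins; loosening the pattern to tolerate a missing space or OCR case slips might catch more than 74):

```python
import re

# Canadian postal codes have the shape A1A 1A1 (letter-digit-letter,
# space, digit-letter-digit); the space sometimes drops out in OCR text.
can_postal = re.compile(r"\b[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d\b")

addresses = ["150 BLOOR ST W, TORONTO ON M5S 2X9",   # example rows
             "12 MAIN ST, SPRINGFIELD IL 62701"]
num_of_can = sum(bool(can_postal.search(a)) for a in addresses)   # -> 1
```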
Big win! Even if it didn't fix more of them, this was a big step in terms of turning what you've learned about regular expressions into tangible output.
I don't think you'll have to fix all the rest by hand, because there will probably be other systematic issues that you find that you can use code to deal with. But you'll have to start doing it by hand and figure out what those time-saving fixes are as you go along.
One thing that might help refine the code that fixes the Canadian addresses: page 513 (document numbering) of the Png scan for 1989 lists all the facilities in Canada. It looks like there are probably 100 or more, so it seems like something must be preventing your code from catching all of them.
Hours Worked Today: 1.9 Total Hours Worked (Fall 2022 not logged): 4.5 Total Hours Worked (Fall 2022): 12.4 Hours Worked this week: 1.9 Tasks that I am assigned: Finish digitizing 1989 Reflection:
My next step is to repeat the Canada code for the rest of the foreign countries, using their postal code formats.
Great progress here. Just keep a list of all the ones that you've checked and you can't figure out. I'm hoping that some of them will be resolved by fixing some other error that comes before it. When you have a list of 20 or so that you can't solve, post them for me and I'll see if I can see anything systematic.
@Kirs10-Riley : please log your work daily here using the following format:
Tuesday, May 31