kbuzard / labs

Estimating data entry costs for Bowker directories #57

Closed kbuzard closed 1 year ago

kbuzard commented 1 year ago

I need to get an estimate of how many characters there are in the addresses of R&D labs that we have in PDF form (to estimate the cost of having them digitized).

To this end, please do the following:

  1. Download cattell-all.dta and Scans.zip from https://drive.google.com/drive/folders/1JVsX1yhCptfCGsrSNeTNlpvdbO84eyQY
  2. Make a summary table of the number of observations per year in cattell-all.dta (the second variable in the dataset is called "year").
  3. Make a dataset that just has the observations for year 1994. Drop all the variables except id and facility_name.
  4. Split the variable id into three parts: the leading letter, the digits that come after the letter but before the period, and the digits after the period. Sort by all three new columns instead of sorting by id alone, which does not put the ids in the order in which they appear in the PDF (see step 6 below). You may have to fill in a zero for entries that have no period so that they sort correctly (they should come before the ".1" entries).
  5. Output a csv version; it should have id, facility_name, and the three new variables.
  6. Unzip Scans.zip and open the PDF that corresponds to 1994. It may be labeled 1995; check 5 random id/facility_name pairs between the csv and the PDF to make sure it's the right one. Once you've verified this, go to the page that starts with id A1.
  7. Open the 1994 csv. Add columns for street, city, state and zipcode. Start a timer. Enter the addresses in these four new variables for the first 100 rows. Stop the timer and record how long this took.
  8. Use the "=LEN(cell_address)" command in four new columns to count the characters in each of the four columns where you entered data. Find the average number of total characters (adding up the characters for all four columns) across the 100 rows.
  9. Use this average to estimate the total number of characters in the addresses of all the entries in 1994 by extrapolating linearly to the total number of rows for 1994.
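
Steps 2 through 5 can be sketched roughly as follows. This is only an illustration in plain Python, not the method actually used for the task: the sample rows, the facility names, and the new column names (`id_letter`, `id_number`, `id_suffix`) are all made up, and the id pattern (one letter, digits, optional period plus digits) is assumed from the description in step 4. The real data would first have to be loaded from cattell-all.dta, for example with `pandas.read_stata`.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical rows standing in for cattell-all.dta (ids and names are invented).
rows = [
    {"id": "A10", "year": 1994, "facility_name": "Lab D"},
    {"id": "A1.1", "year": 1994, "facility_name": "Lab B"},
    {"id": "B2", "year": 1994, "facility_name": "Lab E"},
    {"id": "A1", "year": 1994, "facility_name": "Lab A"},
    {"id": "A2", "year": 1994, "facility_name": "Lab C"},
    {"id": "A1", "year": 1990, "facility_name": "Lab A"},
]

# Assumed id format: one letter, digits, then optionally a period and more digits.
ID_PATTERN = re.compile(r"^([A-Za-z])(\d+)(?:\.(\d+))?$")


def split_id(raw_id):
    """Return (letter, number, suffix); the suffix is 0 when there is no period."""
    match = ID_PATTERN.match(raw_id.strip())
    if match is None:
        raise ValueError(f"unexpected id format: {raw_id!r}")
    letter, number, suffix = match.groups()
    return letter, int(number), (int(suffix) if suffix is not None else 0)


# Step 2: number of observations per year.
obs_per_year = Counter(row["year"] for row in rows)

# Step 3: keep only 1994, and only id and facility_name.
subset = [
    {"id": row["id"], "facility_name": row["facility_name"]}
    for row in rows
    if row["year"] == 1994
]

# Step 4: split the id and sort on the three parts (numerically, not as text).
for row in subset:
    row["id_letter"], row["id_number"], row["id_suffix"] = split_id(row["id"])
subset.sort(key=lambda r: (r["id_letter"], r["id_number"], r["id_suffix"]))

# Step 5: write a csv (to an in-memory buffer here; a real run would open a file).
fieldnames = ["id", "facility_name", "id_letter", "id_number", "id_suffix"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(subset)

sorted_ids = [row["id"] for row in subset]
print(obs_per_year)
print(sorted_ids)
```

Converting the two numeric parts to integers is what keeps A2 ahead of A10; sorting the raw id string would place A10 first.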

kbuzard commented 1 year ago

@bteruya In case you didn't get a notification when I assigned this task to you.

bteruya commented 1 year ago

Hi Kristy,

ExportCattell2.xlsx

I finished this task.

At the beginning, it took me 15 minutes for every 20 entries; by the end, it took 25 minutes for 40 entries.
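
As a rough illustration of what these timings imply, the two paces can be converted to minutes per entry and scaled up. This is only a sketch: it uses the 1994 firm count of 12,698 reported later in this comment and assumes the pace stays constant over the whole directory, which the thread does not establish. The function name `hours_for` is made up for this example.

```python
# Timings reported above: 15 minutes per 20 entries early on,
# 25 minutes per 40 entries toward the end.
early_rate = 15 / 20  # minutes per entry
late_rate = 25 / 40   # minutes per entry

# 1994 firm count reported in this comment.
total_entries = 12_698


def hours_for(minutes_per_entry, entries=total_entries):
    """Hours needed for all entries, assuming a constant pace (an assumption)."""
    return minutes_per_entry * entries / 60


print(f"early pace: {early_rate:.3f} min/entry, {hours_for(early_rate):.1f} hours")
print(f"late pace:  {late_rate:.3f} min/entry, {hours_for(late_rate):.1f} hours")
```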

The average number of characters per entry is about 35. There are 12,698 firms in total for 1994, so the extrapolated total is _447,478 characters._
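
The linear extrapolation from step 9 can be checked with a few lines of arithmetic. Multiplying the rounded average of 35 by 12,698 gives a slightly smaller figure than the total reported above, which suggests the unrounded average from the spreadsheet was used. That unrounded value is inferred here, not stated in the thread.

```python
total_rows = 12_698        # 1994 firms, as reported above
rounded_average = 35       # average characters per entry, as reported above
reported_total = 447_478   # extrapolated total, as reported above

# Step 9: average characters per entry times the number of rows.
estimate_from_rounded = rounded_average * total_rows

# Average implied by the reported total (an inference, not a reported value).
implied_average = reported_total / total_rows

print(f"35 x {total_rows:,} = {estimate_from_rounded:,}")
print(f"implied average = {implied_average:.2f} characters per entry")
```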

I used two lines for the address: one for the number and street, and the other for the apartment or suite. Some addresses have only a PO Box, and I entered that in the address lines. Some zip codes have a dash in the middle, and I entered them as such.

[https://docs.google.com/spreadsheets/d/19r4vUovew-ZwRExURXaUcay2Yjyxm33F/edit?usp=share_link&ouid=103823610493834312691&rtpof=true&sd=true]

Here is the link to the spreadsheet I worked with. The summaries are in row 101. I also uploaded it.

Best,
Brenda

kbuzard commented 1 year ago

Great--thanks! I've just put all the estimates together and sent this information off to my co-authors and their administrators who will decide whether / how to fund the project!