lizzieinvancouver / egret


scraping USDA Woody plant seed manual #15

Open lizzieinvancouver opened 6 months ago

lizzieinvancouver commented 6 months ago

I have the 2008 PDF version online here

@dbuona Can you report back from your current lab on some good options for scraping data from 'Part II Data for 236 Genera'? Table types we want:

  1. Tables with times of flowering, fruit ripening, and seed dispersal ... they often also have height and age at maturity -- all good info! Examples: Table 3—Abies, fir: phenology of flowering and fruiting, and major characteristics of mature trees, and Table 2—Acacia, acacia: phenology of flowering, fruit ripening, and seed dispersal
  2. Various tables on treatments to get percent and/or rate of germination. For example, for Acer you need at least two tables: Table 5—Acer, maple: warm and cold stratification treatments for internal dormancy and Table 6—Acer, maple: germination test conditions and results for stratified seeds, while many others combine these like Table 5—Aesculus, buckeye: cold stratification periods, germination test conditions, and results
lizzieinvancouver commented 6 months ago

Two options from Dan's contacts:

FIRST

I'm not sure this is the best way to do this, but here is a way that should work and is a little different from Matt's (admittedly, Matt's idea looks a little more efficient, so I would probably start there).

  1. Download the PDF
  2. Open the PDF in Acrobat and export it as an HTML file
  3. Load the HTML file into R
  4. Scrape it using the 'rvest' package as if it were a normal website

The big issue with this is that exporting a PDF to HTML inevitably messes up some of the formatting - so the label for Table 6 that you mentioned is not at the top of the table that actually contains the data. But you can work around that by navigating through the document using the bits of formatting that are preserved. For example, by searching for a paragraph element that contains "Ulmus" in 28pt font, you can arrive at the right section - then you can scrape all data in tables within that section (using the <table> elements in the HTML) and assign them titles afterwards (Tables 1-6) based on their order of appearance. If you needed the information in the table captions from the original document, you could scrape those as well and append them back to the data in the tables with a few extra steps of code. Like I said, I'm not sure this is the most efficient way to go about things, but it is one option.
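To make that concrete, here is a rough sketch in R of what steps 3 and 4 could look like; the file name and the assumption that table captions survive as paragraph text are placeholders, and the selectors would need adjusting to whatever the Acrobat export actually produces:

```r
# Sketch of the HTML-scraping route; "seedmanual.html" and the caption
# heuristic below are placeholders for the real exported file.
library(rvest)

doc <- read_html("seedmanual.html")

# Every <table> in the document, in order of appearance
all_tables <- html_table(html_elements(doc, "table"))

# Paragraph text, to recover captions like "Table 5--Acer, maple: ..."
paras    <- html_text2(html_elements(doc, "p"))
captions <- paras[grepl("^Table", paras)]

length(all_tables)  # how many tables survived the export?
```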

SECOND

This looks like the most promising option, but I would also look into the pdftools package. I was able to get the pdf_text function from pdftools to extract the data in that table, but it's not structured correctly and would require some creative data wrangling to get it how you want. The package I linked lets you select certain parts of the table to extract from, which is why I thought it might be a better option, but it takes some time to set up properly (Java, etc.). Hope this helps somewhat!
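For reference, a minimal sketch of that pdftools route; the file name and page index are placeholders, and pdf_text() returns one string per page that still has to be split into rows and columns by hand:

```r
# Sketch of the pdftools route; the file name and page number are
# placeholders for wherever the target table actually sits in the PDF.
library(pdftools)

pages <- pdf_text("seedmanual.pdf")   # one character string per page
page  <- pages[523]                   # e.g. a page holding a Fagus table

# Columns are separated by runs of spaces, so split each line on 2+ spaces;
# this is only the starting point for the "creative data wrangling"
lines <- strsplit(page, "\n")[[1]]
rows  <- strsplit(trimws(lines), "\\s{2,}")
head(rows)
```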

lizzieinvancouver commented 5 months ago

See also conversation with Vic Vankus about this in issue #11 (12 Jan I think).

selenashew commented 5 months ago

Updates on various data scraping options from the lab meeting on Jan. 16:

lizzieinvancouver commented 5 months ago

@selenashew This chatGPT option sounds like a great approach! My thanks to you and Sandy and @DeirdreLoughnan for it. I am excited to hear how it goes.

lizzieinvancouver commented 5 months ago

More info I should have put here on what to scrape:

I'd like to try to scrape all tables of these two types:

1) 'phenology of flowering and fruiting' (for example: Table 2 for Fagus on pg 520; Table 3 on pg 945)

2) 'germination test conditions and results' (for example: Table 5 for Fagus on pg 523; Table 3 for Aronia on pg 273; Table 4 for Oak on pg 936)

I am also interested in tables like the following (if possible), depending on how many there are (I am mostly interested in the seed-bearing age and seedcrop frequency):

3) 'height, seed-bearing age, and seedcrop frequency' (examples on pg 933, and pg 221)

sandyie commented 4 months ago

Update on data scraping with chatGPT and other options (Feb 10)

  1. ChatGPT 4.0: I have tried using chatGPT 4.0 and a customized GPT for this task. The results were not very good: it failed to identify all tables in files longer than 10 pages and failed to parse tables longer than 5 lines, and the limit of 35 messages per 3 hours is also a bottleneck. This is a conversation history from inputting 30+ pages and asking it to parse 10 tables at a time: https://chat.openai.com/share/1abfc615-3587-4f86-bb94-cc93c900b75c This is a customized GPT for this purpose (it is not working as well as I expected, but please feel free to play around :) ): https://chat.openai.com/g/g-l22W5I0Wl-table-parser-pro (Attached: pdf_to_parse.pdf)

  2. Amazon Textract: I found this AWS tool, which claims to "Extract text and structured data such as tables and forms from documents using artificial intelligence". I tried it on a PDF consisting of Table 2 for Fagus on pg 520, Table 3 on pg 945, Table 5 for Fagus on pg 523, and Table 3 for Aronia on pg 273, and the result is much better than chatGPT. Textract parses all the tables and returns each table as a separate CSV file along with a confidence score.

     Advantages: a. When the confidence score is too low, AWS will send the document for human review and return the result to Amazon S3 (the "Google Drive"). b. The process can be done manually or through an endpoint, so it can be made more automatic. c. A free-tier account gets the first 1,000 pages free and then $1.50 per 1,000 pages, which is much cheaper than chatGPT.

     Disadvantages: a. The parsed CSV files contain some confidence scores and extra formatting, but these can be cleaned with R (see the sketch at the end of this comment). b. Some fields/values may be missed or wrong, as AI can be error-prone; some manual effort is required to quickly scan through as validation.

     This is a result returned by Textract: parsed_pdf.zip

  3. Free online OCR (https://www.onlineocr.net/): This online OCR converts the entire PDF into an Excel file. The tables are embedded in the Excel file and require manual effort to find and extract. I chose this site because it returned the best results among all the free online OCR sites I tried.

     Advantages: a. Tables in the Excel file do not have weird formatting issues, and it is less likely to miss fields/data since it is simply converting PDF to Excel.

     Disadvantages: a. $5.95 per month for 1,000 pages. b. Requires more effort to extract the tables from the Excel file. (Screenshot from 2024-02-10 attached.)

     This is a result returned by the online OCR: online_ocr.xlsx

PDF file for all the exploration: parsed_pdf.pdf
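For reference, here is a minimal, hypothetical sketch of the kind of R cleanup mentioned under the Textract disadvantages; the file path and the assumption that confidence values appear in columns whose names contain "confidence" are placeholders, since the real CSV layout may differ:

```r
# Hypothetical cleanup of one Textract table export; the path and the
# "confidence" column-name pattern are placeholders for the real layout.
library(dplyr)

raw <- read.csv("parsed_pdf/table-1.csv", check.names = FALSE,
                stringsAsFactors = FALSE)

cleaned <- raw %>%
  select(-matches("confidence", ignore.case = TRUE)) %>%   # drop score columns
  mutate(across(where(is.character), trimws))              # strip stray whitespace

write.csv(cleaned, "parsed_pdf/table-1_clean.csv", row.names = FALSE)
```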

lizzieinvancouver commented 4 months ago

@sandyie This is amazing! Let's do the Amazon option and chat tomorrow during scrum about how to get it done.

selenashew commented 3 months ago

March 12, 2024 Updates:

lizzieinvancouver commented 3 months ago

@selenashew Thanks for the update -- what a pain! The plan you and @sandyie have sounds good; I hope it works out well!

selenashew commented 3 months ago

March 22, 2024 Update:

lizzieinvancouver commented 3 months ago

@selenashew Thanks for the update but I am sorry to hear this! If you want to spread the work out to more folks (or have any questions about which tables we want), let me know. Either way, good luck and may the sorting force be with you!

selenashew commented 2 months ago

April 9, 2024 Update:

lizzieinvancouver commented 2 months ago

@selenashew Thanks for the update! It sounds like we're close to an exciting part -- closer to seeing the data! Let me know how it goes and thanks to you and @sandyie

selenashew commented 1 month ago

Hi @lizzieinvancouver @sandyie,

Here is the spreadsheet I've compiled with the data that Sandy and I scraped: https://docs.google.com/spreadsheets/d/1HA3fq327DZ50g86HYuBC0JrxOVRYZCySQH2NDJv-Pns/edit?usp=sharing

Here is the link to the drive with all of the raw scraped tables: https://drive.google.com/drive/folders/1y5iN_zyXsO5P7qmVPlQQtzV0Sze14tCI?usp=sharing

Please note that there are many gaps in the data due to differing columns amongst all of the original data tables -- and there are many typos from parsing errors as well. I will continue to go through and clean up this data as best I can this week!
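For reference, many of the gaps come from stacking tables whose columns differ; in R, dplyr::bind_rows() pads the missing columns with NA. A rough sketch, with the folder path and file pattern as placeholders:

```r
# Sketch of stacking the relabelled germination tables despite differing
# columns; the folder path and file pattern are placeholders.
library(dplyr)

files  <- list.files("raw_scraped_tables", pattern = "germination.*\\.csv$",
                     full.names = TRUE)
tables <- lapply(files, read.csv, check.names = FALSE, stringsAsFactors = FALSE)
names(tables) <- basename(files)

# bind_rows() fills columns a given table lacks with NA; columns that share
# a name must have compatible types across tables
master <- bind_rows(tables, .id = "source_file")
str(master)
```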

lizzieinvancouver commented 1 month ago

@selenashew This is amazing! My thanks to you and @sandyie for doing this. The data look exciting... now that we have them, more of us can probably help with cleaning, but before more cleaning we should prioritize getting all the data and code organized with README info.

I could set up a new repo or we could use this one, but to decide what to do I need to know:

  1. Have you been pushing all your code for this to this repo? (If any files are too big to push, you can transfer ownership of the folder to me so that it stays with the lab.)
  2. How big is the biggest raw scraped table and how big are the files altogether?

Once we decide, we should put ALL the scripts used (whatever coding language) in the repo, with a README that explains what each script did. If you have done any cleaning in the Google files or 'by hand' then we should note that clearly somewhere (and ideally stop, and do the rest of the cleaning in R).

Also, can you make a quick and dirty estimate of how many TOTAL person hours this took (you and Sandy)? I might then use that info next time when we consider scraping things.

Finally, please set up a meeting with @wangxm-forest and look at the SILVICS manual with her. She wants to scrape some info out of that so I think you would be best to advise on how best to do it. @wangxm-forest please keep notes from the meeting on a git issue in your mast traits repo.

Thanks to all of you again! This is exciting.

wangxm-forest commented 1 month ago

I also took a look at the data and it looks amazing! @sandyie @selenashew Thank you for your effort on it. I'll be reaching out shortly to arrange a meeting where we can talk about details further!

selenashew commented 1 month ago

Hi everyone,

@lizzieinvancouver I can definitely push everything to this repo!

@wangxm-forest Sounds great! We'll be looking forward to the meeting!

  1. @sandyie Do you mind speaking more on this topic? I believe Sandy has the Amazon AWS account for the Textract tool that was used to scrape the data.
  2. All the files together amount to a file size of 22.0 MB. The biggest raw scraped data table is 14 KB and is located in ./800-900/wo_ah727-19-1203-801-900-31-40.zip, "germination_table-1.csv".

The manual work consisted of checking through all of the scraped data tables, re-labelling the more relevant ones to reflect which type of table each is (phenology, germination, or seed), and compiling the relevant data into the master spreadsheet. No other cleaning has been done yet for the master spreadsheet, which I am happy to do in R.

@sandyie can also confirm the total hours it took to run the entire manual through the Textract tool. In terms of the manual work, it took about 15 hours on my part, with an additional 10 hours of logistical planning between Sandy and me as we researched possible tools and strategies, tested them out, and discussed how to proceed with this project.

lizzieinvancouver commented 1 month ago

@selenashew Thanks for this! If all the output together is 22 MB then we should add all of this to the EGRET repo, including all code, a README, etc. I think we should make a new folder in analyses called scrapeUSDAseedmanual -- put code of any type in there, add an input folder for files that come from outside the code (that may be just the manual PDF, which at 22 MB I think should be okay), and an output folder for files the code writes out. I might also add separate folders for scraping and cleaning to keep the scripts separate, but that depends on how many you have. Let me know if you have any questions and thank you again!
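Roughly, I am imagining something like this layout (the subfolder names beyond input/output are just suggestions):

```
analyses/scrapeUSDAseedmanual/
  README.md     # what each script does and in what order to run them
  input/        # files from outside the code, e.g. the manual PDF
  scraping/     # Textract / scraping scripts and notes
  cleaning/     # R scripts that tidy the raw tables
  output/       # files the code writes out (raw and cleaned tables)
```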

selenashew commented 1 month ago

I've pushed all files on hand to the repo with the requested folders and have also properly formatted the README file. I will be meeting with @DeirdreLoughnan on May 24 to discuss data cleaning steps and @wangxm-forest on May 30 to discuss data scraping.

lizzieinvancouver commented 3 weeks ago

@selenashew This looks very good! I think you can close this issue unless you or @DeirdreLoughnan think of any good reasons not to (we can set up a separate cleaning issue).

DeirdreLoughnan commented 3 weeks ago

@selenashew @lizzieinvancouver I agree you should feel free to close the issue.

Moving forward we will continue to use git issue #20 to discuss the data cleaning.