wangxm-forest / mast_trait


Data scraping for Silvics #3

Open wangxm-forest opened 3 months ago

wangxm-forest commented 3 months ago

@lizzieinvancouver @selenashew

I found this website for Silvics of North America, which has a table of contents that directs us to the different species: Silvics of North America

The information we are interested in is mostly in the "Life History" section. Unfortunately, most of it is presented as prose rather than in a table, so both Selena and I believe we may need to extract the information manually.
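In case it helps later, here is a rough sketch (Python, using requests and BeautifulSoup) of how we might pull just the "Life History" text from a single species page. The URL is a placeholder, and the heading used to mark the end of the section ("Special Uses") would need checking against the actual pages.

```python
# Rough sketch: fetch one species page and slice out the "Life History" text.
# The URL below is a placeholder, not the real Silvics page address.
import requests
from bs4 import BeautifulSoup

SPECIES_URL = "https://example.org/silvics/abies_amabilis.htm"  # hypothetical URL


def get_life_history(url: str) -> str:
    """Return the raw text between the 'Life History' heading and the next heading."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text("\n")

    start = text.find("Life History")
    if start == -1:
        return ""  # heading not found on this page
    # "Special Uses" is assumed to be the next major heading; adjust per page.
    end = text.find("Special Uses", start)
    return text[start:end if end != -1 else None].strip()


if __name__ == "__main__":
    print(get_life_history(SPECIES_URL)[:500])
```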

lizzieinvancouver commented 3 months ago

both Selena and I believe that we may need to extract the information manually.

@wangxm-forest Sounds good!

selenashew commented 2 weeks ago

Hi @wangxm-forest,

I'm so sorry that I wasn't able to work on this much this summer! To help kick things off, I have created a new Data Input folder containing the two Silvics PDFs that are to be scraped. I've also created an Excel file called silvicsDataScraping.xlsx that lays out the specific columns we want to scrape, along with what I was able to pull for Abies amabilis.

I looked into possible text parsing tools, and unfortunately it seems the tool we used for the USDA manual (Amazon Textract) may not work here, since it is built to extract data from pre-made tables rather than free text. However, simply prompting ChatGPT to parse the text into a data table performed quite well:

[screenshots: ChatGPT parsing the Silvics text into a data table]

Using an LLM such as ChatGPT with some prompt engineering could be extremely helpful for extracting data from text here, although the results would still need to be manually copied and pasted into the Excel file that I've created.
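If we wanted to cut down on the copy-and-paste step, here is a minimal sketch of scripting the same idea, assuming the OpenAI Python client ("openai" package) with an API key in the environment. The model name and the column list are placeholders, not the final set of traits.

```python
# Minimal sketch: ask a model to map a "Life History" passage onto our columns,
# returning JSON we can later write into silvicsDataScraping.xlsx.
import json
from openai import OpenAI

# Hypothetical column names -- replace with the columns in the Excel template.
COLUMNS = ["species", "seed_production_age", "mast_interval_years", "flowering_time"]

client = OpenAI()


def parse_life_history(text: str) -> dict:
    """Extract the target fields from one species' Life History text as a dict."""
    prompt = (
        "Extract the following fields from the text below and return a JSON object "
        f"with exactly these keys: {COLUMNS}. Use null when a field is not mentioned.\n\n"
        "Text:\n" + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

The returned dicts could then be appended to the Excel file with pandas instead of copied over by hand, but the manual route works fine too.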

wangxm-forest commented 2 weeks ago

@selenashew Thank you for working on this! I think the most challenging part of Silvics is the inconsistency in the species descriptions: not every species has the same traits included. However, using ChatGPT to parse the text into a data table seems promising. I'll look into that further. Thanks again!
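One way to deal with that inconsistency downstream: collect whatever fields each species page actually reports and let pandas fill the gaps when the per-species records are combined. The values below are illustrative placeholders, not real data, and the column names are just examples.

```python
# Sketch: combine per-species extraction results into one table; traits a
# species lacks simply become NaN, so the master table keeps a fixed set of columns.
import pandas as pd

records = [
    # illustrative placeholder values, not actual extracted data
    {"species": "Abies amabilis", "seed_production_age": 20, "mast_interval_years": 3},
    {"species": "Abies balsamea", "flowering_time": "May"},  # no mast interval reported
]

# from_records aligns on the union of keys across all records.
traits = pd.DataFrame.from_records(records)
print(traits)
```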