hackforla / food-oasis

Repository for the current redevelopment of the Food Oasis Los Angeles website
https://foodoasis.la
GNU General Public License v2.0

Explore data collecting tools #1358

Open GigiUxR opened 1 year ago

GigiUxR commented 1 year ago

Explore the benefits, feasibility, and efforts necessary to implement an automated web scraping process to collect and parse raw data from relevant websites.

The idea of automated web scraping came to me while I was considering easier options for data validation. If we follow through with web scraping, it will open up possibilities for issue #996 (in which we could advise pantries to include information on their site per seeker demand as listed in our directory).

staceyrebekahscott commented 1 year ago

@GigiUxR Moving this to In Progress.

You might already be thinking this, but I would recommend discussing this with John D to see what is possible here.

Re: your comment in the overview about advising pantries- this is also something we can do by getting feedback from partner organizations, an idea I can roll into the Partnerships sub project.

entrotech commented 1 year ago

We did a LOT of work in this area in 2019. This competitive analysis document summarizes what other sites at the time were doing, and evaluates them as possible sources of data. The rows at the top record what details our "competitors" had about each listing, and the rows at the bottom assess each site as a possible source of data.

From a technical perspective, the techniques for importing data are - in order from most reliable and maintainable to least:

  1. Access to an API that allows us to import data in a well-defined structured format (usually XML or JSON),
  2. Access to a downloadable file (preferably a spreadsheet in a grid-like format),
  3. (As a last resort) web scraping their site.
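As a rough illustration of why the first two options are easier to maintain, here is a minimal Python sketch that reads the same listing from a JSON payload and from a CSV export into one common record shape. The field names (`name`, `address`, `phone`) and the sample data are assumptions made up for this example, not the format of any real source.

```python
import csv
import io
import json

# Hypothetical field names; a real source would need its own mapping.
FIELDS = ("name", "address", "phone")


def _clean(record):
    """Keep only the fields we care about, as stripped strings."""
    return {f: str(record.get(f) or "").strip() for f in FIELDS}


def records_from_json(text):
    """Parse a JSON array of objects into a list of cleaned records."""
    return [_clean(item) for item in json.loads(text)]


def records_from_csv(text):
    """Parse CSV text with a header row into a list of cleaned records."""
    return [_clean(row) for row in csv.DictReader(io.StringIO(text))]


# Made-up sample data for illustration only.
json_text = '[{"name": "Example Pantry", "address": "123 Main St", "phone": "213-555-0100"}]'
csv_text = "name,address,phone\nExample Pantry,123 Main St,213-555-0100\n"

json_records = records_from_json(json_text)
csv_records = records_from_csv(csv_text)
```

With structured input like this, both paths take only a few lines and produce identical records, which is not the case for scraping.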

The problems with web scraping are:

  1. The legality is questionable. Many sites have explicit rules about not scraping their data.
  2. The site's HTML must be structured in such a way that it is scrapable. This means that the elements we want to extract have HTML tags or attributes (such as ids or class names) that definitively identify the pieces of information. If the site uses paging to display long lists of entries on multiple pages, the programming is a bit more involved to page through all the data.
  3. Sites that use React or other dynamically generated content require a somewhat more sophisticated scraping library capable of running a "headless browser" which can execute the page's JavaScript to navigate the site.
  4. Each site you want to scrape is a separate programming project tailored to the HTML layout of the site.
  5. If the site design changes, you have to build the scraping component all over again from scratch.
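To make points 2, 4, and 5 concrete, here is a minimal sketch using Python's standard-library `html.parser`. The HTML snippet and the class names (`pantry`, `pantry-name`, `pantry-address`) are invented for illustration; a real site would have its own layout, and the parser would have to be rewritten to match it. The sketch does not handle paging or JavaScript-rendered pages.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect text from elements whose class attribute names a field."""

    # Hypothetical class names mapped to our field names.
    FIELDS = {"pantry-name": "name", "pantry-address": "address"}

    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None  # field we expect the next text node to fill

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "pantry" in classes:
            self.listings.append({})  # a new listing begins
        for cls in classes:
            if cls in self.FIELDS and self.listings:
                self._field = self.FIELDS[cls]

    def handle_data(self, data):
        if self._field and data.strip():
            self.listings[-1][self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        self._field = None  # an empty element leaves the field unset


def scrape(html):
    parser = ListingParser()
    parser.feed(html)
    return parser.listings


# Made-up page fragment for illustration only.
SAMPLE_HTML = """
<ul>
  <li class="pantry"><span class="pantry-name">Example Pantry</span>
      <span class="pantry-address">123 Main St</span></li>
  <li class="pantry"><span class="pantry-name">Sample Kitchen</span>
      <span class="pantry-address">456 Oak Ave</span></li>
</ul>
"""

listings = scrape(SAMPLE_HTML)
```

Everything here depends on the class names staying the same. If the site renames `pantry-name` or restructures the list, the parser silently returns nothing useful.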

We actually did some importing and scraping of data for Food Oasis. The ones that I remember off the top of my head are:

  1. We imported data from 211.org via their public API
  2. After several attempts, someone (I think it was Jabari Brown) got hold of a spreadsheet from LA Regional Food Bank of all their affiliates at the time, which we imported.
  3. We scraped data from the LA Public Library site (https://www.lapl.org/homeless-resources-food)

In each case, the result was a table of imported data. Once this is done, the real fun begins when we try to match the imported entries with our existing data to see whether they are listings we already know about, new ones, or ones that no longer exist. The most reliable matching tends to be by first normalizing the address to a standard format, running it through a geocoder to get lat/long coordinates, and then running an algorithm to try to match to our existing locations. This is more reliable than matching on the name of the pantry. Matching by phone number is more definitive but only works some of the time.
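The matching idea can be sketched roughly as follows in Python. This is not the code we actually ran; it assumes each record has already been geocoded into `lat`/`lon` fields, and the 50-meter threshold, the suffix table, and the record field names are all assumptions chosen for illustration.

```python
import math
import re

# Hypothetical, deliberately tiny abbreviation table.
SUFFIXES = {"street": "st", "avenue": "ave", "boulevard": "blvd"}


def normalize_address(address):
    """Lowercase, drop punctuation, collapse spaces, abbreviate suffixes."""
    words = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(SUFFIXES.get(w, w) for w in words)


def normalize_phone(phone):
    """Reduce a phone number to its last 10 digits, or '' if too short."""
    digits = re.sub(r"\D", "", phone or "")
    return digits[-10:] if len(digits) >= 10 else ""


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def find_match(imported, existing, max_distance_m=50.0):
    """Return (record, reason): phone match first, else nearest location."""
    phone = normalize_phone(imported.get("phone"))
    if phone:
        for rec in existing:
            if normalize_phone(rec.get("phone")) == phone:
                return rec, "phone"
    best, best_d = None, max_distance_m
    for rec in existing:
        d = haversine_m(imported["lat"], imported["lon"], rec["lat"], rec["lon"])
        if d <= best_d:
            best, best_d = rec, d
    return (best, "location") if best else (None, None)
```

Phone is tried first because an exact phone match is the more definitive signal when it is available. The distance check is the fallback that works for most records.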

Once we decide that an imported listing is a match, then we need to re-organize their fields to match ours and compare values. Then we need to decide if the imported record is more accurate than the one we already have. In the vast majority of cases, our information is newer.
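A small Python sketch of the re-organize-and-compare step is shown below. The source field names (`org_name`, `street`, `tel`) and our field names are hypothetical. The sketch only reports differences; deciding which value is more accurate is left to a person.

```python
# Hypothetical mapping from an imported source's field names to ours.
FIELD_MAP = {"org_name": "name", "street": "address", "tel": "phone"}


def to_our_schema(imported):
    """Rename the imported record's fields to our field names."""
    return {ours: imported[theirs] for theirs, ours in FIELD_MAP.items() if theirs in imported}


def diff_fields(ours, mapped):
    """Return {field: (our_value, imported_value)} for fields that differ."""
    return {k: (ours.get(k), v) for k, v in mapped.items() if ours.get(k) != v}


# Made-up records for illustration only.
our_record = {"name": "Example Pantry", "address": "123 Main St", "phone": "213-555-0100"}
imported_record = {"org_name": "Example Pantry", "street": "125 Main St", "tel": "213-555-0100"}

differences = diff_fields(our_record, to_our_schema(imported_record))
```

Here `differences` contains only the `address` field, which is the one value a reviewer would need to check.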

For imported records that don't match an existing listing in Food Oasis, the questions are whether it is still open, whether it lies within the county, and whether it would be properly categorized as a Food Pantry or Meal Program. If there is enough contact information in the imported record, then we would need to contact the pantry to gather at least the minimum amount of detail and add it to the listing to use it in Food Oasis.

If we had encountered a good source that was reliable and up-to-date and had enough of the fields we needed to be useful, then the next steps would have been to work out a process to automate the above steps to re-import the data on a regular basis, but we never found a source worth importing.

IMO, it is worthwhile to keep looking for a definitive data source. If one can be found, then we should explore what the best process is for obtaining the data and merging it with the Food Oasis data.

GigiUxR commented 1 year ago

@entrotech @staceyrebekahscott Yes, I like the idea of identifying a definitive data source -- this will reduce the effort required for web scraping different sites.

To build off this idea, I read on a random website that: food pantries are nonprofit organizations that must abide by state and federal regulations. USDA requires agencies such as the Department of Human Services to regularly evaluate food pantries but this can vary state to state.

Therefore, somewhere there is a database of food pantries and possibly more. Here are some questions that come to mind:

entrotech commented 1 year ago

Starting at the top, much of the food distributed by pantries and meal programs is sourced from the USDA's Food and Nutrition Service (https://www.fns.usda.gov/). There are several programs that they administer, including TEFAP (The Emergency Food Assistance Program). Most nutrition assistance programs funded by FNS are administered at the state, territory, tribal, or local levels. Choosing California from the drop-down on their home page takes you to a page you can use to search by state and program. Using this, you can find the following contact info for the California TEFAP program: https://www.fns.usda.gov/fns-contacts?f%5B0%5D=fns_contact_state%3A286&f%5B1%5D=fns_contact_related_programs%3A27

Which leads to the California Department of Social Services Contacts at https://www.cdss.ca.gov/inforesources/fdu

Which leads to the list of TEFAP Providers here: https://www.cdss.ca.gov/inforesources/efap/stakeholders, which lists the LA County providers as

We have tried working with the LARFB, with very limited success as I mentioned in my last comment, but we could try building a better relationship with them to see if we can persuade them to share their data. They have a food finder page on their site (https://www.lafoodbank.org/find-food/pantry-locator/), but it isn't as good as ours, so it could be mutually beneficial if we were to provide our widget for their page, and cooperate on keeping the listings up to date. FWIW, I volunteered once for LARFB and spent a few hours gleaning onions.

I thought LA Regional Food Bank was the sole provider of TEFAP food in LA County, so it was interesting to find the Food Bank of Southern California. To my knowledge, we have not contacted or tried to work with the Food Bank of Southern California.
Though they do not have a listing of their outlets (they call them agencies) on their web site, we should probably try to contact Food Bank of Southern California and see if they might be willing to share information about their agencies with us, and offer to provide our widget for their site.

GigiUxR commented 1 year ago

More questions:

staceyrebekahscott commented 1 year ago

@entrotech Thank you for all this terrific information.

@GigiUxR I would like to incorporate this into the data validation process project that has been discussed, but I am not yet ready to start that planning process. I am moving this into the Prioritized Backlog for now. I intend to get started on that planning in the next few weeks, and at that point I would very much like to continue working with you and the UX Research team on this.

staceyrebekahscott commented 1 year ago

Given the success we have had recently working with the UCLA students for data validation, I am moving this to the Icebox.

This will likely come up as one of our medium term priorities as we work to achieve our metric of being the most comprehensive food pantry listing site in Los Angeles. This will require us to verify against a definitive data source, to make sure we are including all of the services out there.

We also need to make sure we include a source for farmers markets. We had discussed including farmers markets as they take food benefits and it could be a way to differentiate FOLA from other food pantry directory sites.