Parse all websites into similar format

horinezachary / goose-database

This repository holds the frontend code and supporting code for the Goose Database Recipe Repository.

https://goose.horine.dev


Parse all websites into similar format #18

Closed cryarr closed 4 years ago

cryarr commented 4 years ago

Issues are linked here. Current format: "author", "title", "ingredients", "yield", "Directions", "Url".

Total time will come after yield when it is available.

cryarr commented 4 years ago

jemisonf commented 4 years ago

@cryarr the database also has a source column. Would it be possible to include that as well, or should I just leave it blank?

jemisonf commented 4 years ago

Same for cook_time and prep_time, which are also database columns.

jemisonf commented 4 years ago

A more database structure friendly version of this would be:

{
  "author": <author>,
  "title": <title>,
  "source": <source>,
  "ingredients": [
    {
      "name": <name>,
      "amount": <amount>,
      "measurement": <measurement>
    },
    {...}
  ],
  "directions": [ "do X", "do Y", ... ]
}

Some of these are kind of hard to do, but at minimum if it was possible to have directions split out into an array that would be really helpful.
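For illustration, here is a minimal Python sketch of how a record in the shape proposed above could be modelled and serialized. The class names `Ingredient` and `Recipe`, the helper `recipe_to_json`, and the sample values are all hypothetical; only the field names come from the proposal, and nothing here is taken from the actual scraper or backend.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Ingredient:
    # Field names mirror the proposed format; types are an assumption.
    name: str
    amount: str = ""
    measurement: str = ""


@dataclass
class Recipe:
    author: str
    title: str
    source: str
    ingredients: List[Ingredient] = field(default_factory=list)
    directions: List[str] = field(default_factory=list)


def recipe_to_json(recipe: Recipe) -> str:
    """Serialize a Recipe (including nested ingredients) to a JSON string."""
    return json.dumps(asdict(recipe), indent=2)


# Made-up sample data, only to show the resulting shape.
example = Recipe(
    author="Jane Doe",
    title="Plain Pancakes",
    source="example-site",
    ingredients=[Ingredient(name="flour", amount="2", measurement="cups")],
    directions=["Mix the batter.", "Cook on a hot griddle."],
)
example_json = recipe_to_json(example)
```

Keeping `ingredients` as a list of small objects and `directions` as a list of strings maps directly onto the arrays in the proposal, so the backend can iterate over them without re-parsing text.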

cryarr commented 4 years ago

Cook time and prep time should be left blank; I just forgot to include those. They aren't actually available for Epicurious, sadly, but for other sites they are. And yes, I can include a source column.

I can attempt to make directions into an array, but the ingredients would have to be completely parsed out, which would require a lot more work with natural language processing. @horinezachary Has something like this format for the ingredients worked with your natural language processor?

jemisonf commented 4 years ago

If we could use something like this for the ingredients then I could handle that on the backend. I'd be OK with even a really simple heuristic for the directions like breaking on periods or in between <p> blocks. Might be something that could be customized on a per site basis.
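A rough sketch of the heuristic described here, in Python: use `<p>` blocks when the scraped text contains them, otherwise break on periods. The function name `split_directions` and the helper class are hypothetical, and this is not the scraper's actual code. The period rule only splits when the period is followed by whitespace, so a value like "1.5" stays intact, but abbreviations such as "approx." would still be split incorrectly.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional


class _ParagraphCollector(HTMLParser):
    """Collects the text content of each <p>...</p> block."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._buffer: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace inside the paragraph.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def split_directions(raw: str) -> List[str]:
    """Split raw directions into steps: by <p> blocks if present, else on periods."""
    if "<p" in raw.lower():
        collector = _ParagraphCollector()
        collector.feed(raw)
        collector.close()
        if collector.paragraphs:
            return collector.paragraphs
    # Fallback: split after a period that is followed by whitespace.
    parts = re.split(r"(?<=\.)\s+", raw.strip())
    return [part.strip() for part in parts if part.strip()]
```

A per-site customization could be as simple as choosing which of the two branches to use for each source.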

cryarr commented 4 years ago

Sounds good. I'll try to break it up between periods and see what that yields.

cryarr commented 4 years ago

I added some more structure.

jemisonf commented 4 years ago

Looks great. I think yield is not in the database, but it doesn't hurt to leave it in. Let me know how splitting up the instructions goes. We're going to have to figure out whether we need to handle the Unicode characters in the ingredient lines separately as well.
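One possible way to handle Unicode characters in ingredient lines, sketched in Python. The thread does not say which characters are involved; this assumes they are things like single-character fractions (½, ¼) and non-breaking spaces, and the function name `normalize_ingredient_line` is hypothetical. It relies on `unicodedata.normalize` with the NFKC form, which expands such a fraction to digits separated by the fraction slash (U+2044).

```python
import unicodedata

FRACTION_SLASH = "\u2044"


def normalize_ingredient_line(line: str) -> str:
    """Replace compatibility characters (e.g. '½') with plain ASCII-style text."""
    pieces = []
    for ch in line:
        expanded = unicodedata.normalize("NFKC", ch)
        if FRACTION_SLASH in expanded:
            # Keep "1½" readable as "1 1/2" instead of "11/2".
            if pieces and pieces[-1][-1:].isdigit():
                pieces.append(" ")
            pieces.append(expanded.replace(FRACTION_SLASH, "/"))
        else:
            pieces.append(expanded)
    return "".join(pieces)
```

Normalizing before any parsing step would let the same amount-handling logic work for both "1/2 cup" and "½ cup".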

Looking at the ingredient parser model I think it's pretty conceivable that we could spin that up and put it behind an API as well. I have the docker image running on my machine atm and I'll report how it goes.

cryarr commented 4 years ago

Okay, I got that all working now, so directions are parsed into an array by splitting on periods. I'll start running the scraper so that you can get the full data set.