cryarr closed this 4 years ago
@cryarr the database also has a `source` column. Would it be possible to include that as well, or should I just leave it blank? Same for `cook_time` and `prep_time`, also database columns.
A version that maps more directly onto the database structure would be:
{
  "author": <author>,
  "title": <title>,
  "source": <source>,
  "ingredients": [
    {
      "name": <name>,
      "amount": <amount>,
      "measurement": <measurement>
    },
    {...}
  ],
  "directions": [ "do X", "do Y", ... ]
}
Some of these are kind of hard to do, but at minimum if it was possible to have directions split out into an array that would be really helpful.
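For illustration, here is a minimal Python sketch of a record in that shape. The class names `Ingredient` and `Recipe` and the helper `recipe_to_json` are hypothetical, not part of the scraper or the database code; the field names are the ones proposed above, and blank strings stand in for values a site does not provide.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Ingredient:
    # Hypothetical container for one parsed ingredient line.
    name: str
    amount: str = ""        # kept as text; blank when unknown
    measurement: str = ""   # e.g. "cups"; blank when unknown


@dataclass
class Recipe:
    # Field order mirrors the proposed JSON structure.
    author: str
    title: str
    source: str = ""
    ingredients: List[Ingredient] = field(default_factory=list)
    directions: List[str] = field(default_factory=list)


def recipe_to_json(recipe: Recipe) -> str:
    """Serialize a Recipe (including nested ingredients) to JSON text."""
    return json.dumps(asdict(recipe), indent=2)


if __name__ == "__main__":
    example = Recipe(
        author="<author>",
        title="<title>",
        source="<source>",
        ingredients=[Ingredient(name="flour", amount="2", measurement="cups")],
        directions=["do X", "do Y"],
    )
    print(recipe_to_json(example))
```

Keeping `amount` as text is a deliberate simplification here, since amounts in scraped ingredient lines are not always clean numbers.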
Cook time and prep time should be left blank; I just forgot to include those. They aren't actually available for Epicurious, sadly, but they are for other sites. And yes, I can include a source column.
I can attempt to make directions into an array, but the ingredients would have to be completely parsed out, which would require a lot more work with natural language processing. @horinezachary Has something like this format for the ingredients worked with your natural language processor?
If we could use something like this for the ingredients, then I could handle that on the backend. I'd be OK with even a really simple heuristic for the directions, like breaking on periods or in between `<p>` blocks. Might be something that could be customized on a per-site basis.
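A rough sketch of that heuristic in Python, for illustration only. The function name `split_directions` is hypothetical and this is not the scraper's actual code. It assumes the raw directions arrive as one string, prefers `<p>` blocks when any are present, and otherwise breaks after a period followed by whitespace, so a decimal such as "1.5" stays intact. Abbreviations like "approx. 10 minutes" would still be split, which is one reason a per-site override might be needed.

```python
import re
from html import unescape

# Contents of each <p>...</p> block (attributes on the tag are allowed).
P_BLOCK = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
# Any remaining inline tag, e.g. <b> or <br/>.
TAG = re.compile(r"<[^>]+>")
# Whitespace that directly follows a period.
AFTER_PERIOD = re.compile(r"(?<=\.)\s+")


def split_directions(raw: str) -> list:
    """Split raw directions into a list of steps using a simple heuristic."""
    blocks = P_BLOCK.findall(raw)
    pieces = blocks if blocks else AFTER_PERIOD.split(raw)
    steps = []
    for piece in pieces:
        text = unescape(TAG.sub("", piece))  # drop tags, decode entities
        text = " ".join(text.split())        # collapse runs of whitespace
        if text:
            steps.append(text)
    return steps


if __name__ == "__main__":
    print(split_directions("Preheat oven. Add 1.5 cups flour.  Bake."))
    print(split_directions("<p>Preheat oven.</p><p>Mix flour &amp; sugar.</p>"))
```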
Sounds good. I'll try to break it up on periods and see what that yields.
I added some more structure.
Looks great. `yield`, I think, is not in the database, but it doesn't hurt to leave it in. Let me know how splitting up the instructions goes. Gonna have to figure out if we need to handle the Unicode characters in the ingredient lines separately as well.
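One possible way to deal with those characters, sketched in Python under the assumption that the troublesome ones are mostly vulgar fractions such as "½". The function name `normalize_ingredient_line` is hypothetical. It relies on the standard `unicodedata` module: fraction characters carry a `<fraction>` compatibility decomposition, and NFKC normalization expands them to digits joined by U+2044 FRACTION SLASH, which is then swapped for a plain "/". All other characters, including accented letters, are left untouched.

```python
import unicodedata


def normalize_ingredient_line(line: str) -> str:
    """Rewrite Unicode vulgar fractions as ASCII, e.g. '1½ cups' -> '1 1/2 cups'."""
    out = []
    for ch in line:
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            # NFKC turns "½" into "1", U+2044, "2"; use a plain slash instead.
            ascii_fraction = unicodedata.normalize("NFKC", ch).replace("\u2044", "/")
            # Keep a mixed number readable: "1½" becomes "1 1/2", not "11/2".
            if out and out[-1].isdigit():
                out.append(" ")
            out.append(ascii_fraction)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(normalize_ingredient_line("1½ cups crème fraîche"))
    print(normalize_ingredient_line("¾ teaspoon salt"))
```

Whether this should run in the scraper or on the backend is left open; the sketch only shows that the conversion can be done without a lookup table.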
Looking at the ingredient parser model I think it's pretty conceivable that we could spin that up and put it behind an API as well. I have the docker image running on my machine atm and I'll report how it goes.
Okay, I got that all working now, so directions are parsed into an array by splitting on periods. I'll start running the scraper so that you can get the full data set.
Issues are linked here. Current format: "author", "title", "ingredients", "yield", "Directions", "Url".
Total time will come after yield when available.