onetsp / RecipeParser

A PHP library for parsing structured recipe data from HTML files.
https://onetsp.com/
MIT License

Resulting recipe ingredients incomplete - importing only words from <a> links #28

Open sfatfarma opened 6 years ago

sfatfarma commented 6 years ago

Hello,

First of all, congratulations on the library, it is really awesome! I ran into an issue today with some recipes.

I am trying to parse this webpage: http://www.foodista.com/recipe/H5M86RVB/breakfast-casseroles

The preparation instructions are:

Grease a 9 x 13 inch casserole dish. Line dish with unbaked crescent rolls. Spread cooked meat evenly over rolls. Pour beaten eggs over meat. Place cheese over layer of eggs. Bake at 350 degrees for 45 minutes or until firm. Cool 10 minutes before cutting into squares and serve.

But the library returns the following in the 'Instructions' part of the response:

                        [0] => dish.
                        [1] => dish
                        [2] => rolls.
                        [3] => Bake
                        [4] => Cool
                        [5] => cutting
                        [6] => serve

These are exactly the words in the instructions that are wrapped in links.

Thank you for your time checking on this.

Regards, Szabi.

sfatfarma commented 6 years ago

I just checked on this and it seems that the issue is in the MicrodataSchema.php file.

It tries to get the instructions from the HTML file using this XPath query:

$nodes = $xpath->query('//*[@itemprop="recipeInstructions"]/*');

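The trailing /* selects only the child elements of the instructions container, not its text. On this page the only child elements appear to be the links, so the query matches only the links in the recipe's instructions. I solved this by modifying your code like this: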

    if (!$found) {
        $nodes = $xpath->query('//*[@itemprop="recipeInstructions"]/*');
        if ($nodes->length) {
            // Check whether every matched node is an <a> element.
            $only_a = true;
            for ($i = 0; $i < $nodes->length; $i++) {
                if ($nodes->item($i)->tagName != 'a') {
                    $only_a = false;
                }
            }
            if ($only_a == false) {
                RecipeParser_Text::parseInstructionsFromNodes($nodes, $recipe);
                $found = true;
                // Recipe.com gets caught up in here, but doesn't have well-formed nodes wrapping each ingredient.
            }
        }
    }

With this change, the parser checks whether it found only links. In that unfortunate case it leaves $found as false, so parsing continues instead of returning the link words. :)
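For anyone who wants to reproduce the behavior outside the library, here is a minimal standalone sketch using PHP's DOM extension. The HTML snippet is hypothetical, only modeled on the Foodista page (I have not copied its real markup), and onlyAnchors() is an illustrative helper name, not a RecipeParser function. It shows that the /* query returns just the link words, and that reading the container's textContent is one possible way to recover the full text when only links are found.

```php
<?php
// Hypothetical markup: plain instruction text with a few words wrapped in links.
$html = <<<HTML
<html><body>
<div itemprop="recipeInstructions">Grease a 9 x 13 inch casserole <a href="#">dish.</a>
Line <a href="#">dish</a> with unbaked crescent <a href="#">rolls.</a></div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Illustrative helper (an assumption, not part of RecipeParser):
// true when the list is non-empty and every node is an <a> element.
function onlyAnchors(DOMNodeList $nodes): bool
{
    foreach ($nodes as $node) {
        if (strtolower($node->nodeName) !== 'a') {
            return false;
        }
    }
    return $nodes->length > 0;
}

// The trailing /* selects child *elements* only; the text between links is skipped.
$children = $xpath->query('//*[@itemprop="recipeInstructions"]/*');
$linkWords = [];
foreach ($children as $node) {
    $linkWords[] = trim($node->textContent);
}

// Possible fallback: take the text of the container itself and collapse whitespace.
$fullText = null;
if (onlyAnchors($children)) {
    $container = $xpath->query('//*[@itemprop="recipeInstructions"]')->item(0);
    $fullText = preg_replace('/\s+/', ' ', trim($container->textContent));
}

print_r($linkWords);
echo $fullText, "\n";
```

On this snippet, $linkWords holds only "dish.", "dish" and "rolls.", while $fullText holds both complete sentences. Whether a textContent fallback fits the library's other parsing paths is something the maintainers would know better than I do.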

If you like this solution, feel free to include it in your code.

Please let me know your opinion on this.

Regards.