nextcloud / cookbook

🍲 A library for all your recipes
https://apps.nextcloud.com/apps/cookbook
GNU Affero General Public License v3.0

Generalise the HTML parsing procedure to allow for more parsing options #217

Closed: mrzapp closed this issue 2 years ago

mrzapp commented 4 years ago

In order to support more websites, a standardisation of the HTML parsing procedure should be developed.

The general idea is that, when downloading a recipe, we should loop through parsers until one succeeds. That way, new parsers can be added modularly without interfering with the main code base.
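
A rough sketch of what such a loop could look like (the RecipeParser interface, the method names, and the exception below are only illustrative, not the actual app code):

interface RecipeParser {
    /** Returns the parsed recipe, or null if this parser cannot handle the page. */
    public function parse(string $html): ?array;
}

function parseRecipeHtml(string $html, array $parsers): array {
    // Try each registered parser in order; the first one that succeeds wins.
    foreach ($parsers as $parser) {
        $recipe = $parser->parse($html);
        if ($recipe !== null) {
            return $recipe;
        }
    }
    throw new \RuntimeException('No parser was able to extract a recipe from the page');
}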

The parsers that currently need to be refactored into this new structure are:

mrzapp commented 4 years ago

@Teifun2 if you want to allow creation of site-specific parsers, I suggest you make it a subclass of the HTML parser, and design the API so new parsers can be made with as little code as possible. Ideally, I think, custom parsers could even just be a config file describing which XPath selectors to use for which fields.
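
Such a config could be as simple as a field-to-XPath map applied via DOMXPath. A minimal sketch, with made-up field names and selectors purely for illustration:

function parseWithConfig(string $html, array $config): array {
    // $config maps recipe fields to XPath selectors, e.g.
    // ['name' => '//h1[@class="recipe-title"]', 'description' => '//div[@class="recipe-intro"]/p']
    $doc = new \DOMDocument();
    @$doc->loadHTML($html); // suppress warnings caused by real-world markup
    $xpath = new \DOMXPath($doc);

    $recipe = [];
    foreach ($config as $field => $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes !== false && $nodes->length > 0) {
            $recipe[$field] = trim($nodes->item(0)->textContent);
        }
    }
    return $recipe;
}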

christianlupus commented 3 years ago

@Teifun2, @mrzapp did you do anything regarding per-website-specific parsing?

Teifun2 commented 3 years ago

I have looked into some techniques but have never actually started any code :/

christianlupus commented 3 years ago

This might be highly related to #388 as we might want to (or not) scrape on the client or the server.

mrzapp commented 3 years ago

I didn't start anything, but I propose adopting the youtube-dl model.

First, a generic scraper should be made, based on the existing code. The main method could be something like parseHtml($html), which could be overridden by other scrapers extending it. The scrapers could be in their own directory, like lib/Scraper. Upon fetching the HTML of a website, all scrapers would be "asked" if they could handle a certain domain, using a method like canHandle($domain).

Example:

class GenericScraper {
    public static function canHandle(string $domain): bool {
        return true;
    }

    public static function parseHtml(string $html): array {
        $recipe = [];

        // Parse HTML here

        return $recipe;
    }
}

class CustomScraper extends GenericScraper {
    public static function canHandle(string $domain): bool {
        return $domain === 'example.com';
    }

    public static function parseHtml(string $html): array {
        $recipe = parent::parseHtml($html);

        // Add some extra fields here

        return $recipe;
    }
}
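
On top of such classes, the dispatch itself could be a simple loop from the most specific to the most generic scraper. A sketch only, not actual app code:

function scrapeRecipe(string $url, string $html): array {
    // Cast guards against parse_url() returning null/false for odd URLs.
    $domain = (string) parse_url($url, PHP_URL_HOST);

    // Most specific scrapers first, GenericScraper as the catch-all fallback.
    $scrapers = [CustomScraper::class, GenericScraper::class];

    foreach ($scrapers as $scraper) {
        if ($scraper::canHandle($domain)) {
            return $scraper::parseHtml($html);
        }
    }

    return [];
}
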
christianlupus commented 3 years ago

@mrzapp I tried something like this in #553, if I understand your intent correctly. As soon as my dev installation is back up (a recent crash with Docker needs resolving), I will have to close some PRs and check if I can fix the NC21 installation issue reported here recently. After that, I can write some unit tests to merge #553.

The question at hand is more whether the parsing should be done on the server (drawbacks in performance and problems with login credentials) or on the client using JS (needs implementation, less control over the data). Both seem to have their benefits and drawbacks. We could also combine things by using server-side importing by default plus an optional client-side script for whatever else is needed. It is more a question of how we want to proceed with the app in the long term: should we focus on the frontend and keep the PHP backend minimal and REST-like, or should the backend be more heavyweight and the frontend simpler? For now, I will try to get the PR merged soon to have a better starting point.

mrzapp commented 3 years ago

@christianlupus I think we should definitely let the server do the parsing, seeing as there are several clients, like the Android app, and potentially more in the future, like mobile Linux devices or desktop apps.

Otherwise, each client would have to implement the parsing itself, and that would result in more code written by more people. Of course feature richness is great, but in my opinion maintainability should always come first.

christianlupus commented 3 years ago

@mrzapp yes, you are right, I did not think about other clients. However, we could add a JS parser at no additional cost and save the recipe just like any manually entered one. @TheMBeat pointed out a library, as described in #388. It would still be possible to add client-side parsers for the web frontend, but it makes sense to keep things on the server. For me it does not make much of a difference, and from my personal knowledge I prefer the PHP side.

bfritscher commented 3 years ago

Hi, just to add to the discussion, as I researched some solutions for my own needs. As stated in #388, parsing an external site from a frontend client is difficult or impossible without a backend proxy due to CORS limitations. Another possibility is a browser plugin or bookmarklet which can be injected into the recipe page and then POST the data to the API. This also works for sites which run checks to block direct HTML scraping, but it is less user-friendly, as the user must be on the external site.

Another possibility is to have a headless browser do the scraping, as in https://github.com/julianpoy/RecipeClipper (in addition, there is some development of machine learning to parse the page if schema.org data is missing). However, this requires the overhead of a headless browser on the server.

Ben Awad has an interesting article about his heuristics for parsing recipes: https://www.linkedin.com/pulse/scraping-recipe-websites-ben-awad/?articleId=6665612365909815297

Both solutions work, but they still missed or confused things in the tests I made on only a few sites.

In line with the proposed implementation of this issue, with custom parsers plus a default schema parser, there is this Python library: https://github.com/hhursev/recipe-scrapers. It heavily depends on another parsing library, so porting it would require finding an equivalent to https://github.com/scrapinghub/extruct in the PHP ecosystem.
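
For the schema.org part specifically, PHP's standard DOM classes already get quite far. A rough sketch of pulling JSON-LD blocks out of a page (not a port of extruct, just an illustration of the idea):

function extractJsonLd(string $html): array {
    $doc = new \DOMDocument();
    @$doc->loadHTML($html); // tolerate real-world markup
    $xpath = new \DOMXPath($doc);

    $blocks = [];
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $node) {
        $data = json_decode($node->textContent, true);
        if (is_array($data)) {
            $blocks[] = $data; // may contain a schema.org/Recipe object
        }
    }
    return $blocks;
}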

mrzapp commented 3 years ago

I just have one more thought on the idea of further complicating the scraping logic:

While I do find all of these efforts very interesting and useful, I don't think we've received enough issues for unsupported sites to warrant massive endeavours like machine learning and headless browsers.

All recipe sites should support schema.org, as it is a global standard, and the ones that don't should be notified and helped if necessary, so they can improve their discoverability online. I only imagined the extra scrapers being for the websites that consciously refuse to comply with web standards, even after they have been notified, which should be relatively few. Besides, if some websites are going out of their way to prevent scraping, it's probably wise to stay away from them, as it might cause legal issues, like the ones youtube-dl has faced before.

I'd rather recommend keeping the code base lean, pushing website maintainers towards better discoverability and accommodating a few edge cases than spending valuable developer time deciphering garbled markup.

But again, it's just an opinion. If anyone is willing to put in the time to develop this, and it doesn't cause a degradation of the user experience, then by all means go ahead :)

christianlupus commented 3 years ago

As we are bound by the NC core to PHP and do not have the option to start VMs/headless browsers/..., I do not see much chance of going that route. Also, we have a browser on the client side, which we could use if it were needed.

All in all, I see a continuous series of questions/issues here asking why this or that page is not compatible and why we are not willing to make it work. I agree that the standardized approach using schema.org would be beneficial.

The question is rather where site-specific parsers should be located (frontend vs. backend) if we decide to implement them. But for now, let's get the standard approach running first.

christianlupus commented 2 years ago

The main part of this issue is solved. The current structure should be extended to allow for URL-based filtering. See also the examples in this issue.