home-assistant / architecture

Repo to discuss Home Assistant architecture

Webscraping for local devices #270

Closed amelchio closed 5 years ago

amelchio commented 5 years ago

Context

In ADR-0004 it was decided that webscraping will not be allowed, with webscraping being defined like this:

Webscraping is when we use code to mimic a user and log in to a website and get data in Home Assistant.

However, that definition has been found to be quite ambiguous:

  1. It does not distinguish between data formats. We have actually now banned structured data formats like JSON, which was apparently not the intention.
  2. The definition seems to only apply to websites with authentication.
  3. It is unclear what "a website" means, in particular whether local appliances with an embedded webserver (such as routers and printers) are included.

The first two points just look like oversights (or maybe even nitpicking) but the third point has caused real confusion. This new issue is to continue the decision process since the original issue #252 has already been closed.

Proposal

Consequences

ADR-0004 becomes less ambiguous:

  * Unauthenticated access is included in the ban.
  * Data exchange formats are excluded from the ban.
  * Local devices are explicitly exempted from the ban.

amelchio commented 5 years ago

So my proposal is to allow scraping of local devices. I think local devices generally do not have the downsides that the original issue wanted to counter.

Firmware is rarely updated significantly once a device has shipped, so the HTML structure does not change the way a public website does. When it does, fixing it is easier than fixing a reverse-engineered binary protocol.

Banning such integrations seems to throw the baby out with the bathwater.

frenck commented 5 years ago

I don't see the issue with the "banning". But since clarification is important: "banning" is an inappropriate word to use.

We don't block these integrations; we currently just do not allow them to be part of the core repository. However, you can still use and distribute them as a custom integration.

pvizeli commented 5 years ago

Firmware is rarely updated significantly once a device has shipped, so the HTML structure does not change the way a public website does

What should we do if 50% of our users have the new firmware and the other 50% do not? I guess the library needs to handle that, not HA, but that will be difficult. With a custom component, users can run whatever they need to parse their website.

Custom components are not bad, and we should not pack everything into the core integration repository.

ties commented 5 years ago

It comes down to balancing user value and maintenance. Compared to other domotics products, Home Assistant tends to go for user value.

Using the line of reasoning by @pvizeli and @Swamp-Ig, proposed text:

Many integrations are not actively maintained. Websites change often and are outside the control of users. Integrations break due to changes in websites, or when users get banned for making too many requests. Both of these problems lead to issues being created on HASS core.

(Non-public) APIs based on SOAP/XML/JSON/... are commonly used by other products, and cannot be changed often without accompanying changes in the products using the API. That leads to a situation where integrations scraping websites break much more often than integrations consuming APIs. In contrast, integrations that scrape the interface of an embedded device on the local network, where the firmware is not often updated (e.g., CPE) and/or which is under the user's physical control, are stable.

The main goal of this ADR is to reduce the number of broken integrations in HASS core, which often lead to issues that must be fixed by core developers when the original author is not actively maintaining the integration.

This ADR moves toward this goal by banning web scraping in situations where the consumed endpoint is not under the user's physical control (i.e., on the local network), and by requiring components that scrape local devices to be actively maintained: they must have a maintainer and contain the scraping code in a Python package, starting from the first release in October.

raccettura commented 5 years ago

Firmware is rarely updated significantly once a device has shipped, so the HTML structure does not change the way a public website does

What should we do if 50% of our users have the new firmware and the other 50% do not? I guess the library needs to handle that, not HA, but that will be difficult. With a custom component, users can run whatever they need to parse their website.

Couldn't you make this argument for any API in general? I've dealt with many APIs that have broken in some way over the years.

I'd suggest that something under local control is a different use case from scraping a website that is subject to random updates. In that case, a third-party API (or anything run by Google) is more likely to change without notice or be deprecated.

frenck commented 5 years ago

I personally have big troubles understanding this whole discussion.

ADR-0004 is about disallowing integrations in our core codebase that use webscraping, as defined above.

ADR-0004 was especially aimed at the "website" part of that definition.

☝️ The above still stands for the local interfaces that are now being discussed.

amelchio commented 5 years ago

@frenck Actually, ADR-0004 makes no such distinction. It is about being very fragile and we are making an argument that this does not apply to scraping local devices. Do you believe scraping of local devices to be very fragile? (It sounds like you are mostly opposed because it is dirty in some theoretical way.)

What should we do if 50% of our user have the new firmware and other 50% not?

@pvizeli I think that is a valid concern but probably more relevant in e.g. #266. Your argument seems to be about maintenance and distribution, not really about webscraping.

I do agree that better supported custom components could make this discussion moot. However, we are not there yet and being left out of core currently is a penalty. We should provide good reasons for imposing that penalty on contributors.

arsaboo commented 5 years ago

I am generally against blanket rules such as disallowing any web scraping. Many APIs are more prone to breaking than scraped content. Unless a website explicitly blocks scraping and things break too often, I say we should allow them.

balloob commented 5 years ago

I'm fine with updating ADR-0004 for points 1 and 2.

I think that we should still not allow web scraping of local devices as part of built-in integrations. A local webserver still serves websites.

amelchio commented 5 years ago

A local webserver still serves websites.

Do you believe scraping of local devices to be very fragile?

balloob commented 5 years ago

Not very, but still fragile. However, webscraping is also done with huge packages like headless Chrome or PhantomJS, causing long installation times and frequent breakage.

amelchio commented 5 years ago

Not very, but still fragile.

So like most integrations based on unofficial methods.

However, webscraping is also done with huge packages like headless Chrome or PhantomJS, causing long installation times and frequent breakage.

Not according to what’s proposed here 🙄

Maybe someone can link to a couple of relevant issues that we would like to avoid?

amelchio commented 5 years ago

Maybe someone can link to a couple of relevant issues that we would like to avoid?

Okay, I know you are probably thinking "dude, just accept that we have decided against this".

But to recap the ADR blog, the reason that we record architecture decisions is to document why we do certain things. The entire point is that we should not blindly accept previous decisions.

So I reiterate: ADR-0004 does not currently explain why we would disallow webscraping of local devices.

Likewise, the comments in this issue just do not focus on the issue at hand: scraping local device HTML.

So what do we put into ADR-0004 that can explain this decision to newcomers?

balloob commented 5 years ago

I think that we should not distinguish between local or remote devices and/or services. Although some local devices/services are rarely updated, some might change frequently, especially if it's a server application (like Home Assistant) and not a physical device.

Web scraping comes in different shapes and approaches: PhantomJS, headless Chrome, BeautifulSoup, and plain old regexes on the text. Headless Chrome requires large amounts of disk space, memory, and CPU. PhantomJS is similar, but is also no longer maintained. BeautifulSoup can break if the selectors for DOM elements no longer match, and regexes can break if a single character in the response changes.
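To make the fragility argument concrete, here is a minimal stdlib-only sketch of the regex flavour of scraping. The HTML snippets and the `read_temperature` helper are hypothetical, not from any real device or integration; the point is only that a one-word markup change is enough to break the extraction.

```python
import re
from typing import Optional

# Hypothetical status-page fragment from a local device's embedded webserver.
OLD_HTML = '<td class="temp">21.5</td>'
# The same fragment after a (hypothetical) minor firmware update renames the class.
NEW_HTML = '<td class="temperature">21.5</td>'

TEMP_RE = re.compile(r'<td class="temp">([\d.]+)</td>')

def read_temperature(html: str) -> Optional[float]:
    """Extract the temperature, or return None when the markup no longer matches."""
    match = TEMP_RE.search(html)
    return float(match.group(1)) if match else None
```

With the old markup the helper returns `21.5`; with the new markup it returns `None`, which is exactly the class of silent breakage the comments above are worried about.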

amelchio commented 5 years ago

Sorry, that is all correct, but we should not just explain that HTML scraping can break; by that reasoning we should disallow everything.

Because things break everywhere. We just pushed a breaking change in a point release because a documented API is suddenly closing down.

The Harmony websocket fiasco. The Nest API shutting down. IKEA reaching out about Trådfri. Breakage just happens, HTML or not.

A local device is comparatively stable in that regard. Of course, local device integrations can also break with updates; I have fixed several of those myself. No HTML was involved, though one was a regex fix anyway.

So I still don’t see any evidence that scraping local device HTML is particularly prone to break. And I think it is a shame to disallow it for no reason.

As for this ticket, I will accept that I was unable to explain my stance in a way that won anyone over. I will leave the issue open for two days in case others want to comment and then I will just close it.

pvizeli commented 5 years ago

We are talking about the requirements for being shipped with HA, not the requirements for being an integration at all. Such an integration can be maintained outside the core repository as a custom component and have a long life, for example through a community project like HACS.

dgomes commented 5 years ago

I support @amelchio's position. I really think we should distinguish a complex web scraping integration that requires god only knows how much boilerplate from simple and/or stable machine-readable interfaces. An HTML page and an XML/JSON API for a local device are really not very different, yet we ban the first and accept the second.
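The "not very different" claim can be illustrated with a small stdlib-only sketch. Both payloads, the element id, and the helper names below are hypothetical; the point is that for a simple local device page, extracting a value from HTML with `html.parser` takes roughly the same amount of code as extracting it from JSON.

```python
import json
from html.parser import HTMLParser

# Hypothetical payloads a local device might serve for the same reading.
JSON_BODY = '{"uptime_days": 12}'
HTML_BODY = '<html><body><span id="uptime">12</span> days</body></html>'

def uptime_from_json(body: str) -> int:
    """Read the uptime from a (hypothetical) JSON endpoint."""
    return int(json.loads(body)["uptime_days"])

class _UptimeParser(HTMLParser):
    """Collect the text inside the element with id="uptime"."""

    def __init__(self) -> None:
        super().__init__()
        self._in_uptime = False
        self.value: int | None = None

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "uptime":
            self._in_uptime = True

    def handle_data(self, data):
        if self._in_uptime:
            self.value = int(data)
            self._in_uptime = False

def uptime_from_html(body: str) -> int:
    """Read the same uptime by scraping a (hypothetical) HTML status page."""
    parser = _UptimeParser()
    parser.feed(body)
    return parser.value
```

Both helpers return `12` for their respective payloads; whether one is markedly more fragile than the other depends on how often the device changes its output, which is the crux of this thread.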

ties commented 5 years ago

A local device is comparatively stable in that regard. Of course, local device integrations can also break with updates; I have fixed several of those myself. No HTML was involved, though one was a regex fix anyway.

So I still don’t see any evidence that scraping local device HTML is particularly prone to break. And I think it is a shame to disallow it for no reason.

While the discussion on banning or allowing components that scrape is a loaded one, I see a consensus here when it comes to clarifying and formalizing what makes an integration stable.

I think there is value in formalizing this. Classifying the stability of the interface used is orthogonal to the classification of device connections as documented in the blog post.

balloob commented 5 years ago

@dgomes we have never banned XML/JSON, nor was it the intention to do so. This is purely about HTML.