Ban integrations from core that rely on web scraping

balloob commented 5 years ago

Context

Webscraping is when we use code to mimic a user and log in to a website and get data in Home Assistant. This is usually needed because certain data sources/integrations do not offer an API.

Webscraping comes with the following downsides:

Very fragile, break often. When the website is updated, the integration will need to be updated.
Some vendors (like USPS) have IP banned users of such integrations
Some rely on beautifulsoup (Python-based), others are relying on PhantomJS or other headless browsers, meaning we need to include a whole browser.

Proposal

We no longer accept any new integration that relies on webscraping.
We identify, deprecate for 2 releases and remove integrations that rely on webscraping
It will still be possible to have custom integrations provide information via webscraping

Consequences

Integrations that rely on webscraping will have to be maintained as custom integrations.

amelchio commented 5 years ago

Does this include integrations that use a reverse-engineered AJAX/WebSockets API?

How about integrations connecting to local devices, like admin pages on routers?

ties commented 5 years ago

"It will still be possible to have custom integrations provide information via web scraping"

Will this change be implemented before or after there is an official mechanism for custom integrations?

As far I'm aware, there is no mechanism to manager (i.e keep up to date0 custom integrations?

On Thu, 20 Jun 2019, 22:53 Anders Melchiorsen, notifications@github.com wrote:

Does this include integrations that use a reverse-engineered AJAX/WebSockets API?

How about integrations connecting to local devices, like admin pages on routers?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/home-assistant/architecture/issues/252?email_source=notifications&email_token=AABQTESB3YZZ4YPPRM34Z4DP3PU4ZA5CNFSM4HZ52FJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYGSRNY#issuecomment-504178871, or mute the thread https://github.com/notifications/unsubscribe-auth/AABQTES6G6LRLAN64VJM4A3P3PU4ZANCNFSM4HZ52FJQ .

frenck commented 5 years ago

I could not agree more.

@amelchio Is hitting a thin line there. My proposal would be any integration that scrapes information:

from an HTML-based website
needs regular expressions, dom parsing or css filters to get specific values

@ties There is already a mechanism for custom integrations (as in, you can add those to your install). Secondly, there is a community effort called HACS, which provides update possibilities for custom components.

balloob commented 5 years ago

@frenck I agree with that proposal.

It's not the goal to restrict using undocumented APIs. Just purely webscraping ones.

balloob commented 5 years ago

Example of code I would want to remove

https://github.com/home-assistant/home-assistant/blob/0bdbf007b256cc7c5bfddc474d16b825d62652ac/homeassistant/components/sytadin/sensor.py#L123-L127

ties commented 5 years ago

@frenck i am aware of that. I would just be hesitant to depreciate any integration that uses an unsupported API before there is a supported mechanism to maintain third party components.

Sec, since it is python, we would always 'support' (aka allow the possibility for) third party integrations

Swamp-Ig commented 5 years ago

Agreed, maintenance issues with these.

Can we be more explicit and also ban stuff that just hides their web scraping in an external module? IDK if they're around, but if they are they're just as problematic.

frenck commented 5 years ago

@Swamp-Ig IMHO, yes! @ties This is not about unsupported API's.

balloob commented 5 years ago

@Swamp-Ig we should not allow such integrations either. Besides auditing, we could achieve that by adding extra package constraints during linting that would break installation of common webscraping utilities.

Swamp-Ig commented 5 years ago

If someone was feeling up to it they could potentially create a webscrape sensor, in the style of rest or template sensors, that allows users to set them up externally and maintain them themselves.

rohankapoorcom commented 5 years ago

@swamp-Ig - it feels better to me if there's no support in Home Assistant core for web scraping. I don't think we want to support situations in which people used a new scraper sensor and got banned, etc.

balloob commented 5 years ago

@Swamp-Ig https://www.home-assistant.io/components/scrape/ 🙄 it would be removed as part of this issue.

Swamp-Ig commented 5 years ago

Didn't know about that one! Kick it out to a custom component?

Depends on what the goals are really for this issue, is it to reduce core maintenance, or is it to avoid getting banned for scraping? If the goal is the first one, then keeping it makes sense, if not then the second makes sense. Non-generic scrapers make no sense from either point of view.

It's not something I use personally so I have no skin in the game, but I could definitely see an argument to keep it, that way it's ultimately not a core responsibility to keep the actual web scraper working, but there's still the option there if someone finds that helpful and they're willing to keep their own user config maintained and working.

If it did get put into a custom component though then we'd still need someone to adopt it as a custom component.

pvizeli commented 5 years ago

@balloob not sure about https://www.home-assistant.io/components/scrape/

So yeah, you can easy use rest sensor and handle that with templates. But that tool is a simple wrapper to get details from a website because a website is a also an interface.

I agree to remove others are relying on PhantomJS or other headless browsers, meaning we need to include a whole browser. because that kind of stuff is wired and also integration they use that because of maintaining and changed website. scrape is a generic helper and user are self care about maintaining his config.

So please don't remove scrape.

balloob commented 5 years ago

Ok, we can keep scrape

frog32 commented 5 years ago

We could even go as far as providing a full blown headless browser version of scrape to enable users to do even more. I think what this is about is to prevent external website owners from breaking users which requires homeassistant to deliver a fix.

As example: I am using scrape to get the date of the next paper recycling collection from my municipality. This is very useful to me but if they change their website it will only affect me as a user and I have to fix my scrape sensor, no need for homeassistant to ship a bugfix release.

pvizeli commented 5 years ago

@frog32, yes that what I mean. A generic wrapper is the maintaining of the users problem. A components/library they use that to get data should be a custom component and not a core integration.

balloob commented 5 years ago

Added as ADR-004 in #253

amelchio commented 5 years ago

I am not sure how to handle this when the issue is already closed but apparently there is room for misunderstandings in ADR-0004.

I asked above "How about integrations connecting to local devices, like admin pages on routers?" and the following discussion talked only about "websites" so I assumed this decision did not relate to local devices. This is how I think it should be so I did not comment any further.

However, now @frenck has clarified that he meant "website" to also include local devices.

Scraping local device HTML is about as fragile as relying on JSON in an unofficial API on that same device so I don't see why we would disallow the first combination but allow the second. None of the downsides listed in Context seem to apply for local devices.

To be clear, I am not trying to reverse this decision but I wonder if I am the only one confused about what was actually decided?

frenck commented 5 years ago

"None of the downsides listed in Context seem to apply for local devices."

Is it?

This listed downside applies:

"Some rely on beautifulsoup (Python-based), others are relying on PhantomJS or other headless browsers, meaning we need to include a whole browser."

Definition of "website" (source: Wikipedia):

A website[1] or web site[2] is a collection of related network web resources, such as web pages, multimedia content, which are typically identified with a common domain name, and published on at least one web server. Notable examples are wikipedia.org, google.com, and amazon.com.

Websites can be accessed via a public Internet Protocol (IP) network, such as the Internet, or a private local area network (LAN), by a uniform resource locator (URL) that identifies the site.

From the accepted proposal:

"We no longer accept any new integration that relies on webscraping".

Which does not even limit to local, public or even "website".

IMHO, when it comes to web scraping, it should not matter if it is local or not. The "horror" applies to both.

amelchio commented 5 years ago

Well, we have established empirically that "website" is ambiguous and thus we should use more precise wording, not try to win a definition game. At least, ADR-0004 should be updated to resolve this ambiguity.

Likewise, maybe you can explain "horror"? For local devices, we have this method of getting an access token. We have UDP-based, reverse-engineered, binary protocols. We parse undocumented XML. How does HTML stand out as particularly bad?

In my understanding, this issue was not about HTML as such but about avoiding the problems that are due to other people controlling the HTML delivery (modifying it at will, IP banning, ...).

I agree that actually rendering the HTML is too much and I am fine with banning PhantomJS.

ties commented 5 years ago

Well, we have established empirically that "website" is ambiguous and thus we should use more precise wording, not try to win a definition game. At least, ADR-0004 should be updated to resolve this ambiguity.

There is definitely ambiguity there. In my opinion most (undocumented, REST) API's fall outside of the definitions per the new interpretation/word game.

In my understanding, this issue was not about HTML as such but about avoiding the problems that are due to other people controlling the HTML delivery (modifying it at will, IP banning, ...).

I agree that actually rendering the HTML is too much and I am fine with banning PhantomJS.

I think that you have the core of the issue there. It is about control over delivery. For devices where the user has physical control and/or controls what firmware runs on the device, scraping is equivalent to using an undocumented API [with HTML being very close to XML], with cookies as a mechanism for short lived access tokens.

pvizeli commented 5 years ago

I close the comments because the ADR is merged and in use. Website change to often and most of the components are not well maintained. API based on SOAP/XML/JSON are mostly also used by other products and can't fast change without new generation of products, unlike a website. You can use some generic webscrable components and update the config if they not work because they update the website again or use custom components.

home-assistant / architecture