home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Scrape stopped working for me #124358

Closed sammyke007 closed 2 months ago

sammyke007 commented 2 months ago

The problem

For a couple of weeks now, my Scrape has stopped working. I didn't change anything and, AFAIK, the scraped website didn't change either. If I scrape using e.g. Webscraper (Chrome), it still shows me the requested data.

The page I'm scraping with GET: https://www.tijd.be/customers/mediafin.be/funds_tijd/1423098/Fund/60052461/

I'm scraping these 2 elements:

Price:
Selector = #container > header.clearfix.header-stats > div:nth-child(1) > span > span
Value template = {{ value|replace(".","")|replace(",",".")|float(0) }}

Date:
Selector = #container > header.clearfix.header-stats > div:nth-child(1) > label
Value template = {{ as_timestamp(strptime((value|regex_findall_index(find='([0-9]+/[0-9]+/[0-9]+)',index=0, ignorecase=False)), "%d/%m/%Y")) | timestamp_custom('%d-%m-%Y') }}
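
As a rough illustration of what these two value templates compute, here is a minimal Python sketch. It is not Home Assistant code: `parse_price` and `parse_date` are hypothetical names, the sample strings in the usage note are invented (the real page text is not shown in this thread), and time zone handling in `timestamp_custom` is ignored. The sketch also returns a fallback when the input does not contain the expected text, which the date template does not do.

```python
import re
from datetime import datetime


def parse_price(value, default=0.0):
    """Rough equivalent of: replace(".","") | replace(",",".") | float(0)."""
    # Drop thousands separators, then turn the decimal comma into a dot.
    text = str(value).replace(".", "").replace(",", ".")
    try:
        return float(text)
    except ValueError:
        # Mirrors the default given to float(0) in the template.
        return default


def parse_date(value):
    """Extract the first dd/mm/yyyy date and reformat it as dd-mm-yyyy.

    Returns None when no date is present (an assumption of this sketch;
    the template in the issue has no such fallback).
    """
    matches = re.findall(r"([0-9]+/[0-9]+/[0-9]+)", str(value))
    if not matches:
        return None
    return datetime.strptime(matches[0], "%d/%m/%Y").strftime("%d-%m-%Y")
```

With the invented inputs `"1.234,56"` and `"Koers 21/08/2024"`, `parse_price` gives `1234.56` and `parse_date` gives `"21-08-2024"`. When the selector matches nothing and the value is `None`, `parse_price` falls back to `0.0` and `parse_date` returns `None`.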

What version of Home Assistant Core has the issue?

core-2024.8.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Scrape

Link to integration documentation on our website

https://www.home-assistant.io/integrations/scrape

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: homeassistant.components.scrape.sensor
Source: components/scrape/sensor.py:189
Integration: Scrape ([documentation](https://www.home-assistant.io/integrations/scrape), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+scrape%22))
First occurred: 15:08:41 (2 occurrences)
Last logged: 15:08:41

Index '0' not found in sensor.niw_25
Index '0' not found in sensor.date_25

AND

Logger: homeassistant.components.sensor
Source: helpers/entity_platform.py:598
Integration: Sensor ([documentation](https://www.home-assistant.io/integrations/sensor), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+sensor%22))
First occurred: 15:08:41 (1 occurrence)
Last logged: 15:08:41

Error adding entity sensor.date_25 for domain sensor with platform scrape
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 598, in _async_add_entities
    await coro
  File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 912, in _async_add_entity
    await entity.add_to_platform_finish()
  File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 1365, in add_to_platform_finish
    await self.async_added_to_hass()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 202, in async_added_to_hass
    self._async_update_from_rest_data()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 210, in _async_update_from_rest_data
    value = template.async_render_with_possible_json_value(value, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 771, in async_render_with_possible_json_value
    render_result = _render_with_context(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2638, in _render_with_context
    return template.render(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2355, in regex_findall_index
    return regex_findall(value, find, ignorecase)[index]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range
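
The last frame of the traceback indexes the result of a regex search directly. A minimal Python sketch of that failure mode follows. The function below is a simplified stand-in, not the Home Assistant implementation, and it assumes the missing element reaches the template as the string `'None'`, as the `float` error in the debug log further down suggests. The sample label text is invented.

```python
import re

DATE_PATTERN = r"([0-9]+/[0-9]+/[0-9]+)"


def regex_findall_index(value, find, index=0):
    """Simplified stand-in for the filter named in the traceback."""
    # re.findall returns an empty list when nothing matches,
    # so indexing it raises IndexError.
    return re.findall(find, str(value))[index]


# Invented label text containing a date: the lookup succeeds.
print(regex_findall_index("Koers 21/08/2024", DATE_PATTERN))

# No element was found, so the value is None and there is no date to match.
try:
    regex_findall_index(None, DATE_PATTERN)
except IndexError as err:
    print(f"IndexError: {err}")
```

The second call fails in the same way as the template: the pattern finds nothing in `'None'`, and asking for index 0 of an empty list raises `IndexError`.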

Additional information

home-assistant[bot] commented 2 months ago

Hey there @fabaff, @gjohansson-st, mind taking a look at this issue as it has been labeled with an integration (scrape) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `scrape` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign scrape` Removes the current integration label and assignees on the issue, add the integration domain after the command.
- `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
- `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


scrape documentation scrape source (message by IssueLinks)

gjohansson-ST commented 2 months ago

Could you enable debug logging so it prints what it actually gets from the webpage?

sammyke007 commented 2 months ago

All I can find in DEBUG log is:

2024-08-22 11:08:52.862 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.niw_50
2024-08-22 11:08:52.913 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.niw_25
2024-08-22 11:08:52.914 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2257, in forgiving_float_filter
    return float(value)
           ^^^^^^^^^^^^
ValueError: could not convert string to float: 'None'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 258, in _handle_refresh_interval
    await self._async_refresh(log_failures=True, scheduled=True)
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 453, in _async_refresh
    self.async_update_listeners()
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 168, in async_update_listeners
    update_callback()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 236, in _handle_coordinator_update
    self._async_update_from_rest_data()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 210, in _async_update_from_rest_data
    value = template.async_render_with_possible_json_value(value, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 771, in async_render_with_possible_json_value
    render_result = _render_with_context(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2638, in _render_with_context
    return template.render(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2260, in forgiving_float_filter
    raise_no_default("float", value)
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 1871, in raise_no_default
    raise ValueError(
ValueError: Template error: float got invalid input 'None' when rendering template '{{ value | replace(",",".") | float }}' but no default was specified
2024-08-22 11:08:52.914 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.date_25
2024-08-22 11:08:52.915 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 258, in _handle_refresh_interval
    await self._async_refresh(log_failures=True, scheduled=True)
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 453, in _async_refresh
    self.async_update_listeners()
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 168, in async_update_listeners
    update_callback()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 236, in _handle_coordinator_update
    self._async_update_from_rest_data()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 210, in _async_update_from_rest_data
    value = template.async_render_with_possible_json_value(value, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 771, in async_render_with_possible_json_value
    render_result = _render_with_context(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2638, in _render_with_context
    return template.render(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2355, in regex_findall_index
    return regex_findall(value, find, ignorecase)[index]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range
2024-08-22 11:10:12.299 DEBUG (MainThread) [homeassistant.components.scrape.coordinator] Raw beautiful soup: <!DOCTYPE html>
<html style="height:100%"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<meta content="noindex" name="robots"/>
<title> 403 Blocked
</title></head>
<body style="color: #444; margin:0;font: normal 14px/20px Arial, Helvetica, sans-serif; height:100%; background-color: #fff;">
<div style="height:auto; min-height:100%; "> <div style="text-align: center; width:800px; margin-left: -400px; position:absolute; top: 30px; left:50%;">
<h1 style="margin:0; font-size:150px; line-height:150px; font-weight:bold;">403</h1>
<h2 style="margin-top:20px;font-size: 30px;">Bot detection
</h2>
<p></p><h1>You were blocked from  78.22.217.119</h1>
<hr/>
<!--<h1>Reason: 0.d6f51202.1724317812.21df19c2</h1>-->
<ul style="list-style-type: none;">
<li>If you are using a VPN, please disable it or configure split tunnelling</li>
<li>Indien u een VPN gebruikt, gelieve deze te willen uitschakelen of de split tunneling te willen configureren</li>
<li>Si vous utilisez un VPN, veuillez le désactiver ou configurer le "split tunneling"</li>
</ul>
<hr/>
<h3>Contact support for more information:<br/></h3>
<iframe height="500" src="https://rossel.emsecure.net/optiext/optiextension.dll?ID=PbkPlhTYZtH_g_auj8bQ7OfcH_gdiLUtiHDT5WZlt8qrAA_5H6dpQAWJSkVmj4zvYmUBafklFzxRHdEdGQCQUWVQVR8Xx&amp;ref=0.d6f51202.1724317812.21df19c2" title="BotManager Support" width="800"></iframe>
<p>
</p>
</div></div>
</body></html>
2024-08-22 11:10:12.299 DEBUG (MainThread) [homeassistant.components.scrape.coordinator] Finished fetching Scrape Coordinator data in 0.115 seconds (success: True)
2024-08-22 11:10:12.300 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.niw_25
2024-08-22 11:10:12.301 DEBUG (MainThread) [homeassistant.components.scrape.sensor] Parsed value: None
2024-08-22 11:10:12.301 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.date_25
2024-08-22 11:10:12.301 DEBUG (MainThread) [homeassistant.components.scrape.sensor] Parsed value: None
2024-08-22 11:10:12.302 ERROR (MainThread) [homeassistant.components.sensor] Error adding entity sensor.date_25 for domain sensor with platform scrape
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 598, in _async_add_entities
    await coro
  File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 912, in _async_add_entity
    await entity.add_to_platform_finish()
  File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 1365, in add_to_platform_finish
    await self.async_added_to_hass()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 202, in async_added_to_hass
    self._async_update_from_rest_data()
  File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 210, in _async_update_from_rest_data
    value = template.async_render_with_possible_json_value(value, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 771, in async_render_with_possible_json_value
    render_result = _render_with_context(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2638, in _render_with_context
    return template.render(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
  File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2355, in regex_findall_index
    return regex_findall(value, find, ignorecase)[index]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range

sammyke007 commented 2 months ago

Index 0 not found, but it has worked for months without any issue...

Using Webscraper (Chrome extension) shows the correct data however:

image

gjohansson-ST commented 2 months ago

Well, the page has blocked the request, maybe because you're polling too often or for some other reason. It's all there if you read the raw data in the debug log.

<!DOCTYPE html>
<html style="height:100%"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<meta content="noindex" name="robots"/>
<title> 403 Blocked
</title></head>
<body style="color: #444; margin:0;font: normal 14px/20px Arial, Helvetica, sans-serif; height:100%; background-color: #fff;">
<div style="height:auto; min-height:100%; "> <div style="text-align: center; width:800px; margin-left: -400px; position:absolute; top: 30px; left:50%;">
<h1 style="margin:0; font-size:150px; line-height:150px; font-weight:bold;">403</h1>
<h2 style="margin-top:20px;font-size: 30px;">Bot detection
</h2>
<p></p><h1>You were blocked from  78.22.217.119</h1>
<hr/>
<!--<h1>Reason: 0.d6f51202.1724317812.21df19c2</h1>-->
<ul style="list-style-type: none;">
<li>If you are using a VPN, please disable it or configure split tunnelling</li>
<li>Indien u een VPN gebruikt, gelieve deze te willen uitschakelen of de split tunneling te willen configureren</li>
<li>Si vous utilisez un VPN, veuillez le désactiver ou configurer le "split tunneling"</li>
</ul>
<hr/>
<h3>Contact support for more information:<br/></h3>
<iframe height="500" src="https://rossel.emsecure.net/optiext/optiextension.dll?ID=PbkPlhTYZtH_g_auj8bQ7OfcH_gdiLUtiHDT5WZlt8qrAA_5H6dpQAWJSkVmj4zvYmUBafklFzxRHdEdGQCQUWVQVR8Xx&amp;ref=0.d6f51202.1724317812.21df19c2" title="BotManager Support" width="800"></iframe>
<p>
</p>
</div></div>
</body></html>
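
A response like the one quoted above can be recognised before any selector is applied. The following is a minimal Python sketch using only the standard library. `TitleParser` and `looks_blocked` are hypothetical names, not part of Home Assistant or the Scrape integration, and matching on the title text "403 Blocked" is an assumption based solely on the page shown in this thread.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the document's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(html):
    """Return True if the page title matches the block page from the log."""
    parser = TitleParser()
    parser.feed(html)
    # Collapse whitespace, since the title in the log spans two lines.
    title = " ".join(parser.title.split()).lower()
    return "403 blocked" in title
```

Feeding it the HTML from the debug log would return `True`, which explains why both selectors find nothing: the expected `#container` element is not in the document at all.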

sammyke007 commented 2 months ago

But I can access it with Chrome and other browsers without any problem?

Can the bot detection be fooled?

WebSpider commented 2 months ago

> But I can access it with Chrome and other browsers without any problem?
>
> Can the bot detection be fooled?

Probably, but:

  1. It's a cat and mouse game between the website operators and you
  2. You would probably be better off just asking for permission
  3. How to do this isn't really within the domain of home-assistant, as it varies widely from website to website

This is exactly one of the reasons why the scrape integration has this warning on its integration page:

> As this is not a full-blown web scraper like scrapy, it will most likely only work with simple web pages and it can be time-consuming to get the right section.