Closed sammyke007 closed 2 months ago
Hey there @fabaff, @gjohansson-st, mind taking a look at this issue as it has been labeled with an integration (scrape) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
scrape documentation scrape source (message by IssueLinks)
Could you enable debug logging so it prints what it actually gets from the webpage?
All I can find in DEBUG log is:
2024-08-22 11:08:52.862 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.niw_50
2024-08-22 11:08:52.913 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.niw_25
2024-08-22 11:08:52.914 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2257, in forgiving_float_filter
return float(value)
^^^^^^^^^^^^
ValueError: could not convert string to float: 'None'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 258, in _handle_refresh_interval
await self._async_refresh(log_failures=True, scheduled=True)
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 453, in _async_refresh
self.async_update_listeners()
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 168, in async_update_listeners
update_callback()
File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 236, in _handle_coordinator_update
self._async_update_from_rest_data()
File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 210, in _async_update_from_rest_data
value = template.async_render_with_possible_json_value(value, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 771, in async_render_with_possible_json_value
render_result = _render_with_context(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2638, in _render_with_context
return template.render(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
self.environment.handle_exception()
File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
raise rewrite_traceback_stack(source=source)
File "<template>", line 1, in top-level template code
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2260, in forgiving_float_filter
raise_no_default("float", value)
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 1871, in raise_no_default
raise ValueError(
ValueError: Template error: float got invalid input 'None' when rendering template '{{ value | replace(",",".") | float }}' but no default was specified
2024-08-22 11:08:52.914 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.date_25
2024-08-22 11:08:52.915 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 258, in _handle_refresh_interval
await self._async_refresh(log_failures=True, scheduled=True)
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 453, in _async_refresh
self.async_update_listeners()
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 168, in async_update_listeners
update_callback()
File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 236, in _handle_coordinator_update
self._async_update_from_rest_data()
File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 210, in _async_update_from_rest_data
value = template.async_render_with_possible_json_value(value, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 771, in async_render_with_possible_json_value
render_result = _render_with_context(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2638, in _render_with_context
return template.render(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
self.environment.handle_exception()
File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
raise rewrite_traceback_stack(source=source)
File "<template>", line 1, in top-level template code
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2355, in regex_findall_index
return regex_findall(value, find, ignorecase)[index]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range
2024-08-22 11:10:12.299 DEBUG (MainThread) [homeassistant.components.scrape.coordinator] Raw beautiful soup: <!DOCTYPE html>
<html style="height:100%"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<meta content="noindex" name="robots"/>
<title> 403 Blocked
</title></head>
<body style="color: #444; margin:0;font: normal 14px/20px Arial, Helvetica, sans-serif; height:100%; background-color: #fff;">
<div style="height:auto; min-height:100%; "> <div style="text-align: center; width:800px; margin-left: -400px; position:absolute; top: 30px; left:50%;">
<h1 style="margin:0; font-size:150px; line-height:150px; font-weight:bold;">403</h1>
<h2 style="margin-top:20px;font-size: 30px;">Bot detection
</h2>
<p></p><h1>You were blocked from 78.22.217.119</h1>
<hr/>
<!--<h1>Reason: 0.d6f51202.1724317812.21df19c2</h1>-->
<ul style="list-style-type: none;">
<li>If you are using a VPN, please disable it or configure split tunnelling</li>
<li>Indien u een VPN gebruikt, gelieve deze te willen uitschakelen of de split tunneling te willen configureren</li>
<li>Si vous utilisez un VPN, veuillez le désactiver ou configurer le "split tunneling"</li>
</ul>
<hr/>
<h3>Contact support for more information:<br/></h3>
<iframe height="500" src="https://rossel.emsecure.net/optiext/optiextension.dll?ID=PbkPlhTYZtH_g_auj8bQ7OfcH_gdiLUtiHDT5WZlt8qrAA_5H6dpQAWJSkVmj4zvYmUBafklFzxRHdEdGQCQUWVQVR8Xx&ref=0.d6f51202.1724317812.21df19c2" title="BotManager Support" width="800"></iframe>
<p>
</p>
</div></div>
</body></html>
2024-08-22 11:10:12.299 DEBUG (MainThread) [homeassistant.components.scrape.coordinator] Finished fetching Scrape Coordinator data in 0.115 seconds (success: True)
2024-08-22 11:10:12.300 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.niw_25
2024-08-22 11:10:12.301 DEBUG (MainThread) [homeassistant.components.scrape.sensor] Parsed value: None
2024-08-22 11:10:12.301 WARNING (MainThread) [homeassistant.components.scrape.sensor] Index '0' not found in sensor.date_25
2024-08-22 11:10:12.301 DEBUG (MainThread) [homeassistant.components.scrape.sensor] Parsed value: None
2024-08-22 11:10:12.302 ERROR (MainThread) [homeassistant.components.sensor] Error adding entity sensor.date_25 for domain sensor with platform scrape
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 598, in _async_add_entities
await coro
File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 912, in _async_add_entity
await entity.add_to_platform_finish()
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 1365, in add_to_platform_finish
await self.async_added_to_hass()
File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 202, in async_added_to_hass
self._async_update_from_rest_data()
File "/usr/src/homeassistant/homeassistant/components/scrape/sensor.py", line 210, in _async_update_from_rest_data
value = template.async_render_with_possible_json_value(value, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 771, in async_render_with_possible_json_value
render_result = _render_with_context(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2638, in _render_with_context
return template.render(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 1304, in render
self.environment.handle_exception()
File "/usr/local/lib/python3.12/site-packages/jinja2/environment.py", line 939, in handle_exception
raise rewrite_traceback_stack(source=source)
File "<template>", line 1, in top-level template code
File "/usr/src/homeassistant/homeassistant/helpers/template.py", line 2355, in regex_findall_index
return regex_findall(value, find, ignorecase)[index]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range
Index 0 not found, but it has worked for months without any issue...
Using Webscraper (Chrome extension) still shows the correct data, however.
Well, the page has blocked the request, maybe because you're polling too often or for some other reason. It's all there if you read the raw data in the debug log:
<!DOCTYPE html>
<html style="height:100%"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<meta content="noindex" name="robots"/>
<title> 403 Blocked
</title></head>
<body style="color: #444; margin:0;font: normal 14px/20px Arial, Helvetica, sans-serif; height:100%; background-color: #fff;">
<div style="height:auto; min-height:100%; "> <div style="text-align: center; width:800px; margin-left: -400px; position:absolute; top: 30px; left:50%;">
<h1 style="margin:0; font-size:150px; line-height:150px; font-weight:bold;">403</h1>
<h2 style="margin-top:20px;font-size: 30px;">Bot detection
</h2>
<p></p><h1>You were blocked from 78.22.217.119</h1>
<hr/>
<!--<h1>Reason: 0.d6f51202.1724317812.21df19c2</h1>-->
<ul style="list-style-type: none;">
<li>If you are using a VPN, please disable it or configure split tunnelling</li>
<li>Indien u een VPN gebruikt, gelieve deze te willen uitschakelen of de split tunneling te willen configureren</li>
<li>Si vous utilisez un VPN, veuillez le désactiver ou configurer le "split tunneling"</li>
</ul>
<hr/>
<h3>Contact support for more information:<br/></h3>
<iframe height="500" src="https://rossel.emsecure.net/optiext/optiextension.dll?ID=PbkPlhTYZtH_g_auj8bQ7OfcH_gdiLUtiHDT5WZlt8qrAA_5H6dpQAWJSkVmj4zvYmUBafklFzxRHdEdGQCQUWVQVR8Xx&ref=0.d6f51202.1724317812.21df19c2" title="BotManager Support" width="800"></iframe>
<p>
</p>
</div></div>
</body></html>
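To make the point above concrete: the fetch itself succeeds (the coordinator logs `success: True`), but the body is a block page, so the configured selectors match nothing and the parsed value is `None`. The following is a minimal sketch, not part of the scrape integration, of how such a response could be recognised from its `<title>` using only the Python standard library. The names `TitleParser` and `looks_blocked` are hypothetical, and the "403"/"blocked" check is an assumption based on the page quoted in this issue.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(html: str) -> bool:
    """Return True if the page title suggests a block page (assumed heuristic)."""
    parser = TitleParser()
    parser.feed(html)
    title = parser.title.strip().lower()
    return "403" in title or "blocked" in title
```

Run against the HTML in the debug log, this would report the page as blocked, which matches the `Index '0' not found` warnings that follow it.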
But I can access it with Chrome and other browsers without any problem?
Can the bot detection be fooled?
Probably, but:
This is exactly one of the reasons why the scrape integration has this warning on its integration page:
As this is not a full-blown web scraper like scrapy, it will most likely only work with simple web pages and it can be time-consuming to get the right section.
The problem
A couple of weeks ago, my Scrape sensors stopped working. I didn't change anything and, AFAIK, the scraped website didn't change either. If I scrape using e.g. Webscraper (Chrome), it still shows me the requested data.
The page I'm scraping with GET: https://www.tijd.be/customers/mediafin.be/funds_tijd/1423098/Fund/60052461/
I'm scraping these 2 elements:
Price: Selector = #container > header.clearfix.header-stats > div:nth-child(1) > span > span Value template = {{ value|replace(".","")|replace(",",".")|float(0) }}
Date: Selector = #container > header.clearfix.header-stats > div:nth-child(1) > label Value template = {{ as_timestamp(strptime((value|regex_findall_index(find='([0-9]+/[0-9]+/[0-9]+)',index=0, ignorecase=False)), "%d/%m/%Y")) | timestamp_custom('%d-%m-%Y') }}
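For readers less familiar with the template filters, here is a rough Python equivalent of what the two value templates compute. It is a sketch under assumptions: `parse_price` and `parse_date` are hypothetical names, time zone handling in `as_timestamp` and `timestamp_custom` is ignored, and the sample input strings are made up for illustration. It also shows why the date template fails when the selector yields `None`: the regex finds no match, so taking index 0 raises an `IndexError` unless that case is guarded.

```python
import re
from datetime import datetime


def parse_price(value, default=0.0):
    """Rough mirror of: value | replace(".","") | replace(",",".") | float(0)."""
    try:
        # Drop thousands separators, then turn the decimal comma into a point.
        return float(str(value).replace(".", "").replace(",", "."))
    except ValueError:
        # Same idea as float(0): fall back to a default instead of raising.
        return default


def parse_date(value):
    """Rough mirror of the regex_findall_index + strptime template.

    Unlike the template, this returns None when no date is found
    instead of raising IndexError.
    """
    matches = re.findall(r"([0-9]+/[0-9]+/[0-9]+)", str(value))
    if not matches:
        return None
    return datetime.strptime(matches[0], "%d/%m/%Y").strftime("%d-%m-%Y")
```

With a scraped value of `None`, `parse_price` falls back to its default, while an unguarded `matches[0]` would fail in the same way as the `regex_findall_index` traceback in the logs.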
What version of Home Assistant Core has the issue?
core-2024.8.2
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
Scrape
Link to integration documentation on our website
https://www.home-assistant.io/integrations/scrape
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
AND
Additional information