Closed rpatterson closed 8 months ago
Hey there @fabaff, @gjohansson-st, mind taking a look at this issue as it has been labeled with an integration (scrape) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
scrape documentation scrape source (message by IssueLinks)
Scrape is not a full fledged browser experience in that manner so I'm not surprised this might not be working.
I would suggest in this case that you would use command_line instead, using your curl command there to extract this to a sensor instead of using scrape.
Going to close this issue as it's not something that is to be resolved with scrape. Thanks 👍
Scrape is not a full fledged browser experience
Well, the issue here isn't that it's not a full fledged browser experience; neither is a $ curl -H ... command. The issue is that the scrape integration is acting less faithfully than another non-browser experience such as $ curl -H ... does. How specifically is scrape behaving differently than $ curl -H ...?
Going to close this issue as it's not something that is to be resolved with scrape.
I would think that doing away with any behavior that deviates from other standard non-browser tools would be well within scope for the scrape integration. I would understand it being low priority and closing this issue as not planned, but closing as completed seems incorrect. At the very least, any such deviations should be documented.
I pressed the wrong button, hence "as completed", but I have changed that now.
Anyway, scrape depends on Beautiful Soup, so I suggest you read their documentation for further questions on this.
Anyway scrape depends on beautiful Soup so I suggest you read their documentation for further questions on this.
This isn't a matter of identifying what in the response to scrape, which is what Beautiful Soup is used for; this is a matter of the HTTP request and response, @gjohansson-ST. AFAIK, Beautiful Soup doesn't offer any way to submit an HTTP request and retrieve the response in its API; it only supports being given the response as an argument.
Yeah, sorry, wasn't reading carefully enough before answering.
So scrape uses the rest integration, which in turn uses httpx for the communication with the resource.
I don't know that there should be any limitations to this in comparison with, as an example, curl.
Are there any special characters or something inside this dictionary which might get manipulated somehow along the way? Can you share the config so it's possible to see or test from my end?
So scrape uses the rest integration which in turn uses httpx for the communication with the resource. I don't know that there should be any limitations to this in comparison with, as example, curl.
Yeah, I think that's the knowledge where any answer to this is going to come from.
Are there any special characters or something inside this dictionary which might get manipulated somehow along the way?
Yeah, or maybe in the YAML deserialization. That's why I described the process of converting the $ curl -H ... command into a scrape configuration, to include how the YAML is formed. I also automated this conversion using tools in my editor to rule out typos.
To inspect the resulting requests, I bumped up any logging levels I thought might be involved with:
logger:
  logs:
    httpx: "debug"
    homeassistant.components.rest: "debug"
    homeassistant.components.scrape: "debug"
But I don't see any output about the request that HA sends that allows me to inspect any resulting differences:
- I don't get any homeassistant.components.rest or httpx logging messages for my scrape sensors, only for my rest sensors. IOW, I only get homeassistant.components.scrape logger messages for my scrape sensors regardless of how the homeassistant.components.rest or httpx loggers are configured.
- Even for my rest sensors, there's nothing in the homeassistant.components.rest or httpx logging messages that would allow me to inspect the resulting HTTP requests.
How can I inspect the requests being sent to identify the difference as compared to a request sent by $ curl -H ...?
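One possible way to inspect outgoing requests without relying on the integration's logging is to point both the sensor and the curl command at a small local server that echoes back the headers it receives, and then compare the two results. The sketch below uses only the Python standard library. The names EchoHeaders, start_echo_server and received are hypothetical, and it shows only what is visible at the HTTP level.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # one dict of request headers per GET, in arrival order


class EchoHeaders(BaseHTTPRequestHandler):
    """Answer every GET with a JSON dump of the request headers."""

    def do_GET(self):
        # Repeated header names collapse to the last value in this dict.
        headers = {name: value for name, value in self.headers.items()}
        received.append(headers)
        body = json.dumps(headers, indent=2).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line on stderr


def start_echo_server(host="127.0.0.1", port=0):
    """Start the echo server in a daemon thread; port=0 picks a free port."""
    server = HTTPServer((host, port), EchoHeaders)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

To use it, start the server on a fixed port, keep the process alive, and temporarily set the sensor's resource to that address. The server has to be reachable from wherever Home Assistant runs, so with a container install the host and port would need adjusting. Sending the copied curl command to the same address then gives a second entry in received to compare against.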
Can you share the config so it's possible to see or test from my end?
Well, it's not much more than what I describe in the description, and it contains private information such as authentication tokens, so I can't include the exact header: values, which is likely where any configuration or user error would be. But here's an example:
scrape:
  - resource: "https://www.example.com/foo.html"
    headers:
      Authority: 'www.example.com'
      ...
      Cookie: 'uid=...; pass=...; cf_clearance=...'
      User-Agent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
      ...
    sensor:
      - unique_id: "example_foo"
        name: "Example Foo"
        select: >-
          #foo-elem
        state_class: "measurement"
        value_template: >-
          {% if not value is none %}
            {{ value.strip().split(" ", 1)[0]|int }}
          {% else %}
            0
          {% endif %}
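For readers less familiar with Jinja, the value_template in the example amounts to roughly the following plain Python. This is an approximation only: the function name first_int is hypothetical, and it assumes the selected element's text begins with an integer. How Home Assistant's int filter behaves when the conversion fails is not reproduced here.

```python
def first_int(value):
    """Return the leading whitespace-separated token of value as an int.

    Mirrors the template: 0 when nothing was scraped, otherwise the
    first token of the stripped text converted to an integer.
    """
    if value is None:
        return 0
    token = value.strip().split(" ", 1)[0]
    return int(token)  # raises ValueError if the token is not an integer
```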
The problem
I have a scrape sensor that was working and then became unavailable because the resource started responding with a CloudFlare challenge. I answered the challenge in my browser and used the browser's developer tools to copy a $ curl command to fully reproduce the working request. I updated the scrape sensor by putting each -H option from the curl command into a header: in the config and then quoting the value with single '...' quotes to preserve double quotes in the header values, but the sensor request was still getting the challenge. I ran the very same $ curl command inside the container using $ docker compose exec home-assistant curl ... and it succeeds without the challenge. So something in the scrape integration is not reproducing requests as faithfully as the copied $ curl command is.
What version of Home Assistant Core has the issue?
core-2024.1.5
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant Container
Integration causing the issue
scrape
Link to integration documentation on our website
https://www.home-assistant.io/integrations/scrape/
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
No response