home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Scrape integration doesn't faithfully reproduce requests #108983

Closed · rpatterson closed this issue 8 months ago

rpatterson commented 9 months ago

The problem

I have a scrape sensor that was working and then became unavailable because the resource started responding with a Cloudflare challenge. I answered the challenge in my browser and used the browser's developer tools to copy a $ curl command that fully reproduces the working request. I then updated the scrape sensor by putting each -H option from the curl command into the headers: section of the config, quoting each value with single '...' quotes to preserve the double quotes inside the header values, but the sensor request was still getting the challenge. Running the very same $ curl command inside the container using $ docker compose exec home-assistant curl ... succeeds without the challenge. So something in the scrape integration is not reproducing requests as faithfully as the copied $ curl command does.
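
For concreteness, here is a sketch of the kind of conversion I mean, with placeholder header names and values rather than my real config:

# A copied curl option such as:
#   curl 'https://www.example.com/foo.html' -H 'sec-ch-ua: "Chromium";v="123"' -H 'cookie: cf_clearance=...'
# becomes a headers: entry, single-quoted so the inner double quotes survive:
scrape:
  - resource: "https://www.example.com/foo.html"
    headers:
      sec-ch-ua: '"Chromium";v="123"'
      cookie: 'cf_clearance=...'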

What version of Home Assistant Core has the issue?

core-2024.1.5

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

scrape

Link to integration documentation on our website

https://www.home-assistant.io/integrations/scrape/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

home-assistant[bot] commented 9 months ago

Hey there @fabaff, @gjohansson-st, mind taking a look at this issue as it has been labeled with an integration (scrape) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `scrape` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign scrape` Removes the current integration label and assignees on the issue, add the integration domain after the command.
- `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
- `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


scrape documentation scrape source (message by IssueLinks)

gjohansson-ST commented 8 months ago

Scrape is not a full-fledged browser experience in that manner, so I'm not surprised this might not be working. I would suggest in this case that you use command_line instead, running your curl command there to extract the value into a sensor, rather than using scrape.
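
Roughly along these lines, as a sketch with placeholder values rather than a tested config:

command_line:
  - sensor:
      name: "Example Foo"
      # Run the copied curl command directly; -s suppresses the progress output.
      command: >-
        curl -s 'https://www.example.com/foo.html'
        -H 'cookie: cf_clearance=...'
        -H 'user-agent: Mozilla/5.0 ...'
      # Pull the number out of the returned HTML; adjust the pattern to the actual markup.
      value_template: >-
        {{ value | regex_findall_index('id="foo-elem"[^>]*>\s*([0-9]+)') }}
      scan_interval: 300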

Going to close this issue as it's not something that is to be resolved with scrape. Thanks 👍

rpatterson commented 8 months ago

> Scrape is not a full-fledged browser experience

Well, the issue here isn't that it's not a full-fledged browser experience; neither is $ curl -H .... The issue is that the scrape integration is behaving less faithfully than another non-browser tool such as $ curl -H .... How specifically is scrape behaving differently than $ curl -H ...?

> Going to close this issue as it's not something that is to be resolved with scrape.

I would think that doing away with any behavior that deviates from other standard non-browser tools is well within scope for the scrape integration. I would understand it being low priority and closing this issue as "not planned", but closing it as "completed" seems incorrect. At the very least, any such deviations should be documented.

gjohansson-ST commented 8 months ago

I pressed the wrong one, hence "closed as completed", but I have changed that now.

Anyway, scrape depends on Beautiful Soup, so I suggest you read their documentation for further questions on this.

rpatterson commented 8 months ago

> Anyway, scrape depends on Beautiful Soup, so I suggest you read their documentation for further questions on this.

This isn't a matter of identifying what to scrape in the response, which is what Beautiful Soup is used for; it's a matter of the HTTP request and response, @gjohansson-ST. AFAIK, Beautiful Soup doesn't offer any way to submit an HTTP request and retrieve the response in its API; it only supports being given the response as an argument.

gjohansson-ST commented 8 months ago

Yeah, sorry, wasn't reading carefully enough before answering.

So scrape uses the rest integration, which in turn uses httpx for the communication with the resource. I don't know that there should be any limitations to this in comparison with, for example, curl.

Are there any special characters or anything else inside this dictionary that might get manipulated somehow along the way? Can you share the config so it's possible to see or test it from my end?

rpatterson commented 8 months ago

> So scrape uses the rest integration, which in turn uses httpx for the communication with the resource. I don't know that there should be any limitations to this in comparison with, for example, curl.

Yeah, I think that's where the knowledge to answer this is going to come from.

> Are there any special characters or anything else inside this dictionary that might get manipulated somehow along the way?

Yeah, or maybe in the YAML deserialization. That's why I described, in the issue description, the process of converting the $ curl -H ... command into a scrape configuration, including how the YAML is formed. I also automated this conversion using tools in my editor to rule out typos.

To inspect the resulting requests, I bumped up any logging levels I thought might be involved with:

logger:
  logs:
    httpx: "debug"
    homeassistant.components.rest: "debug"
    homeassistant.components.scrape: "debug"

But I don't see any output about the request that HA sends that would allow me to inspect any resulting differences:

  1. I don't see any homeassistant.components.rest or httpx logging messages for my scrape sensors, only for my rest sensors. In other words, I only get homeassistant.components.scrape logger messages for my scrape sensors, regardless of how the homeassistant.components.rest or httpx loggers are configured.
  2. Even for my rest sensors, there's nothing in the homeassistant.components.rest or httpx logging messages that would allow me to inspect the resulting HTTP requests.

How can I inspect the requests being sent to identify the difference as compared to a request sent by $ curl -H ...?
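
One approach I can think of, as a rough sketch: point both Home Assistant and curl at a header-echoing endpoint such as https://httpbin.org/headers (a third-party echo service) and diff what each one actually sends. Since scrape reportedly goes through rest, a rest sensor with the same headers should exercise the same code path:

rest:
  - resource: "https://httpbin.org/headers"
    headers:
      # Same headers as the real config, pointed at an echo endpoint
      # so the response body shows exactly what was sent.
      Cookie: 'uid=...; pass=...; cf_clearance=...'
      User-Agent: 'Mozilla/5.0 ...'
    sensor:
      - name: "Header echo"
        # Keep the full echoed headers as an attribute to avoid the 255 character state limit.
        json_attributes: ["headers"]
        value_template: "{{ value_json.headers.get('User-Agent') }}"

# For comparison, the same request via curl from inside the container:
#   docker compose exec home-assistant curl -s https://httpbin.org/headers \
#     -H 'Cookie: uid=...; pass=...; cf_clearance=...' \
#     -H 'User-Agent: Mozilla/5.0 ...'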

> Can you share the config so it's possible to see or test it from my end?

Well, it's not much more than what I describe in the description, and it contains private information such as authentication tokens, so I can't include the exact headers: values, which are likely where any configuration or user error would be. But here's an example:

scrape:
  - resource: "https://www.example.com/foo.html"
    headers:
      Authority: 'www.example.com'
      ...
      Cookie: 'uid=...; pass=...; cf_clearance=...'
      User-Agent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
      ...
    sensor:
      - unique_id: "example_foo"
        name: "Example Foo"
        select: >-
          #foo-elem
        state_class: "measurement"
        value_template: >-
          {% if not value is none %}
            {{ value.strip().split(" ", 1)[0]|int }}
          {% else %}
            0
          {% endif %}