Very long initialisation of the Scrape component during HA restart

Spirituss commented 2 years ago

The problem

It seems unreasonable that initialisation of the component, which has only 10 sensors in my HA configuration, takes more than 200 seconds (!!!) on an Intel Celeron J3455 that is not loaded with any other tasks.

Screenshot 2022-06-03 at 11 28 19

What version of Home Assistant Core has the issue?

core-2022.5.4

What was the last working version of Home Assistant Core?

None

What type of installation are you running?

Home Assistant Container

Integration causing the issue

Scrape

Link to integration documentation on our website

https://www.home-assistant.io/integrations/scrape

Diagnostics information

The example of my scrape sensors:

  - name: pollen_domination_pollenclub 
    resource: 'https://pollen.club'
    platform: scrape
    select: 'div[class$="main-data--stats-numbers"]'
    index: 0
    value_template: '{{ (value | default("No data")) if value else "No data" }}'
    scan_interval: 3600

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

probot-home-assistant[bot] commented 2 years ago

Hey there @fabaff, mind taking a look at this issue as it has been labeled with an integration (scrape) you are listed as a code owner for? Thanks! (message by CodeOwnersMention)

scrape documentation, scrape source (message by IssueLinks)

fabaff commented 2 years ago

Your sensor example takes around 19 s when I run it on a Raspberry Pi 4. Loading the page in a browser takes 22 s for me. I'm not sure how much the request time influences the startup time here.

My guess is that the 200 s are coming from a glitch in the communication.

gjohansson-ST commented 2 years ago

Should be something in the logs I would guess.

fabaff commented 2 years ago

You could try a more verbose log level, but I doubt it would help because the request is successful. Pollen.club seems to have an API according to this page. If they allow access, a rest sensor could retrieve the data more efficiently (e.g., avoiding the almost 300 requests needed to get all content and skipping the download of almost 20 MB).
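For anyone who wants to try the more verbose log level, a minimal sketch of a `logger` entry in `configuration.yaml` might look like the following. The logger names are an assumption based on the integrations' module paths (scrape fetches the page through the rest integration's data helper), and the `default` level is only illustrative.

```yaml
# Sketch only: raise log verbosity for the scrape integration
# and for the rest integration it relies on to fetch the page.
logger:
  default: warning            # illustrative; keep whatever default you already use
  logs:
    homeassistant.components.scrape: debug
    homeassistant.components.rest: debug
```

After a restart, the timestamps of the debug entries should give a rough idea of how long each fetch takes during startup.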

Spirituss commented 2 years ago

> You could give it a try with a more verbose log level but I doubt that this is helpful because the request is successful. Pollen.club seems to have an API according to this [page]. If they allow access then a rest sensor could retrieve the data in a more efficient way (e.g., avoiding almost 300 requests to get all content and skipping the download of almost 20 MB).

I've contacted them; unfortunately, they don't have an open API. But for "heavy" sites, one solution could be to download only the HTML part, without media, instead of the whole site content. Ad links and user counters could also be skipped. Is this possible to implement?

github-actions[bot] commented 2 years ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
