scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License
528 stars 96 forks source link

skip_if decorator for monitors #339

Open rennerocha opened 2 years ago

rennerocha commented 2 years ago

This came from a real use case:

FieldCoverageMonitor requires that we have at least one item returned, or else it will fail. However there are situation when we allow a spider to return no items. Think about a real estate website searching for locations in a certain neighborhood, sometimes even with the right code, we don't have any result, so we should not fail our monitors.

This is not possible with our built-in monitors, and the solution is create a new custom monitor, base in a built-in monitor that implements this check before doing the test.

pytest library has a skipif decorator (https://docs.pytest.org/en/6.2.x/skipping.html#id1) where you can set some rules and if these rules are not met, the test are not executed. We could consider to add something like that in Spidermon so we can decide if we want to execute or skip a test considering a condition (check a stat value could be enough).

Some options to accomplish that:

Opening this issue to get more ideas and suggestions if this is a good idea and if so, what is the best solution for that.

shafiq-muhammad commented 1 year ago

Hello @rennerocha Thank you for the explanation and the suggested solution. Just for my better understanding, can you please explain a bit more that if we can achieve this by using update_settings method in the spider? Like if a spider had not produced any item then we can remove the FieldCoverageMonitor monitor using update_settings method. Why we should have a solution in Spidermon itself?

Thank you.

VMRuiz commented 1 year ago

Wouldn't it be better to just disable this monitor at all for Spiders that may yield 0 items?

The only advantage that I see to the skipif functionally is that the spider can create checkpoints indicating that it tried to extract items but none was found vs the spider failing at the begining of the crawl and closing before finding any item. But that case would probably detected by ERROR_LOGs monitor or similar.

rennerocha commented 1 year ago

Hello @rennerocha Thank you for the explanation and the suggested solution. Just for my better understanding, can you please explain a bit more that if we can achieve this by using update_settings method in the spider?

This can't be achieved using update_settings as we will only know that the monitor/monitorsuite should/shouldn't be executed in runtime. update_settings is applied only when the spider is started.

rennerocha commented 1 year ago

Wouldn't it be better to just disable this monitor at all for Spiders that may yield 0 items?

If the spider yield more than 1 item, I do want the monitor to be executed. But the spider may eventually yield 0 items so I don't want the monitors to be executed in that case. I can't simply disable it.

rennerocha commented 1 year ago

@VMRuiz When I thought about the skip_ifsolution, my primary goal was to allow the user define any conditional rule to skip the monitor, not only based on the number of items scraped. For example, if we have a monitor that access a database to get some information that will be used inside it, but we don't want that costly procedure to be executed if some specific stat is not set (or set to a specific value), we could skip it using this decorator.

If not implemented, we still have the problem with the FieldCoverageMonitor as it is a scenario that happened in a project with dozens of spiders running. One possible solution (that looks much simpler to implement) would be add a new setting called SPIDERMON_FIELD_COVERAGE_SKIP_IF_NO_ITEM (or something similar). Is set to True, if no item is returned, the field coverage will not be executed, otherwise field coverage will be checked. I opened #366 so we can discuss this specific scenario there.

rennerocha commented 1 year ago

If something is going to be implemented in Spidermon, #366 looks much more important with real use cases than this issue.

VMRuiz commented 1 year ago

Wouldn't it be better to just disable this monitor at all for Spiders that may yield 0 items?

If the spider yield more than 1 item, I do want the monitor to be executed. But the spider may eventually yield 0 items so I don't want the monitors to be executed in that case. I can't simply disable it.

Maybe in just focusing too much in a technicism here but, what is the difference between skippipng a monitor and not executing it? Why would you want to have a monitor that will always be ok?

For example, if we have a monitor that access a database to get some information that will be used inside it, but we don't want that costly procedure to be executed if some specific stat is not set (or set to a specific value), we could skip it using this decorator.

This user case looks much more interesting.

rennerocha commented 1 year ago

Maybe in just focusing too much in a technicism here but, what is the difference between skipping a monitor and not executing it? Why would you want to have a monitor that will always be ok?

I am not sure if you really understood the use-case. One thing is disable a monitor, so it will never be executed. Other scenario is, if a certain condition is met (i.e. no items returned, a certain number of errors happen, a finished reason is something different for a certain value, etc) the monitor is not executed (skipped), but if the condition is not met, the monitor is executed.

These conditions are defined in runtime, so we can not disable the monitor beforehand.

In the end, monitors are just like test cases, so the definition for skipping a test in pytest documentation is valid here (replace test with monitor and it should be clearer):

A skip means that you expect your test to pass only if some conditions are met, otherwise pytest should skip running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping tests that depend on an external resource which is not available at the moment (for example a database).

curita commented 1 year ago

Some options to accomplish that:

  • Create a skipif decorator that can be applied to a MonitorSuite class, so the entire suite will not be executed whether the condition is not met. This looks simpler, as we can use our existing built-in monitors as-it-is now and the control is
  • Create a skipif decorator for the test_ methods, so that method is not executed whether the condition is met. This has the drawback that when we want to use a built-in monitor, we will need to create a custom monitor just to add the decorator

I've been thinking about it, and after the last consideration, I'm starting to understand more about what Muhammad has started implementing at https://github.com/scrapinghub/spidermon/pull/384.

It would be nice to have multiple options: to disable full monitor classes, specific test_* methods, or whole suites. But to disable specific monitors, we need to extend the class to add the decorator, as mentioned in the ticket's description. And in that situation, nothing prevents us from codifying the skip condition in the method itself. We've done most of the job already.

But if we go away with implementing it as a decorator, we could add it as this new SPIDERMON_MONITOR_SKIPPING_RULES setting.

That way, we could add skip rules for built-in monitors without having to extend them.

This is mostly @VMRuiz and @shafiq-muhammad's idea as I understand it but I wanted to post it here for context and to say that I like it 😅

VMRuiz commented 4 months ago

Do we still need this decorator or is it just enough with the current SPIDERMON_MONITOR_SKIPPING_RULES?

curita commented 4 months ago

@VMRuiz I think SPIDERMON_MONITOR_SKIPPING_RULES should be enough for now 👍 The setting is not easy to use though but we can work on usability improvements as separate tickets and close this one.

For reference, we talked about some details in this (private) slack thread about how to use it a few months ago, but the bottom line was that:

It was a bit harder to make it work as expected. Probably those docs should be updated when they are available on ReadTheDocs:

  1. The Spidermon version needs to be >=1.20.0
  2. It seems it disables whole monitor classes. a. This might be inconvenient if you want to disable a specific test function inside a monitor class. Maybe support for disabling specific test functions could be added later on.
  3. The monitor class needs to inherit from BaseScrapyMonitor
  4. The monitor suite needs to inherit from SpiderCloseMonitorSuite
  5. Then the SPIDERMON_MONITOR_SKIPPING_RULES setting should work as expected