scrapinghub / spidermon

Scrapy extension for monitoring spider execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License
526 stars 94 forks

Support for list Field Coverage #391

Closed mrwbarg closed 1 year ago

mrwbarg commented 1 year ago

Closes #390.

When a field scraped by a spider is a list of objects, there is currently no way to set coverage thresholds for the fields inside those objects. This PR adds support for correctly counting and calculating coverage for those types of fields, both at the top level of the item and inside nested structures.
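A minimal sketch of the idea (illustrative only, not the PR's actual implementation; function and path names are assumptions): count occurrences of each field path recursively, descending into lists of objects so that fields inside list elements contribute to coverage.

```python
from collections import Counter

def count_fields(obj, prefix="", counts=None):
    """Recursively count field occurrences, descending into dicts and
    lists of dicts. Illustrative sketch, not spidermon's actual code."""
    if counts is None:
        counts = Counter()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}/{key}" if prefix else key
            counts[path] += 1
            count_fields(value, path, counts)
    elif isinstance(obj, list):
        # Each element of a list of objects contributes to coverage
        # under the same field path.
        for element in obj:
            count_fields(element, prefix, counts)
    return counts

item = {
    "name": "Product A",
    "variants": [
        {"sku": "A1", "price": 10},
        {"sku": "A2"},  # missing "price"
    ],
}
counts = count_fields(item)
# counts["variants/sku"] == 2, counts["variants/price"] == 1
```

With counts like these, coverage for `variants/price` can be computed as occurrences divided by the number of `variants` elements seen, which is what per-field thresholds need.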

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 95.23% and project coverage change: +0.09% :tada:

Comparison is base (0cf783f) 76.44% compared to head (2d4489d) 76.54%.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #391      +/-   ##
==========================================
+ Coverage   76.44%   76.54%   +0.09%
==========================================
  Files          76       76
  Lines        3197     3214      +17
  Branches      379      384       +5
==========================================
+ Hits         2444     2460      +16
  Misses        683      683
- Partials       70       71       +1
```

| [Impacted Files](https://codecov.io/gh/scrapinghub/spidermon/pull/391?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub) | Coverage Δ | |
|---|---|---|
| [spidermon/contrib/scrapy/monitors/monitors.py](https://codecov.io/gh/scrapinghub/spidermon/pull/391?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub#diff-c3BpZGVybW9uL2NvbnRyaWIvc2NyYXB5L21vbml0b3JzL21vbml0b3JzLnB5) | `97.89% <ø> (ø)` | |
| [spidermon/contrib/scrapy/extensions.py](https://codecov.io/gh/scrapinghub/spidermon/pull/391?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub#diff-c3BpZGVybW9uL2NvbnRyaWIvc2NyYXB5L2V4dGVuc2lvbnMucHk=) | `85.57% <90.00%> (+0.16%)` | :arrow_up: |
| [spidermon/utils/field_coverage.py](https://codecov.io/gh/scrapinghub/spidermon/pull/391?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub#diff-c3BpZGVybW9uL3V0aWxzL2ZpZWxkX2NvdmVyYWdlLnB5) | `100.00% <100.00%> (ø)` | |


mrwbarg commented 1 year ago

> I think this feature is interesting, but I'm concerned about the impact that iterating over each array of an item might have on performance. Have you tested this patch by running some Scrapy jobs with complex or larger items to see if they perform well compared to the basic version?

I haven't run any test jobs, but this will definitely impact performance. For example, if every item in a job has one field that is a list of m objects and the job scraped n items, the overall complexity becomes O(mn), whereas previously it was just O(n). If m is big enough, the impact will definitely be felt. Of course, this gets worse if the objects in the topmost list themselves contain lists (though that can be avoided).

All the ways I can think of doing this (using a Counter, for example) would still require the entire list to be traversed for each yielded item. Do you have any implementation suggestions that might mitigate this issue?
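To illustrate why a `Counter` doesn't avoid the cost: even with incremental aggregation, every element of each yielded item's lists must still be visited to learn which nested fields it carries. A minimal sketch (illustrative names, one nesting level, not the PR's actual code):

```python
from collections import Counter

def update_coverage(stats, item):
    """Incrementally merge one item's field counts into running stats.
    The per-item cost is still proportional to the total number of
    elements in the item's lists -- the O(m) factor discussed above."""
    for field, value in item.items():
        stats[field] += 1
        if isinstance(value, list):
            # There is no way to skip this walk: each element may or
            # may not carry any given nested field.
            for element in value:
                if isinstance(element, dict):
                    for nested in element:
                        stats[f"{field}/{nested}"] += 1

stats = Counter()
update_coverage(stats, {"title": "A", "offers": [{"price": 1}, {"price": 2}, {}]})
# stats: title=1, offers=1, offers/price=2
```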

Maybe we can have it as a setting and inform the user of the possible performance impact?

mrwbarg commented 1 year ago

Updated it: the feature is now enabled through a setting, and the number of coverage nesting levels can be configured.
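Since the feature is opt-in, enabling it from a project's `settings.py` might look like the sketch below. The list-coverage setting name here is an assumption for illustration; check the documentation added in this PR for the actual name and accepted values.

```python
# settings.py (sketch)
SPIDERMON_ENABLED = True
SPIDERMON_ADD_FIELD_COVERAGE = True
# Hypothetical name: opt in to coverage of fields inside lists of
# objects, capping how many nesting levels are traversed to bound
# the performance cost discussed in this thread.
SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS = 1
```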

Gallaecio commented 1 year ago

If I understand correctly, this is disabled by default. If so, we can leave it up to users to decide whether they are willing to enable this at the cost of the corresponding performance hit. Maybe mentioning the potential performance hit in the setting documentation is enough.

mrwbarg commented 1 year ago

> If I understand correctly, this is disabled by default. If so, we can leave it up to users to decide whether they are willing to enable this at the cost of the corresponding performance hit. Maybe mentioning the potential performance hit in the setting documentation is enough.

yup, you're correct

curita commented 1 year ago

Hi all! Just to make sure: is anything blocking this PR from getting approved? The docs added in this PR already include a note about the performance impact of enabling this setting, so I think that concern is covered now.