Closed — @PyExplorer closed this pull request 1 year ago
Merging #359 (2d64c78) into master (6af58c2) will decrease coverage by 0.07%. The diff coverage is 60.00%.
```diff
@@            Coverage Diff             @@
##           master     #359      +/-   ##
==========================================
- Coverage   74.03%   73.96%   -0.08%
==========================================
  Files          72       72
  Lines        3096     3103       +7
  Branches      483      485       +2
==========================================
+ Hits         2292     2295       +3
- Misses        737      739       +2
- Partials       67       69       +2
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| spidermon/contrib/scrapy/extensions.py | 82.35% <60.00%> (-2.92%) | :arrow_down: |
I have mixed feelings here. On one hand, I understand this may be a useful feature to reduce the overall noise in stats. On the other hand, it looks like we are adding a pretty specific feature just to overcome a limitation of Scrapy Cloud. An alternative solution would be to override the stats export for Scrapy Cloud so that it ignores the undesired fields from the spider.
I would like to hear more opinions about this. cc. @Gallaecio @kmike @rennerocha @wRAR
Adding stats only for the fields that have a field coverage rule is not ideal. It won't prevent you from hitting the Scrapy Cloud limitation if your spider adds more stats unrelated to field coverage (I had a project like that when I worked at Zyte). It would be just a palliative solution, as we can't guarantee that we will stay under 65536 bytes of stats merely by removing the field coverage stats.
I know that this is annoying, but this PR is not a solution for a Scrapy or Spidermon limitation; it is a solution for a third-party application's limitation (we may be running Scrapy spiders anywhere, without that limitation). So, in my opinion, I agree with Victor that this feature is too specific just to overcome a problem in Scrapy Cloud.
Some alternative solutions I can see:
My preference - Every time we start a job in Scrapy Cloud, the default stats collector (`STATS_CLASS`) is overridden with `sh_scrapy.stats.HubStorageStatsCollector`. You could modify this stats collector so that it does not persist the field-coverage-related stats to Scrapy Cloud. That way the feature would live in a project that is 100% specific to people running spiders in Scrapy Cloud. It would also require the least code change in Spidermon.
Remove the dependency on the `SPIDERMON_ADD_FIELD_COVERAGE` setting from the `FieldCoverageMonitor` monitor. This would require changing how we keep the items returned, what we store in the stats, and the data used by the monitor.
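The first alternative could be sketched roughly as below. This is a hypothetical illustration, not the real `sh_scrapy` API: the helper name is made up, and only the `spidermon_field_coverage/` key prefix comes from Spidermon's documented stats format.

```python
# Hypothetical sketch: strip field coverage entries from the stats dict
# before a Scrapy Cloud stats collector persists it. The function name is
# an assumption; the "spidermon_field_coverage/" prefix is the one Spidermon
# uses for its per-field coverage stats.
FIELD_COVERAGE_PREFIX = "spidermon_field_coverage/"

def drop_field_coverage_stats(stats):
    """Return a copy of the stats dict without field coverage entries."""
    return {
        key: value
        for key, value in stats.items()
        if not key.startswith(FIELD_COVERAGE_PREFIX)
    }

stats = {
    "item_scraped_count": 120,
    "spidermon_field_coverage/dict/title": 1.0,
    "spidermon_field_coverage/dict/price": 0.8,
}
print(drop_field_coverage_stats(stats))  # only item_scraped_count remains
```

A real implementation would apply this filtering inside a `HubStorageStatsCollector` subclass configured via `STATS_CLASS`, keeping the change entirely outside Spidermon.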
@PyExplorer just a nitpick: it would make it easier for me to review the code if you created the PR from a fork in your personal account (not from the main repository) :-)
If we forget about Scrapy Cloud altogether, I think it may be a valid request to want control over which fields Spidermon tracks. From that point of view, I am not against going this route.
However, for the stated goal, I agree with @rennerocha's preferred solution: modifying the special `STATS_CLASS` that Scrapy Cloud uses to report stats may be the best path. We could extend the upstream class to support, for example, a list of regular expressions for stats to exclude; it should not be too difficult to implement, and it would be flexible enough to also cover other instances of the issue unrelated to Spidermon.
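The regular-expression idea could look like the following. The `filter_stats` helper is hypothetical, shown only to gauge the implementation effort; it is not part of any existing API.

```python
import re

def filter_stats(stats, exclude_patterns):
    """Drop every stat whose key matches any of the exclude patterns.

    Hypothetical helper: a Scrapy Cloud stats collector subclass could call
    this right before uploading stats, with the patterns read from a setting.
    """
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    return {
        key: value
        for key, value in stats.items()
        if not any(regex.search(key) for regex in compiled)
    }

stats = {
    "item_scraped_count": 120,
    "downloader/request_count": 130,
    "spidermon_field_coverage/dict/title": 1.0,
}
filtered = filter_stats(stats, [r"^spidermon_field_coverage/"])
```

Because the patterns are arbitrary, the same mechanism would also let projects exclude verbose stats that have nothing to do with Spidermon.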
Just removing these stats for this specific monitor will not solve the problem. We don't know whether other monitors, custom stats, or logs added by spider developers will push the stats past the Scrapy Cloud limit anyway. Also, this is not a problem with Spidermon's use of Scrapy's stats collection feature, but a limitation of a third-party system.
Considering that there is a solution that requires a change in the Scrapy Cloud stats collector only, and that, as @Gallaecio mentioned, it would open room for more fine-grained control there, I will close this PR.
In some cases, we have a huge number of fields in one item (>500). Enabling `SPIDERMON_ADD_FIELD_COVERAGE` adds a record for every field to the stats, which raises errors like "Value exceeds max encoded size of 65536 bytes" in the Scrapy Cloud dashboard.
The idea is to limit the records in the stats to only the fields that we actually want to monitor via `SPIDERMON_FIELD_COVERAGE_RULES`.
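For context, the rules use Spidermon's documented `"<item type>/<field name>"` key format; the field names and thresholds below are illustrative:

```python
# settings.py (illustrative values)
SPIDERMON_ADD_FIELD_COVERAGE = True

# With this PR's idea, only the two fields listed here would get
# "spidermon_field_coverage/..." entries in the stats, instead of one
# entry per field of a 500-field item.
SPIDERMON_FIELD_COVERAGE_RULES = {
    "dict/title": 1.0,   # "title" must be present in 100% of items
    "dict/price": 0.8,   # "price" must be present in at least 80% of items
}
```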