scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add new feature to limit number of records in stats for SPIDERMON_ADD_FIELD_COVERAGE setting. #359

Closed. PyExplorer closed this pull request 1 year ago.

PyExplorer commented 1 year ago

In some cases, we have a huge number of fields in one item (>500). Enabling SPIDERMON_ADD_FIELD_COVERAGE adds a record for every field in the stats, and this raises errors like "Value exceeds max encoded size of 65536 bytes" in the Scrapy Cloud dashboard.

The idea is to limit the records in the stats to only the fields we actually want to monitor, i.e. those listed in SPIDERMON_FIELD_COVERAGE_RULES.
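A minimal sketch of the proposed behavior, written as a standalone helper rather than the actual extension code: keep only the field-coverage stat entries whose field appears in SPIDERMON_FIELD_COVERAGE_RULES. The stat key format (`spidermon_field_coverage/<item_type>/<field>`) follows the Spidermon docs; the helper name is hypothetical.

```python
def limit_field_coverage_stats(stats, rules):
    """Drop field-coverage stats for fields without a coverage rule.

    `rules` is assumed to look like SPIDERMON_FIELD_COVERAGE_RULES,
    e.g. {"dict/field1": 0.9}; non-coverage stats pass through untouched.
    """
    prefix = "spidermon_field_coverage/"
    monitored = set(rules)
    return {
        key: value
        for key, value in stats.items()
        if not key.startswith(prefix) or key[len(prefix):] in monitored
    }


# Illustrative stats as a spider with many fields might produce them.
stats = {
    "item_scraped_count": 10,
    "spidermon_field_coverage/dict/field1": 1.0,
    "spidermon_field_coverage/dict/field2": 0.5,
    "spidermon_field_coverage/dict/noise": 0.1,
}
rules = {"dict/field1": 0.9, "dict/field2": 0.4}
limited = limit_field_coverage_stats(stats, rules)
# Coverage for "dict/noise" is dropped; everything else is kept.
```

With hundreds of unmonitored fields, this keeps the exported stats payload well under the 65536-byte limit mentioned above.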

codecov[bot] commented 1 year ago

Codecov Report

Merging #359 (2d64c78) into master (6af58c2) will decrease coverage by 0.07%. The diff coverage is 60.00%.

@@            Coverage Diff             @@
##           master     #359      +/-   ##
==========================================
- Coverage   74.03%   73.96%   -0.08%     
==========================================
  Files          72       72              
  Lines        3096     3103       +7     
  Branches      483      485       +2     
==========================================
+ Hits         2292     2295       +3     
- Misses        737      739       +2     
- Partials       67       69       +2     
Impacted Files Coverage Δ
spidermon/contrib/scrapy/extensions.py 82.35% <60.00%> (-2.92%) :arrow_down:


VMRuiz commented 1 year ago

I have mixed feelings here. On one hand, I understand this may be a useful feature to reduce overall noise in the stats. On the other hand, it looks like we are adding a pretty specific feature just to overcome a limitation of Scrapy Cloud. An alternative solution would be to override the stats export for Scrapy Cloud to drop the unwanted fields from the spider.

I would like to hear more opinions about this. cc @Gallaecio @kmike @rennerocha @wRAR

rennerocha commented 1 year ago

Adding stats only for the fields that have a field coverage rule is not ideal. It won't prevent the Scrapy Cloud limitation problem if your spider adds more stats unrelated to field coverage (I had a project like that when I worked at Zyte). It would be just a palliative solution, as we can't guarantee we will stay under 65536 bytes of stats simply by removing the field coverage stats.

I know that this is annoying, but this PR is not a solution for a Scrapy or Spidermon limitation; it is a workaround for a third-party application limitation (we may be running Scrapy spiders anywhere, without that limitation). So, in my opinion, I agree with Victor that this is too specific a feature just to overcome a problem in Scrapy Cloud.

Some alternative solutions I can see:

rennerocha commented 1 year ago

@PyExplorer just a nitpick: it would make it easier for me to review the code if you created the PR from a fork in your personal account (not from the main repository) :-)

Gallaecio commented 1 year ago

If we forget about Scrapy Cloud altogether, I think it may be a valid request to want control over which fields Spidermon tracks. From that point of view, I am not against going this route.

However, for the stated goal, I agree with @rennerocha's preferred solution: modifying the special STATS_CLASS that Scrapy Cloud uses to communicate stats to Scrapy Cloud may be the best path. We could extend the upstream class to support, for example, a list of regular expressions for stats to exclude; it should not be too difficult to implement, and it would be flexible enough to also cover other instances of the issue unrelated to Spidermon.
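The regex-based exclusion described above could be sketched as follows. In practice, this filtering would live in a STATS_CLASS subclass (for Scrapy Cloud, the collector shipped with the Scrapy Cloud entry point), applied right before the stats are exported; here it is shown as a plain function, and the setting name is hypothetical.

```python
import re

# Hypothetical setting: regex patterns for stat keys to drop at export time.
EXCLUDE_STATS_PATTERNS = [r"^spidermon_field_coverage/"]


def exclude_stats(stats, patterns):
    """Return a copy of `stats` without keys matching any exclusion pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return {
        key: value
        for key, value in stats.items()
        if not any(regex.search(key) for regex in compiled)
    }


stats = {
    "item_scraped_count": 10,
    "spidermon_field_coverage/dict/field1": 1.0,
}
exported = exclude_stats(stats, EXCLUDE_STATS_PATTERNS)
# Only "item_scraped_count" survives the export filter.
```

Because the patterns are arbitrary, the same mechanism would also handle oversized stats coming from other monitors or custom spider code, not just Spidermon's field coverage.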

rennerocha commented 1 year ago

Just removing these stats for this specific monitor will not solve the problem: we have no guarantee that other monitors, custom stats, or log entries added by spider developers won't push the stats past the Scrapy Cloud limit. Also, this is not a problem with Spidermon's use of Scrapy's stats collection feature, but a limitation of a third-party system.

Considering that there is a solution that would require a change only in the Scrapy Cloud stats collector, and that, as @Gallaecio mentioned, it would open room for more fine-grained control there, I will close this PR.