Closed · mrwbarg closed this 1 year ago
Patch coverage: 95.23% and project coverage change: +0.09% :tada:
Comparison is base (0cf783f) 76.44% compared to head (2d4489d) 76.54%.
I think this feature is interesting, but I'm concerned about the impact that iterating over each array of an item might have on performance. Have you tested this patch by running some Scrapy jobs with complex or larger items to see if they perform well compared to the basic version?
I haven't run any test jobs, but this will definitely impact performance. For example, assume every item in a job has one field that is a list of m objects, and the job scraped n items. The overall complexity would be O(mn), while previously it was just O(n). If m is big enough, the impact will definitely be felt. This gets worse if the objects in the topmost list contain further lists themselves (although that can be avoided).
All the approaches I can think of (using a Counter, for example) would still require the entire list to be traversed for each yielded item. Do you have any implementation suggestions that might mitigate this issue?
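To make the traversal cost concrete, here is a minimal, hypothetical sketch of Counter-based field counting over nested lists. The function name and path format are illustrative only, not the actual implementation in this patch:

```python
from collections import Counter


def count_fields(item, prefix="", levels=1, counter=None):
    """Count occurrences of each field path in a scraped item.

    Dict values are recursed into by key; for list values, every
    element is visited (this is the O(m) inner loop discussed above,
    giving O(mn) over n items), recursing up to `levels` nesting
    levels. Illustrative sketch, not Spidermon's actual code.
    """
    if counter is None:
        counter = Counter()
    for key, value in item.items():
        path = f"{prefix}{key}"
        counter[path] += 1
        if isinstance(value, dict) and levels > 0:
            count_fields(value, f"{path}/", levels - 1, counter)
        elif isinstance(value, list) and levels > 0:
            for element in value:  # full list traversal per yielded item
                if isinstance(element, dict):
                    count_fields(element, f"{path}/", levels - 1, counter)
    return counter


item = {"title": "x", "offers": [{"price": 1}, {"price": 2, "currency": "USD"}]}
coverage = count_fields(item, levels=2)
# coverage["offers/price"] == 2, coverage["offers/currency"] == 1
```

Even with a Counter, each list element still has to be visited once, which is why the O(mn) cost seems unavoidable for this feature.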
Maybe we can have it as a setting and inform the user of the possible performance impact?
Updated it so that it is now enabled through a setting, and the number of coverage nesting levels can be configured.
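For readers of this thread, an opt-in configuration might look like the following in a project's settings.py. The setting names below are hypothetical placeholders, not necessarily the ones this PR introduces:

```python
# settings.py — illustrative only; the list-coverage setting names
# here are placeholders, check the PR's docs for the real ones.
SPIDERMON_ENABLED = True
SPIDERMON_LIST_FIELDS_COVERAGE = True      # opt-in: has a performance cost
SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS = 2  # how deep to traverse nested lists
```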
If I understand correctly, this is disabled by default. If so, we can leave it up to users to decide whether they are willing to enable this at the cost of the corresponding performance hit. Maybe mentioning the potential performance hit in the setting documentation is enough.
yup, you're correct
Hi all! Just to make sure: is anything blocking this PR from getting approved? The docs in this PR already include a note about the performance impact of enabling this setting, so I think that concern is covered.
Closes #390.
When a field scraped by a spider is a list containing objects, there is currently no way to set coverage thresholds for those fields. This PR adds support for correctly counting and calculating coverage for such fields, both at the top level of the item and inside nested structures.