Disclaimer: I'm a Python beginner and have been using Scrapy for the past 5 months, so please bear with me since this is the first GitHub Issue I've ever written.
Problem Description
I've tried integrating spidermon into an existing codebase with ~40 crawlers that use scrapy.Items as their data model. While integrating Item Validation (both via schematics and jsonschema), I noticed that spidermon only seems to be able to "see" the first level of a scrapy.Item (class: scrapy.Item), but not the other scrapy.Item classes nested within that Item.
Source code examples
I've tried to illustrate the problem with a simplified abstraction that's close to the Scrapy Tutorial - here's my items.py:
# items.py
import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class JoinMultivalues(object):
    def __init__(self, separator=u" "):
        self.separator = separator

    def __call__(self, values):
        return values


class LicenseItem(scrapy.Item):
    description = Field()


class QuotesItem(scrapy.Item):
    text = Field()
    author = Field()
    tags = Field()
    license = Field(output_processor=JoinMultivalues())


class QuotesItemLoader(ItemLoader):
    default_item_class = QuotesItem
    default_output_processor = TakeFirst()


class LicenseItemLoader(ItemLoader):
    default_item_class = LicenseItem
    default_output_processor = TakeFirst()
The main idea is: within the QuotesItem there's a LicenseItem that should hold a license description. The QuotesItem could also contain other nested scrapy.Items, sometimes several layers deep.
This is what a yielded Item looks like in the terminal (please ignore the "raw" formatting of the license description string):
2021-09-29 14:03:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': ['Allen Saunders'],
'license': [{'description': ['[\'<p class="copyright">\\n Made with <span '
'class="sh-red">❤</span> by <a '
'href="https://scrapinghub.com">Scrapinghub</a>\\n '
"</p>']"]}],
'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans'],
'text': ['“Life is what happens to us while we are making other plans.”']}
The crawler itself uses the Basic Scrapy Monitors and does little more than scrape the Tutorial website, use the scrapy.ItemLoader class to nest one scrapy.Item within another, and yield the QuotesItem at the end.
My expectation/hope was that I'd be able to "look inside" the LicenseItem and spidermon would show me spidermon_item_scraped_count/QuotesItem/license/description. But as you can see above, spidermon stops at the depth of QuotesItem/license. Spidermon's field coverage monitor can't look inside my license-Item and therefore fails while trying to access the description-field with the following output:
2021-09-29 14:03:15 [quotes] ERROR: [Spidermon]
======================================================================
FAIL: Field Coverage Monitor/test_check_if_field_coverage_rules_are_met
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/criamos/PycharmProjects/spidermonTesting/venv/lib/python3.9/site-packages/spidermon/contrib/scrapy/monitors.py", line 275, in test_check_if_field_coverage_rules_are_met
self.assertTrue(len(failures) == 0, msg=msg)
AssertionError:
The following items did not meet field coverage rules:
QuotesItem/license/description (expected 1, got 0)
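To make the failure above easier to reason about, here is a minimal sketch of a field counter that descends into nested mappings but stops at lists. This is not spidermon's actual code: the function name `count_fields` and the sample data are hypothetical, and the "stops at lists" behaviour is an assumption based on the yielded item above, where `license` is a list containing one nested item.

```python
from collections import Counter
from collections.abc import Mapping


def count_fields(item, prefix, counts):
    """Count non-empty fields of `item` under `prefix`.

    Assumption for illustration only: the counter recurses into
    mappings but does NOT look inside lists.
    """
    for key, value in item.items():
        if value is None:
            continue
        path = f"{prefix}/{key}"
        counts[path] += 1
        if isinstance(value, Mapping):
            count_fields(value, path, counts)


# Hypothetical sample data, shaped like the item in the log above.
nested_in_list = {"text": "q", "license": [{"description": "Made with love"}]}
# Same data, but with the nested item stored directly instead of in a list.
nested_in_dict = {"text": "q", "license": {"description": "Made with love"}}

counts_list = Counter()
count_fields(nested_in_list, "QuotesItem", counts_list)

counts_dict = Counter()
count_fields(nested_in_dict, "QuotesItem", counts_dict)

print(sorted(counts_list))
print(sorted(counts_dict))
```

Under this assumption, the list-wrapped variant never produces a `QuotesItem/license/description` key, which would match the "expected 1, got 0" message, while the directly nested variant does.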
Item Validators - validators.py
(the commented-out parts were different approaches until I realized that the problem must lie elsewhere)
# validators.py
from schematics import Model
from schematics.types import *


class LicenseItemValidator(Model):
    description = StringType()


class QuoteItemValidator(Model):
    text = StringType(required=True)
    author = StringType(required=True)
    tags = ListType(StringType)
    # license = DictType(field=StringType, coerce_key=BaseType)
    # license = ModelType(model_spec=LicenseItemValidator, required=True)
    license = ListType(ModelType(LicenseItemValidator))
Expected Behaviour:
If I yield a normal python dictionary (see "Implementation via dict class" in my quotes_spider-example above), I'll get the following output:
Which works as expected. The nested dictionaries are accessible by spidermon.
Now I'm all out of ideas since the only approach I currently see as a solution is to "flatten" all the sub-Items into a big scrapy.Item-structure. This could totally be my fault and I'm simply using spidermon and schematics wrong here, but if anyone could confirm or deny if this is intended behaviour or not, it would be really appreciated. Thank you in advance for taking the time to read this wall of text (and thank you for developing scrapy / spidermon!)
@Criamos I think this is expected. As of now, spidermon does not support nested items inside a list, such as the license field in your case, so coverage rules cannot be applied to license.description.
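Since the limitation concerns nested items inside a list, one possible workaround short of flattening everything is to unwrap the single-element list before the item reaches the monitors. The helper below is a hedged sketch: `unwrap_nested_field` is a hypothetical name, it works on a plain-dict copy of the item, and it has not been verified against any spidermon version.

```python
from collections.abc import Mapping


def unwrap_nested_field(item, field):
    """Return a dict copy of `item` in which `field` holds the nested
    mapping directly if it was wrapped in a one-element list.

    Hypothetical helper for illustration; all other fields are copied
    unchanged and the original item is not modified.
    """
    result = dict(item)
    value = result.get(field)
    if (
        isinstance(value, list)
        and len(value) == 1
        and isinstance(value[0], Mapping)
    ):
        result[field] = dict(value[0])
    return result


# Hypothetical sample data, shaped like the yielded item in this issue.
item = {
    "text": "q",
    "tags": ["fate"],
    "license": [{"description": "Made with love"}],
}
unwrapped = unwrap_nested_field(item, "license")
print(unwrapped["license"])
```

If this were applied in an item pipeline, the validator would need a matching change, for example the `ModelType` variant instead of `ListType(ModelType(...))`. Note that copying to a plain dict changes the item's type, which may affect how coverage keys are named.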