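Feature description: edu-sharing "source template" metadata properties whitelist ("Quellen-Datensatz" templates for inherited data)
Edu-Sharing v8.1+ provides a new API endpoint to query whitelisted metadata properties that should be "mixed into" the collected metadata during scraping of individual items. There are several requirements, so here's the gist of the program flow: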
- Within the specified edu-sharing repository, a "Quellen-Datensatz" (crawler source dataset) must exist for the crawler that is about to run.
  - E.g. if you want to mix in metadata for zum_klexikon_spider during a crawl, you need to make sure that a learning object for this source exists and meets two conditions:
    - The learning object needs to be a "Quellen-Datensatz" (identified by the edu-sharing property cclom:general_identifier, in this case: "cclom:general_identifier": "zum_klexikon_spider").
    - Once this learning object is recognized as a "Quellen-Datensatz", its metadata property ccm:oeh_crawler_data_inherit must be available and should contain the names of the properties to be whitelisted.
- The list of metadata properties found in ccm:oeh_crawler_data_inherit is used to attach the corresponding key-value pairs to BaseItem.custom early in the processing pipeline.
  - If the crawler scrapes more suitable metadata, the "mixed in" (whitelisted) value is overwritten by the more precise, individually scraped value.
  - The whitelisted metadata properties therefore act as a fallback for items where no precise metadata can be scraped; otherwise they don't interfere with the normal crawling process.
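The fallback behavior described above can be illustrated with a minimal sketch. It is not the code from this PR: the function names, the "example:" property names, the list-shaped values, and the rule that empty scraped values never overwrite a fallback are all assumptions made for illustration, and the actual whitelist filtering may well happen server-side in the new API endpoint.

```python
from typing import Any

# Property that lists the names of the whitelisted properties (named in the text above).
INHERIT_KEY = "ccm:oeh_crawler_data_inherit"


def select_whitelisted(source_dataset: dict[str, Any]) -> dict[str, Any]:
    """Keep only the properties named in the whitelist of the source dataset."""
    allowed = source_dataset.get(INHERIT_KEY) or []
    return {name: source_dataset[name] for name in allowed if name in source_dataset}


def merge_with_scraped(whitelisted: dict[str, Any], scraped: dict[str, Any]) -> dict[str, Any]:
    """Use whitelisted values as a fallback; non-empty scraped values win."""
    merged = dict(whitelisted)  # start from the fallback values
    for key, value in scraped.items():
        # Assumption: an empty scraped value does not overwrite a fallback value.
        if value not in (None, "", [], {}):
            merged[key] = value
    return merged


# Hypothetical property names and values, for illustration only.
source_dataset = {
    "cclom:general_identifier": ["zum_klexikon_spider"],
    INHERIT_KEY: ["example:license", "example:audience"],
    "example:license": ["CC_BY_SA"],
    "example:audience": ["learner"],
    "example:internal_note": ["not inherited"],
}
scraped_item = {"example:license": ["CC_BY"], "example:audience": []}

fallback = select_whitelisted(source_dataset)
custom = merge_with_scraped(fallback, scraped_item)
print(custom)
```

In this sketch the scraped license replaces the inherited one, the empty scraped audience falls back to the inherited value, and the property that is not whitelisted never reaches the item.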
"Quellen-Datensätze" (roughly translated to "crawler source datasets") are managed by editors with metadata expertise and kept up to date by humans. The metadata of a "Quellen-Datensatz" is typically much larger than the desired amount of metadata properties, which should be whitelisted and attached to each individual item, which is why only a subset (as defined in ccm:oeh_crawler_data_inherit) of properties are "mixed in" to individually scraped items.
How-To: .env-settings explained
Since not every edu-sharing instance might have a "Quellen-Datensatz" and the required ccm:oeh_crawler_data_inherit property available during runtime of a crawler, this feature is disabled by default and should only be enabled on a per-spider basis! You can control the behavior of this feature with two environment variables:
- EDU_SHARING_SOURCE_TEMPLATE_ENABLED (optional, bool):
  - By default, this .env-setting is not active and commented out (this equals EDU_SHARING_SOURCE_TEMPLATE_ENABLED=False).
  - If you choose to explicitly enable this feature (EDU_SHARING_SOURCE_TEMPLATE_ENABLED=True):
    - the crawler pipeline checks whether a "Quellen-Datensatz"-Template is available within your specified edu-sharing repository
    - if retrieval of the whitelisted metadata properties fails for whatever reason, it raises an error and aborts the crawling process
    - (This behavior was chosen intentionally, so that you cannot start a crawl with the "source template" feature enabled and accidentally fail to attach the whitelisted properties.)
- EDU_SHARING_SOURCE_TEMPLATE_BASE_URL (optional, str):
  - This optional setting is mainly intended for cases where the edu-sharing repository that holds the "Quellen-Datensatz"-Template differs from the edu-sharing repository you're saving the individual items to.
    - E.g. during debugging, if you want to query the "Staging" environment for a "Quellen-Datensatz"-Template, but the individually scraped items are saved to the "pre-Staging" dev environment.
  - If you don't explicitly set a value for this variable (= an edu-sharing repository URL), the EduSharingSourceTemplateHelper class will try to fall back to your edu-sharing repository as specified in EDU_SHARING_BASE_URL and use that one instead.
  - If neither EDU_SHARING_SOURCE_TEMPLATE_BASE_URL nor EDU_SHARING_BASE_URL contains a valid setting, this feature cannot work properly and will raise an error!
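The interplay of the two settings can be sketched as follows. This is a hedged illustration, not the logic of EduSharingSourceTemplateHelper itself: the function names, the set of strings accepted as "true", and the choice to treat any non-empty string as a "valid" URL are assumptions; only the three environment variable names come from the description above.

```python
from typing import Mapping, Optional

# Assumption: these spellings count as "enabled"; the real parser may differ.
TRUE_VALUES = {"1", "true", "yes", "on"}


def is_source_template_enabled(env: Mapping[str, str]) -> bool:
    """The feature is off unless the flag is explicitly set to a true value."""
    raw = env.get("EDU_SHARING_SOURCE_TEMPLATE_ENABLED", "False")
    return raw.strip().lower() in TRUE_VALUES


def resolve_source_template_base_url(env: Mapping[str, str]) -> Optional[str]:
    """Return the repository URL to query, or None if the feature is disabled.

    Raises ValueError when the feature is enabled but no URL is configured,
    so that a crawl cannot start without its whitelisted properties.
    """
    if not is_source_template_enabled(env):
        return None
    # The dedicated setting wins; EDU_SHARING_BASE_URL is the fallback.
    for name in ("EDU_SHARING_SOURCE_TEMPLATE_BASE_URL", "EDU_SHARING_BASE_URL"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise ValueError(
        "source template feature is enabled, but neither "
        "EDU_SHARING_SOURCE_TEMPLATE_BASE_URL nor EDU_SHARING_BASE_URL is set"
    )
```

Passing a mapping instead of reading os.environ directly keeps the sketch easy to test; in a crawler run you would call resolve_source_template_base_url(os.environ) after the .env file has been loaded.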
Addendum: Glossary
Since domain knowledge is required and the German terminology is hard to grasp in translations to English, I've attached a small lookup table to minimize possible (future) confusion during maintenance:
| Term (German) | Roughly translated to | Explanatory note |
|---|---|---|
| "Lernobjekt" | "learning object" | a singular item within the edu-sharing repository |
| "Quelle" | "source" | a source of to-be-scraped items; a learning object which contains metadata about a source of items (required: ccm:oeh_lrt needs to be a (sub-)type of "Quelle"!) |
| "Quellen-Datensatz" | "source dataset" | a singular (!) learning object within the specified edu-sharing repository that holds precise metadata about a "source" (= "Quelle"), which is linked to a crawler via cclom:general_identifier |
| "Quellen-Datensatz"-Template | "source template" | the whitelisted metadata properties as specified in ccm:oeh_crawler_data_inherit |
This PR includes the following highlights:
- EduSharingSourceTemplateHelper (utility class)
- when Playwright is used (see: converter/web_tools.py) to crawl a website, wait until the load event is fired by the website (instead of DOMContentLoaded) before retrieving the HTML and taking a screenshot
- converter/.env.example with documentation regarding the newly implemented .env-settings (see "How-To: .env-settings explained")