openeduhub / oeh-search-etl

The Backend includes all data for the ETL process (Scrapy, Postgres, Elasticsearch)

feat: attach whitelisted edu-sharing "source template" metadata properties to scraped items ("Quellen-Datensatz"-Template) #99

Closed Criamos closed 9 months ago

Criamos commented 9 months ago

This PR includes the following highlights:

Feature description: edu-sharing "source template" metadata properties whitelist ("Quellen-Datensatz" templates for inherited data)

edu-sharing v8.1+ provides a new API endpoint for querying the whitelisted metadata properties that should be "mixed into" the metadata collected while scraping individual items. There are several requirements, so here's the gist of the program flow:

"Quellen-Datensätze" (roughly translated: "crawler source datasets") are managed by editors with metadata expertise and kept up to date by hand. A "Quellen-Datensatz" typically holds far more metadata than should be attached to each individual item. For that reason, only the whitelisted subset of its properties (as defined in ccm:oeh_crawler_data_inherit) is "mixed in" to individually scraped items.
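
To illustrate the "mix in" idea, here is a minimal sketch and not the code from this PR. It assumes the source dataset's properties arrive as a plain dict, that ccm:oeh_crawler_data_inherit holds a list of property keys, and that values already scraped for an item take precedence over inherited ones. The function name and all `example:` property keys are hypothetical.

```python
from typing import Any, Mapping

# Property that lists which keys of the source dataset may be inherited.
WHITELIST_PROPERTY = "ccm:oeh_crawler_data_inherit"


def mix_in_source_template(
    item_properties: Mapping[str, Any],
    source_properties: Mapping[str, Any],
) -> dict[str, Any]:
    """Return a copy of the item metadata with whitelisted source properties attached."""
    # Assumption: the whitelist is a list of property keys; a missing whitelist means "inherit nothing".
    whitelist = source_properties.get(WHITELIST_PROPERTY) or []
    merged = dict(item_properties)
    for key in whitelist:
        # Assumption: values scraped for the item itself are never overwritten.
        if key in source_properties and key not in merged:
            merged[key] = source_properties[key]
    return merged


# Hypothetical source dataset: three properties, only two of them whitelisted.
source = {
    WHITELIST_PROPERTY: ["example:license", "example:audience"],
    "example:license": ["CC_BY"],
    "example:audience": ["teacher"],
    "example:internal_note": ["editor-only remark"],
}
# Hypothetical scraped item that already carries its own audience value.
item = {"example:title": ["Photosynthesis"], "example:audience": ["learner"]}

merged = mix_in_source_template(item, source)
```

In this sketch `merged` gains "example:license" from the source dataset, keeps the item's own "example:audience", and never receives "example:internal_note" because that key is not whitelisted.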

How-To: .env-settings explained

Since not every edu-sharing instance might have a "Quellen-Datensatz" and the required ccm:oeh_crawler_data_inherit property available during runtime of a crawler, this feature is disabled by default and should only be enabled on a per-spider basis! You can control the behavior of this feature with two environment variables:
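
The "disabled by default" behavior could be sketched as follows. This is only an illustration: the variable name `EXAMPLE_SOURCE_TEMPLATE_ENABLED` is a placeholder and not one of the actual settings, and the accepted truthy spellings are an assumption.

```python
import os
from typing import Mapping

# Spellings treated as "on"; anything else, including an unset variable, means "off".
TRUTHY_VALUES = {"1", "true", "yes", "on"}


def is_feature_enabled(env_var: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Return True only if the given environment variable is explicitly set to a truthy value."""
    raw_value = environ.get(env_var, "")
    return raw_value.strip().lower() in TRUTHY_VALUES


# Placeholder name for illustration only; substitute the real setting.
if is_feature_enabled("EXAMPLE_SOURCE_TEMPLATE_ENABLED"):
    print("source template properties will be attached to scraped items")
else:
    print("source template feature is off (default)")
```

Treating every unset or unrecognized value as "off" keeps the feature opt-in, which fits the recommendation to enable it per spider only.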

Addendum: Glossary

Since domain knowledge is required and the German terminology is hard to grasp in translations to English, I've attached a small lookup table to minimize possible (future) confusion during maintenance:

| Begriff (German) | roughly translated to | Explanatory Note |
| --- | --- | --- |
| "Lernobjekt" | "learning object" | a single item within the edu-sharing repository |
| "Quelle" | "source" | a source of to-be-scraped items; in edu-sharing, a learning object that contains metadata about such a source (required: ccm:oeh_lrt needs to be a (sub-)type of "Quelle"!) |
| "Quellen-Datensatz" | "source dataset" | a single (!) learning object within the specified edu-sharing repository that holds precise metadata about a "source" (= "Quelle") and is linked to a crawler via cclom:general_identifier |
| "Quellen-Datensatz"-Template | "source template" | the whitelisted metadata properties as specified in ccm:oeh_crawler_data_inherit |