Incomplete, missing, or low quality `cclom:general_description` attributes

MRuecklCC commented 2 years ago

One feature requested by users is the ability to generate, for example, keywords, a title, or an assignment to the different SKOS vocabularies from a provided description (cclom:general_description) of the content.

No matter the chosen approach, two problems need to be addressed to implement that.

The description needs to be made available to the service

To extract keywords from a description, the service first needs the description. Three options would be:

a) the description is transmitted together with the request for extraction
b) the description is queried from the source system (edusharing or elastic) using e.g. the URL as key to identify the material.
c) the description is obtained from the URL, i.e. scraped from the actual content.

I think approaches a) and c) are the viable options:

they allow the service to be used for arbitrary descriptions (they don't have to be in the source system)
requirements on the descriptions can be placed, checked, and documented with the extraction endpoint.

One downside of option c) is that we might redo work that was already done when the content was initially crawled.
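
A minimal sketch of how options a) and c) could be combined in one request: the caller may send a description directly, and the service falls back to the URL only when none is given. `ExtractionRequest`, `resolve_description`, and the injected `scrape` callable are hypothetical names for illustration, not part of the existing service.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractionRequest:
    """Hypothetical request payload; field names are illustrative only."""

    url: Optional[str] = None
    description: Optional[str] = None


def resolve_description(
    request: ExtractionRequest,
    scrape: Callable[[str], str],
) -> str:
    """Return the text to extract from: option a) first, option c) as fallback."""
    # Option a): the caller transmitted the description with the request.
    if request.description is not None and request.description.strip():
        return request.description.strip()
    # Option c): obtain the description from the content behind the URL.
    # `scrape` is injected so this sketch assumes nothing about the crawler.
    if request.url:
        return scrape(request.url)
    raise ValueError("request needs either a description or a URL")
```

Passing the scraper in as a callable keeps the fallback testable without network access and leaves open whether previously crawled content is reused.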

The quality of the description is often insufficient

Many materials do not have a cclom:general_description at all, or it just contains an empty string.
Often, the description is a cut-off version of the full text of the content and, e.g., ends mid-sentence or with a ...
Sometimes the description just contains a list of keywords.
Sometimes the description contains references to figures, or just a list of links
Sometimes the description is related to the top-level domain (TLD) from where the content was crawled and not the actual resource within that TLD.
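
Some of the problems listed above can be flagged with simple heuristics. The following is a rough sketch only: the function name `description_issues`, the issue labels, and the thresholds are assumptions, and it does not cover junk strings or descriptions that refer to the wrong resource.

```python
import re
from typing import List, Optional

URL_RE = re.compile(r"https?://\S+")


def description_issues(values: Optional[List[str]]) -> List[str]:
    """Return heuristic issue labels for a list-typed description field."""
    text = " ".join(values or []).strip()
    if not text:
        return ["missing_or_empty"]

    issues = []
    # Cut-off full text typically ends with an ellipsis.
    if text.endswith(("…", "...")):
        issues.append("truncated")
    # Script or markup residue such as "var phet_source = ..." or HTML tags.
    if re.search(r"\bvar\s+\w+\s*=|<\w+[^>]*>", text):
        issues.append("markup_or_script")
    if URL_RE.search(text):
        issues.append("contains_links")
    # Keyword lists: several short comma-separated items and no sentence end.
    without_urls = URL_RE.sub("", text)
    items = [item.strip() for item in without_urls.split(",") if item.strip()]
    if (
        len(items) >= 4  # threshold is an assumption
        and all(len(item.split()) <= 3 for item in items)
        and not re.search(r"[.!?]", without_urls)
    ):
        issues.append("keyword_list")
    return issues
```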

Examples:

problematic formatting / incomplete sentences

"cclom:general_description" : [
          """Eine sehr umfangreiche Simulation zum Treibhauseffekt bietet das PhET-Projekt der University of Colorado.

Neben der Einstellmöglichkeit für die Konzentration der Treibhausgase, kann man auch noch Wolken einblenden und deren Einfluss auf die Temperatur an der Erdoberfläche studieren. Außer…"""
        ]

mixed with html / script blocks

"cclom:general_description" : [
          """var phet_source ='https://phet.colorado.edu/sims/html/capacitor-lab-basics/latest/capacitor-lab-basics_de.html '

  Abb. 1
Erkunde, wie ein Kondensator funktioniert…"""
        ]
      }

obvious junk content

"cclom:general_description" : [
          "wqqeqwe q"
        ]

mix of actual content and link list


"cclom:general_description" : [
          """Wichtige Aufgaben von Lehrkräften und ganz besonders von Eltern sind, unsere Kinder das „Lernen zu lehren“, sie über geeignete Projekte in die Umsetzung und Anwendung des Gelernten zu bringen, Ihr Selbstvertrauen durch Erfolg zu stärken, Sie neugierig zu halten, um sie letztendlich bei der Findung Ihrer persönliche Berufung und Neigungen zu coachen?

Unsere Partnerportale finde Sie unter: https://wirlernenonline.de/portal/zukunfts-und-berufsorientierung/ https://wirlernenonline.de/portal/lernen-lernen/ """ ]


just keywords

"cclom:general_description" : [
              """climate change, Kyoto protocol, Carbon footprint, global warming, environmental crisis, drought refugees, sea-level rise
"""
            ]

MRuecklCC commented 2 years ago

It would also be good to know why the cclom:general_description field is a list-typed attribute and not a simple string.

MRuecklCC commented 2 years ago

A couple of ideas for validating that a provided text may actually be a viable description:

minimum and maximum character, word, and sentence count constraints
no trailing or leading whitespace
no multiple newlines or chained spaces within the text.
reasonable punctuation
possibly, a first step could be to use an ML model to detect whether the received text actually is a description of some sort.

These could be validated whenever a request is received, to increase the likelihood that the generated keywords, title, or classification actually make sense. In the long term, this could indirectly improve the quality of the description fields of already indexed content, i.e. these constraints could be "propagated upstream" into the WLO metadata profile.
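
A sketch of what such a request-time check could look like, covering a subset of the ideas above. The function name `validate_description`, the default limits, and the problem messages are assumptions chosen for illustration; suitable thresholds would need to be agreed on, and the ML-based check is left out.

```python
import re
from typing import List


def validate_description(
    text: str,
    min_chars: int = 50,
    max_chars: int = 2000,
    min_words: int = 10,
    min_sentences: int = 1,
) -> List[str]:
    """Return the violated constraints; an empty list means the text passed."""
    problems = []
    if text != text.strip():
        problems.append("leading or trailing whitespace")
    if re.search(r"\n{2,}| {2,}", text):
        problems.append("multiple newlines or chained spaces")
    if not min_chars <= len(text) <= max_chars:
        problems.append("character count out of range")
    if len(text.split()) < min_words:
        problems.append("too few words")
    # Crude sentence count: terminal punctuation followed by whitespace or end.
    sentences = re.findall(r"[.!?]+(?=\s|$)", text)
    if len(sentences) < min_sentences:
        problems.append("too few sentences")
    if not text.rstrip().endswith((".", "!", "?")):
        problems.append("does not end with sentence punctuation")
    return problems
```

Returning the list of violated constraints, instead of a boolean, would let the endpoint report back why a description was rejected, which fits the idea of documenting the requirements with the extraction endpoint.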

MRuecklCC commented 2 years ago

As can be seen from #131, today's state-of-the-art NLP models are quite capable of generating adequate titles from reasonable inputs.

RobertMeissner commented 2 years ago

The key point here, which tripped me up, is that the created description is a suggestion to the editors, so there is no need to consider their free-text descriptions.

MRuecklCC commented 1 year ago

openeduhub / metalookup

Incomplete, missing, or low quality `cclom:general_description` attributes #128
