openeduhub / metalookup

Provide metadata about domains w.r.t. accessibility, licensing, ads, etc.
GNU General Public License v3.0

Incomplete, missing, or low quality `cclom:general_description` attributes #128

Open MRuecklCC opened 2 years ago

MRuecklCC commented 2 years ago

One feature requested by users is the ability to generate, for example, keywords, a title, or an assignment to the different SKOS vocabularies from a provided description (`cclom:general_description`) of the content.

Whichever approach is chosen, two problems need to be addressed to implement this.

The description needs to be made available to the service

To extract keywords from a description, the service first needs the description. Two options would be:

I think approach a) or c) are the viable options:

One downside of option c) is that we might redo work that was already done when the content was initially crawled.

The quality of the description is often insufficient

Examples:

Unsere Partnerportale finde Sie unter: https://wirlernenonline.de/portal/zukunfts-und-berufsorientierung/ https://wirlernenonline.de/portal/lernen-lernen/ """ ]


- just keywords

```json
"cclom:general_description" : [
  """climate change, Kyoto protocol, Carbon footprint, global warming, environmental crisis, drought refugees, sea-level rise
"""
]
```

MRuecklCC commented 2 years ago

It would also be good to know why the `cclom:general_description` field is a list-typed attribute rather than a simple string.
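Until that is clarified, a consumer of the field probably has to accept both shapes. A minimal sketch, assuming the value arrives either as a plain string or as a list of strings; the function name `normalize_description` is hypothetical and not part of metalookup:

```python
from typing import Optional, Sequence, Union


def normalize_description(
    value: Union[str, Sequence[str], None],
) -> Optional[str]:
    """Collapse a string or list-of-strings description into one string.

    Assumption: list entries are independent text fragments that can be
    joined as separate paragraphs. Returns None if no text remains.
    """
    if value is None:
        return None
    # Treat a bare string like a one-element list.
    parts = [value] if isinstance(value, str) else list(value)
    # Collapse runs of whitespace (including newlines) inside each entry.
    cleaned = [" ".join(p.split()) for p in parts if isinstance(p, str)]
    # Drop entries that were empty or whitespace-only.
    cleaned = [p for p in cleaned if p]
    return "\n\n".join(cleaned) or None
```

Joining with a blank line is a design choice of this sketch; if the list turns out to hold alternative descriptions (e.g. per language), picking a single entry would be more appropriate.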

MRuecklCC commented 2 years ago

A couple of ideas for validating that a provided text may actually be a viable description:

These could be validated whenever a request is received, to increase the likelihood that the generated keywords, title, or classification actually make sense. In the long term, this could indirectly improve the quality of the description fields of already indexed content, i.e. these constraints could be "propagated upstream" into the WLO metadata profile.
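A rough sketch of what such a request-time check might look like. The heuristics, the threshold `MIN_WORDS`, and the function name `description_problems` are all assumptions made for illustration; they are motivated by the two example problems in this issue (a keyword-only description and one consisting mostly of links) and are not a list of agreed constraints:

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+")
MIN_WORDS = 10  # assumed threshold, would need tuning on real data


def description_problems(text: str) -> List[str]:
    """Return human-readable reasons why `text` may not be a usable description."""
    problems = []
    without_urls = URL_RE.sub(" ", text)

    # Heuristic 1: too few words once URLs are ignored.
    words = re.findall(r"\w+", without_urls)
    if len(words) < MIN_WORDS:
        problems.append("too short")

    # Heuristic 2: several comma-separated segments and no sentence punctuation.
    segments = [s for s in without_urls.split(",") if s.strip()]
    if len(segments) >= 4 and not re.search(r"[.!?]", without_urls):
        problems.append("looks like a keyword list")

    # Heuristic 3: more than half of the characters belong to URLs.
    stripped = text.strip()
    url_chars = sum(len(u) for u in URL_RE.findall(stripped))
    if stripped and url_chars / len(stripped) > 0.5:
        problems.append("mostly URLs")

    return problems
```

An empty result means none of the heuristics fired; the service could return the list of problems alongside (or instead of) generated suggestions so that the caller sees why the input was considered weak.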

MRuecklCC commented 2 years ago

As can be seen from #131, today's state-of-the-art NLP models are quite capable of generating adequate titles from reasonable inputs.

RobertMeissner commented 2 years ago

The key point here - which tripped me up - is that the generated description is a suggestion to the editors, so there is no need to consider their free-text descriptions.

MRuecklCC commented 1 year ago

See also #146