Align frontend data model with Metadata names of edusharing service.

MRuecklCC commented 2 years ago

Currently, the data model uses its own names for the different extractors. Eventually, we want to align the extractor names with the edusharing naming conventions.

MRuecklCC commented 2 years ago

As part of this it may also make sense to simplify the API input and output models. After some discussion with @RMeissnerCC we decided to:

remove the whitelist feature
slightly un-nest the input model
turn the individual metadata fields into Union[Result, Error], to avoid an "all-or-nothing" response. I.e. if a single extractor fails, the response will contain an error message for that extractor, but the other extractor results will be present.
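
A minimal sketch of the per-field `Union[Result, Error]` idea, not the actual implementation: the `Result` and `Error` classes, the `run_extractors` helper, and the two example extractors are hypothetical names chosen for illustration. It shows how one failing extractor yields an error entry while the other results are still returned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Result:
    value: object


@dataclass
class Error:
    message: str


def run_extractors(
    extractors: Dict[str, Callable[[str], object]], html: str
) -> Dict[str, Union[Result, Error]]:
    """Run every extractor; a failure only affects its own field."""
    response: Dict[str, Union[Result, Error]] = {}
    for name, extract in extractors.items():
        try:
            response[name] = Result(value=extract(html))
        except Exception as exc:  # isolate the failure to this extractor
            response[name] = Error(message=str(exc))
    return response


def failing(html: str) -> object:
    # Hypothetical extractor that always fails.
    raise ValueError("could not parse page")


output = run_extractors(
    {"cookies": lambda html: ["session"], "security": failing},
    "<html></html>",
)
print(output)
```

The caller can then check each field with `isinstance(entry, Error)` instead of handling a single failure for the whole response.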

MRuecklCC commented 2 years ago

The simplification of the API data model was done as part of #100.

RobertMeissner commented 2 years ago

The simplification of the API data model was done as part of #100.

Is there now anything left in this issue or can it be closed?

MRuecklCC commented 2 years ago

The main issue is still unresolved: https://issues.edu-sharing.net/jira/browse/KBMBF-475

MRuecklCC commented 2 years ago

To make some progress on this front, I spent a while going through the current metadata fields defined by the edusharing service and checking them in Elasticsearch. A couple of those fields are:

Misc attributes

ccm:oeh_accessibility_security: "IT-Sicherheit" (WIP)
ccm:accessibilitySummary: "Barrierefreiheit"
- value-space with 5 variants: http://w3id.org/openeduhub/vocabs/accessibilitySummary
- value-space: A, AA, AAA, BITV, WCAG

Quality attributes

ccm:oeh_quality_personal_law: "Persönlichkeitsrechte"
ccm:oeh_quality_protection_of_minors: "Jugendschutz"
- This currently is a boolean (0/1) field in edusharing.
ccm:oeh_quality_copyright_law: "Urheberrecht"
ccm:oeh_quality_criminal_law: "Strafrecht"
ccm:oeh_quality_login: "Login notwendig"
- This currently is a boolean (0/1) field in edusharing.
ccm:oeh_quality_relevancy_for_education: "geeignet für Bildung (WLO-Suche)"
ccm:oeh_quality_transparentness: "Anbieter Renommee"
ccm:oeh_quality_didactics: "Didaktik/Methodik"
ccm:oeh_quality_medial: "Medial passend"
ccm:oeh_quality_language: "Sprachlich"
ccm:oeh_quality_neutralness: "Neutralität"
ccm:oeh_quality_currentness: "Aktualität"
ccm:oeh_quality_data_privacy: "Datenschutz"
- This is a 0-5 star field, but it does not use a vocabulary / value-space.
ccm:oeh_quality_correctness: "Sachrichtigkeit"

Available Extractors

On the other side, we have the current extractor implementations:

Advertisement
EasyPrivacy
MaliciousExtensions
ExtractFromFiles
FanboyAnnoyance
FanboyNotification
FanboySocialMedia
AntiAdBlock
EasylistGermany
EasylistAdult
Paywalls
Security
IFrameEmbeddable
PopUp
RegWall
LogInOut
Cookies
GDPR
Javascript
Accessibility
LicenceExtractor

MRuecklCC commented 2 years ago

Mapping between extractors and metadata fields

As a first step, the following relations come to mind:

ccm:oeh_quality_protection_of_minors (Jugendschutz):
- EasylistAdult would need to be modified to the binary output schema
ccm:oeh_quality_login (Login notwendig)
- LogInOut, RegWall, Paywalls
- Extractors would need to be combined into a single binary output.
ccm:oeh_quality_data_privacy (Datenschutz)
- EasyPrivacy, GDPR, Cookies
- Extractors would need to be combined into a single 0-5 star value space.
ccm:accessibilitySummary
- We could map the Accessibility extractor's output score to the A, AA, AAA scale.
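
To illustrate what "combining extractors" could look like, here is a rough sketch. The extractor groups come from the mapping above; the combination rules themselves (any login-related hit means "login required", two stars deducted per flagged privacy extractor) are assumptions made for this example and have not been agreed upon.

```python
from typing import Dict

# Extractor groups taken from the mapping above.
LOGIN_EXTRACTORS = ("LogInOut", "RegWall", "Paywalls")
PRIVACY_EXTRACTORS = ("EasyPrivacy", "GDPR", "Cookies")


def login_required(detected: Dict[str, bool]) -> int:
    """Return 1 if any login-related extractor reported a match, else 0."""
    return int(any(detected.get(name, False) for name in LOGIN_EXTRACTORS))


def data_privacy_stars(detected: Dict[str, bool]) -> int:
    """Hypothetical rule: start at 5 stars, subtract 2 per flagged extractor, floor at 0."""
    flagged = sum(detected.get(name, False) for name in PRIVACY_EXTRACTORS)
    return max(0, 5 - 2 * flagged)


# Hypothetical per-extractor detection flags for one page.
detected = {"RegWall": True, "Cookies": True, "GDPR": False}
print(login_required(detected), data_privacy_stars(detected))
```

The actual weighting of the individual extractors would still need to be decided per field.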

MRuecklCC commented 2 years ago

Given the example of ccm:oeh_quality_protection_of_minors, it also becomes clear that the current response data model may be inadequate.

Consider the following two scenarios, where the service receives a request to extract meta information for a website that contains adult advertisement.

The advertisement is detected by the EasylistAdult extractor, which immediately makes clear that the content is not suited as OER. The service could respond with a 0-star rating for ccm:oeh_quality_protection_of_minors.
The EasylistAdult extractor does not detect the ad (because it is not part of the respective blacklist). If the service responded with a 5-star rating (because it didn't detect anything), that would be bad. A more conservative approach would be to omit the ccm:oeh_quality_protection_of_minors assessment (better safe than sorry).

Similar arguments can be made for other attributes. In those cases, the response data model could either

be explicit about it ("hey, I am not entirely sure, but I didn't find anything suspicious, here is my X-star rating")
omit the respective assessment ("hey, I'm not gonna tell you, because I'm not entirely sure")

In abstract terms:

If the extractor's goal is to guarantee the absence of something that is defined via a blacklist, there will always be the issue that the blacklist may be incomplete.
If the extractor's goal is to guarantee the presence of something that is defined via a whitelist, there will always be the issue that the whitelist may be incomplete.

In both cases, we could refrain from responding with an assessment, or at least wrap it in a "maybe"/"potentially".
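
The conservative variant can be sketched as follows. The function name, the domain sets, and the example domains are hypothetical; the point is only that a blacklist hit yields a definite 0-star rating, while the absence of a hit yields no assessment at all rather than a 5-star rating.

```python
from typing import Optional, Set


def protection_of_minors_stars(
    page_domains: Set[str], adult_blacklist: Set[str]
) -> Optional[int]:
    """Return 0 stars on a blacklist hit, otherwise None (no assessment)."""
    if page_domains & adult_blacklist:
        return 0  # a hit is conclusive
    return None  # no hit proves nothing: the blacklist may be incomplete


# Hypothetical blacklist and page contents for illustration.
blacklist = {"adult-ads.example"}
print(protection_of_minors_stars({"cdn.example", "adult-ads.example"}, blacklist))
print(protection_of_minors_stars({"cdn.example"}, blacklist))
```

Returning `None` instead of a rating leaves it to the caller to decide whether to show "unknown" or to ask for manual review.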

RobertMeissner commented 2 years ago

Regarding your latest comment: so basically, there is no safe way of using black-/whitelists to make a solid statement. Everything we say relies on the lists being "complete", whatever that means.

MRuecklCC commented 2 years ago

I read about accessibility ratings and Lighthouse:

The current value space is weird: https://github.com/openeduhub/oeh-metadata-vocabs/issues/18
Lighthouse has some WCAG checks, but no automatic WCAG rating output
There is an online tool that does WCAG ratings: https://www.siteimprove.com/toolkit/accessibility-checker
- Probably uses Lighthouse under the hood :-)
- Rather slow as well
- Probably rate limited so not really suitable for our case
- Spits out scores from 0-100 for each WCAG level
Lighthouse output does not automatically provide a WCAG rating. In addition, it does not provide all the checks needed to fully assess the WCAG rating. This means that even if we check the detailed output of Lighthouse, we cannot fully automate a WCAG rating from it (only give a suggestion).
To deduce a WCAG suggestion from Lighthouse, we need to analyse the individual checks by hand and correlate them with the WCAG ratings: https://github.com/dequelabs/axe-core/blob/develop/doc/rule-descriptions.md

lummerland commented 2 years ago

Given the example of ccm:oeh_quality_protection_of_minors, it also becomes clear that the current response data model may be inadequate.

It looks as if we need to discuss and decide more or less every field and mapping because of special characteristics. I think it would be helpful to have more detailed information on top of the "simple" mapping of fields. E.g.

How much confidence do we have in the result of the specific data?
What strategies could be used to put the data into the documents? That is, can we put it directly into the metadata without risk? Or should we rather make it a proposal that somebody has to accept manually? Or something else?
How much risk do we take when data is wrong, vague, or otherwise unreliable?
...

MRuecklCC commented 2 years ago

As a first shot, I will provide a new API endpoint that returns the following four attributes:

ccm:oeh_quality_protection_of_minors
ccm:oeh_quality_login
ccm:oeh_quality_data_privacy
ccm:accessibilitySummary

The structure will follow what is available on the /extract endpoint. The mapping from extractor to LRMI metadata field will be implemented in the most trivial way from the extractors listed above.

The endpoint will be POST {base-uri}/lrmi-suggestions. It will take a JSON body with the following structure:

{
    "url":"https://some-domain.de/path/to/content.html"
}

For now, the endpoint will only provide results for HTML content. Responses for non-HTML content are unspecified for now. The response body will look as follows:

{
  "ccm:oeh_quality_protection_of_minors": {
    "stars": 0-5, # may be missing, in which case there will be an error message
    "explanation": "some human readable string",
    "error": "", # only present if extraction failed, in which case neither stars, explanation nor extra will be available
    "extra": {
      # attribute-specific extra information; the structure depends on the attribute
    }
  },
  "ccm:oeh_quality_login": {
    # same as above
  },
  "ccm:oeh_quality_data_privacy": {
    # same as above
  },
  "ccm:accessibilitySummary": {
    # same as above
  }
}
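
A client could consume such a response roughly as sketched below. This assumes the field names from the proposal above (`stars`, `explanation`, `error`, `extra`). The `Suggestion` class, the `parse_suggestions` helper, and the sample payload are made up for illustration and are not part of the service.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Suggestion:
    stars: Optional[int] = None
    explanation: Optional[str] = None
    error: Optional[str] = None  # set only if extraction failed
    extra: dict = field(default_factory=dict)


def parse_suggestions(body: str) -> Dict[str, Suggestion]:
    """Turn the JSON response body into one Suggestion per attribute."""
    return {
        name: Suggestion(
            stars=entry.get("stars"),
            explanation=entry.get("explanation"),
            error=entry.get("error"),
            extra=entry.get("extra", {}),
        )
        for name, entry in json.loads(body).items()
    }


# Hypothetical response body: one successful attribute, one failed attribute.
sample = json.dumps(
    {
        "ccm:oeh_quality_login": {
            "stars": 5,
            "explanation": "no login form found",
            "extra": {},
        },
        "ccm:accessibilitySummary": {"error": "extraction timed out"},
    }
)

for name, suggestion in parse_suggestions(sample).items():
    if suggestion.error is not None:
        print(f"{name}: failed ({suggestion.error})")
    else:
        print(f"{name}: {suggestion.stars} stars ({suggestion.explanation})")
```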

openeduhub / metalookup