How should we import repository statistics?

trugwaldsaenger commented 6 years ago

Prompted by @philboeselager's participation in the last jointly/edusharing workshop, we discussed integrating the statistics of individual repositories into the world map.

As far as I understand, there are two quite different approaches:

1) Display statistical data without importing it into the search index

As soon as a repository's API provides data about its size (How many documents are included?), the licenses used (How many documents are CC BY licensed? How many use other licenses?) and the subjects covered (How many resources are available in the field of medicine?), it should be quite easy to display this as diagrams within a service profile. Anne Zobel provided a screenshot showing what this could look like:

[Image: screenshot provided by Anne Zobel]

I think integrating data like this adds value in two ways:

Users can see how many documents are included, which subjects they cover and how they are licensed. This information helps users evaluate a repository.
We can show stakeholders what it could look like if the World Map were connected to the content sources it collects.

I like the way it is implemented in the screenshot (showing only one statistic at a time, with a switch between them), and I think it should fit well into the right column of the new full-page profile layout.
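
A minimal sketch of the aggregation behind such diagrams, assuming a repository API that returns one record per document. The record layout and the field names `license` and `subject` are assumptions for illustration, not an existing edusharing or World Map API.

```python
from collections import Counter

# Hypothetical sample of what a repository API might return: one record per
# document. The field names "license" and "subject" are assumptions.
documents = [
    {"license": "CC BY", "subject": "Medicine"},
    {"license": "CC BY-SA", "subject": "Medicine"},
    {"license": "CC BY", "subject": "Mathematics"},
]


def summarize(docs):
    """Aggregate per-document records into size, license and subject counts."""
    return {
        "size": len(docs),
        "licenses": dict(Counter(d["license"] for d in docs)),
        "subjects": dict(Counter(d["subject"] for d in docs)),
    }


def chart_series(counts):
    """Turn a counts dict into (label, share) pairs, largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return [(label, n / total) for label, n in ordered]


stats = summarize(documents)
print(stats["size"])
print(chart_series(stats["licenses"]))
```

Each statistic (`licenses`, `subjects`) yields one series, which matches the idea of showing a single switchable diagram.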

2) Import statistical data into OER World Map index

The second solution would be more complex. The idea here is to import the data and include it in our data model. This should allow us to do things like the following:

Include a new field "Size" for services (of the future subtype "repository") and import its value automatically
Include values for the fields "license" and "subject" automatically
Show number of documents behind each value [CC BY (756), CC BY-SA (1578), ...]
Allow searches like: "Show me all repositories that include more than 20,000 documents"
Rank repositories by size or by the number of CC0 documents they include
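
Once the counts are part of the index, the searches and rankings listed above reduce to simple filters and sorts. The following is only a sketch over in-memory records; the field names `size` and `licenseCounts` and the sample repositories are hypothetical.

```python
# Hypothetical imported records; "size" and "licenseCounts" are assumed names.
repositories = [
    {"name": "Repo A", "size": 25000, "licenseCounts": {"CC0": 120, "CC BY": 756}},
    {"name": "Repo B", "size": 8000, "licenseCounts": {"CC BY-SA": 1578}},
    {"name": "Repo C", "size": 40000, "licenseCounts": {"CC0": 90}},
]


def larger_than(repos, min_size):
    """Repositories that include more than min_size documents."""
    return [r for r in repos if r["size"] > min_size]


def rank_by_size(repos):
    """Largest repository first."""
    return sorted(repos, key=lambda r: r["size"], reverse=True)


def rank_by_license(repos, license_name):
    """Repositories ordered by how many documents carry the given license."""
    return sorted(
        repos,
        key=lambda r: r["licenseCounts"].get(license_name, 0),
        reverse=True,
    )


def facet_labels(license_counts):
    """Format counts as labels such as 'CC BY (756)'."""
    return [f"{name} ({count})" for name, count in license_counts.items()]


print([r["name"] for r in larger_than(repositories, 20000)])
print([r["name"] for r in rank_by_license(repositories, "CC0")])
```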

We might even be able to aggregate this data in the future and show things like "number of OERs in Germany", "increase in OER production in India last year" etc. This is certainly still a long way off...

If we had a standard for this data, the future could ideally look like this: a new repository connects to the world map via an "automatic handshake", and all relevant data is then imported into the map automatically.

The "import solution" is certainly more difficult to implement. Challenges are:

Imported data must be mapped if it does not use the same standard, e.g. subjects must be mapped to our ISCED categories
Automatic import has to be accommodated in the data model
How often should updates be imported?
...
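
To illustrate the mapping challenge, here is a rough sketch that folds a repository's own subject labels into target categories and reports what could not be mapped. The mapping table is hypothetical, and the target identifiers are placeholders, not real ISCED codes.

```python
# Hypothetical mapping from a repository's subject labels (lower-cased) to
# target categories. The identifiers are placeholders, not real ISCED codes.
SUBJECT_MAP = {
    "medizin": "isced-placeholder:health",
    "medicine": "isced-placeholder:health",
    "mathematik": "isced-placeholder:mathematics",
}


def map_subject_counts(source_counts, mapping=SUBJECT_MAP):
    """Sum counts per target category; collect labels without a mapping."""
    mapped = {}
    unmapped = {}
    for label, count in source_counts.items():
        target = mapping.get(label.strip().lower())
        if target is None:
            unmapped[label] = count
        else:
            mapped[target] = mapped.get(target, 0) + count
    return mapped, unmapped


mapped, unmapped = map_subject_counts(
    {"Medizin": 10, "Medicine": 5, "Basket weaving": 2}
)
print(mapped)
print(unmapped)
```

Returning the unmapped labels separately makes it visible how much of a repository's data would be lost or would need manual mapping.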

So how do we go on?

If it is true that solution 1 ("display solution") is easy to implement, I think we should include it as an example for edusharing and maybe some other repositories. Additionally, we should analyse solution 2 ("import solution") in more depth, so that we understand its challenges better.

@acka47 @literarymachine: What is your opinion?

acka47 commented 6 years ago

At yesterday's team meeting we decided to think about how we would actually include this data in our data model and then ask repository providers to submit their statistics according to the structure we come up with.

Taking and updating the list of useful information to collect from the etherpad:

number of resources
used licenses
used subjects
used formats (worksheets, books, ...)
number of users
last edit/new entry (or number of edits/creates in the last week/month)
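
One way to picture what providers might submit is a small record type covering the items above. This is a sketch only: the class `RepositoryStatistics` and all of its field names are assumptions, not an agreed structure.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class RepositoryStatistics:
    """Hypothetical submission format; all field names are assumptions."""

    resource_count: int
    license_counts: Dict[str, int] = field(default_factory=dict)
    subject_counts: Dict[str, int] = field(default_factory=dict)
    format_counts: Dict[str, int] = field(default_factory=dict)
    user_count: Optional[int] = None
    last_modified: Optional[str] = None  # assumed to be an ISO 8601 date string

    def problems(self):
        """Return a list of plausibility problems (empty if none were found)."""
        found = []
        if self.resource_count < 0:
            found.append("resource_count is negative")
        for name in ("license_counts", "subject_counts", "format_counts"):
            if any(n < 0 for n in getattr(self, name).values()):
                found.append(f"{name} contains a negative count")
        return found


submitted = RepositoryStatistics(
    resource_count=73,
    license_counts={"copyright": 23, "cc-by-sa": 50},
)
print(asdict(submitted))
print(submitted.problems())
```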

acka47 commented 6 years ago

There are different options for implementing this in our data structure.

1) Add a statistics object where we put all the count information.
2) Adjust our current data model for subject and license information to also add count data.

In this comment, I am focussing on 2.) although 1.) might be the better way to go.

For the number of resources provided by a service, we could just add a property resourceCount or similar to a Service. For counting things we already have in our data (i.e. topics and licenses), one option is to add a level of indirection to the about and license statements, where we add the count numbers. E.g. for licensing, we could adjust the current licensing information as follows, but we would have to use our own licensing property:

{
   "licensing":[
      {
         "count":"23",
         "license":{
            "image":"https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Copyright.svg/197px-Copyright.svg.png",
            "@type":"Concept",
            "name":[
               {
                  "@value":"Copyright",
                  "@language":"en"
               },
               {
                  "@value":"Copyright",
                  "@language":"de"
               }
            ],
            "@id":"https://oerworldmap.org/assets/json/licenses.json#copyright"
         }
      },
      {
         "count":"50",
         "license":{
            "image":"http://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-sa.png",
            "@type":"Concept",
            "name":[
               {
                  "@value":"Creative Commons Attribution Share-Alike",
                  "@language":"en"
               },
               {
                  "@value":"Creative Commons Namensnennung, Weitergabe unter gleichen Bedingungen",
                  "@language":"de"
               }
            ],
            "@id":"https://oerworldmap.org/assets/json/licenses.json#cc-by-sa"
         }
      }
   ]
}

For subjects/about the solution would be similar to the above.
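
A minimal sketch of how such `licensing` entries could be generated from plain counts. The helper names are hypothetical, the `name` and `image` fields are left out for brevity, and the base URL is taken from the `@id` values in the example above.

```python
# Base URL taken from the "@id" values in the example above.
LICENSE_BASE = "https://oerworldmap.org/assets/json/licenses.json#"


def licensing_entries(counts):
    """Build the 'licensing' list from a {license key: count} mapping.

    Counts are emitted as strings to match the example; this is a choice of
    the sketch, not a requirement.
    """
    return [
        {
            "count": str(n),
            "license": {"@type": "Concept", "@id": LICENSE_BASE + key},
        }
        for key, n in counts.items()
    ]


def total_count(entries):
    """Sum the counts over all licensing entries."""
    return sum(int(e["count"]) for e in entries)


entries = licensing_entries({"copyright": 23, "cc-by-sa": 50})
print(entries[0]["license"]["@id"])
print(total_count(entries))
```

Storing the counts as JSON numbers instead of strings would avoid the `int()` conversion when summing or comparing.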

As we don't have the other information yet (formats, # of users, modifications), we are totally free in how we proceed.

trugwaldsaenger commented 6 years ago

> In this comment, I am focussing on 2.) although 1.) might be the better way to go.

@acka47: How do we find out which way is the better one?

literarymachine commented 6 years ago

> For counting things we already have in our data (i.e. topics and licenses), one option is to add a level of indirection to the about and license statements, where we add the count numbers.

@acka47 I was just wondering if the approach in https://github.com/hbz/oerworldmap/issues/941#issuecomment-267125691 could also be applied here:

{
   "license": [
      {
         "@type": "Role",
         "roleName": "License usage",
         "count": "23",
         "license": {
            "@id": "https://oerworldmap.org/assets/json/licenses.json#copyright"
         }
      },
      {
         "@type": "Role",
         "roleName": "License usage",
         "count": "50",
         "license": {
            "@id": "https://oerworldmap.org/assets/json/licenses.json#cc-by-sa"
         }
      }
   ]
}
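
For comparison, the nested `licensing` structure proposed earlier in the thread can be converted mechanically into this Role-based form. The function below is only an illustrative sketch with a hypothetical name; it keeps the count and the license `@id` and drops everything else.

```python
def to_role_entries(licensing):
    """Convert entries of the form {"count": ..., "license": {...}} into
    Role-style entries that reference the license by "@id" only."""
    return [
        {
            "@type": "Role",
            "roleName": "License usage",
            "count": entry["count"],
            "license": {"@id": entry["license"]["@id"]},
        }
        for entry in licensing
    ]


nested = [
    {
        "count": "23",
        "license": {
            "@type": "Concept",
            "@id": "https://oerworldmap.org/assets/json/licenses.json#copyright",
        },
    }
]
roles = to_role_entries(nested)
print(roles)
```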

acka47 commented 6 years ago

@literarymachine I would really like it if schema.org supported something like this, but 1.) count is not in schema.org and 2.) Role is not used like this AFAIK. Although its general description covers it ("Represents additional information about a relationship or property."), the examples are all cases where there are relations between at least one agent (person or organization) and another entity...

acka47 commented 6 years ago

We might use http://schema.org/AggregateOffer with offerCount here, though...

literarymachine commented 6 years ago

> There are different options for implementing this in our data structure.
>
> 1) Add a statistics object where we put all the count information.

If we do this, we would have to keep in mind that e.g. license information now comes from two places: (imported) statistics and manual entries. I think there should only be one source of truth. So if we add the statistics object, I think we should get rid of the license and about fields. Which sort of brings us back to the same problem: how to model relation counts.

acka47 commented 6 years ago

Related to this issue, I made a proposal for publishing repository information, see the email (German) at https://lists.dnb.de/pipermail/dini-ag-kim-oer/2018-August/000067.html.
