Analyze DeepGreen data

acka47 commented 4 years ago

We were asked to check out the DeepGreen data. As we don't have a repo for that (yet), I am posting the issue here. To get a good impression of the data, I suggest indexing it in Elasticsearch.

Here are the API basics:

Get resource by id:

https://www.oa-deepgreen.de/api/v1/notification/<id>
Example: https://www.oa-deepgreen.de/api/v1/notification/30b0c1d07d61492893302eaa737b97a9

Get resources by indexing date:

https://www.oa-deepgreen.de/api/v1/routed?since=<date>
Example: https://www.oa-deepgreen.de/api/v1/routed?since=2019-07-18 (BTW, this is the date the first resource was indexed)

Paging & page size

pageSize=<size> with 100 being the maximum, e.g. pageSize=100
page=<page>, from 1 to n, example page=2

In the response, the resource descriptions are found in the notifications array.

Accordingly, we have to do the following:

[x] Page through https://www.oa-deepgreen.de/api/v1/routed?since=2019-07-18&pageSize=100&page=1, get the objects in the notifications array (e.g. with $ curl 'https://www.oa-deepgreen.de/api/v1/routed?since=2019-07-18&pageSize=100&page=1' | jq -r '.notifications[]') and write each page to a JSON file:
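```
#!/bin/bash
# Fetch result pages 1..474 (100 records each) and write one output file per page.
i=1
while [ "$i" -le 474 ]; do
    echo "$i"
    # The jq filter is quoted so that the shell does not try to glob "[]".
    curl -s "https://www.oa-deepgreen.de/api/v1/routed?since=2019-07-18&pageSize=100&page=$i" | jq -r '.notifications[]' > "oa-deepgreen-$i.json"
    # Increment after fetching so that page 1 is not skipped.
    i=$((i + 1))
done
```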
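
If the number of pages is not known in advance, the loop can stop at the first empty page instead. The following is only a sketch: it assumes that the API returns an empty notifications array for pages beyond the last one (not verified here), and the function names `page_url`, `fetch_page` and `harvest` are made up for illustration.

```shell
#!/bin/bash
BASE_URL="https://www.oa-deepgreen.de/api/v1/routed"

# Build the URL for one result page (pageSize defaults to the API maximum of 100).
page_url() {
    local since=$1 page=$2 size=${3:-100}
    printf '%s?since=%s&pageSize=%s&page=%s\n' "$BASE_URL" "$since" "$size" "$page"
}

# Fetch one page; kept separate so it can be replaced by a stub when testing offline.
fetch_page() {
    curl -sf "$(page_url "$1" "$2")"
}

# Write the notifications of each page to oa-deepgreen-<page>.json, one object per line.
# Stops at the first page without notifications (assumed behaviour, see above)
# or after max pages, whichever comes first.
harvest() {
    local since=$1 max=${2:-10000} page=1 response out
    while [ "$page" -le "$max" ]; do
        out="oa-deepgreen-$page.json"
        response=$(fetch_page "$since" "$page") || return 1
        printf '%s\n' "$response" | jq -c '.notifications[]?' > "$out" || return 1
        if [ ! -s "$out" ]; then
            rm -f "$out"    # empty page: nothing more to harvest
            break
        fi
        page=$((page + 1))
    done
}
```

Calling `harvest 2019-07-18` would then produce the same kind of per-page files as the script above, without relying on the page count.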

[x] Make it an ndjson file with:

cat oa-deepgreen-* | jq -c '.' > all-oa-deepgreen.ndjson
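
Before going on, it may be worth checking the combined file. The sketch below counts the records and the ids that occur more than once; it assumes that every record carries a top-level `id` field (the same field the bulk step uses as `_id`), and `ndjson_summary` is a made-up helper name. Duplicates would matter because records with the same `_id` overwrite each other when indexed.

```shell
# Print the number of records in an ndjson file and the number of ids
# that appear more than once.
ndjson_summary() {
    local file=$1
    echo "records: $(jq -c '.' "$file" | wc -l | tr -d ' ')"
    echo "duplicate ids: $(jq -r '.id' "$file" | sort | uniq -d | wc -l | tr -d ' ')"
}
```

Usage: `ndjson_summary all-oa-deepgreen.ndjson`.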

[x] Create an ES bulk index file from this (similar to https://github.com/hbz/lobid/issues/411#issuecomment-583416161), probably something like:
```
cat all-oa-deepgreen.ndjson | jq -r '. | "\({ "index":{"_index":"deepgreen", "_type": "Article", "_id": .id } })\n\(.)"' > bulk.ndjson
```
Note: ES index names must be lowercase!
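
The same bulk file can also be produced with jq's comma operator instead of string interpolation, which emits the action line and the document as two separate compact JSON lines. This is only an equivalent sketch, not the command that was actually run; `to_bulk` is a hypothetical helper, and the index and type names are passed in as arguments with the values from above as defaults.

```shell
# Read ndjson on stdin and write Elasticsearch bulk format on stdout:
# one action line followed by one source line per record.
to_bulk() {
    local index=${1:-deepgreen} type=${2:-Article}
    jq -c --arg index "$index" --arg type "$type" \
        '{"index": {"_index": $index, "_type": $type, "_id": .id}}, .'
}
```

Usage: `to_bulk < all-oa-deepgreen.ndjson > bulk.ndjson`. The resulting file should have exactly twice as many lines as the input.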

[x] Index it on labs/aither:

curl -XPOST https://lobid.org/eslabs/deepgreen/_bulk --data-binary  @bulk.ndjson
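
A bulk request can succeed as a whole while individual records are rejected, so it can help to inspect the response. The sketch below reads a `_bulk` response and lists the failed items. It assumes the response layout of reasonably recent Elasticsearch versions (an `items` array whose entries carry an `error` object with a `type`); `bulk_failures` is a made-up helper name.

```shell
# Read a _bulk response on stdin. Prints "<_id><TAB><error type>" for every
# rejected item and returns 1 if there was at least one, 0 otherwise.
bulk_failures() {
    local failed
    failed=$(jq -r '.items[] | .index | select(.error != null) | "\(._id)\t\(.error.type)"')
    if [ -n "$failed" ]; then
        printf '%s\n' "$failed"
        return 1
    fi
}
```

Usage: `curl -s -XPOST -H 'Content-Type: application/x-ndjson' https://lobid.org/eslabs/deepgreen/_bulk --data-binary @bulk.ndjson | bulk_failures`. The Content-Type header is required by Elasticsearch 6 and later and should be harmless on older versions.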

[x] Query it like:

curl https://lobid.org/eslabs/deepgreen/_search?q=Mathematics

dr0i commented 4 years ago

Harvested and indexed, see list above.

acka47 commented 4 years ago

Ok, I can now check some basic infos, e.g.:

- Number of resources with CC-BY license: 39654
  - Those with CC-BY that come from "Frontiers Media S.A.": https://lobid.org/eslabs/deepgreen/_search?q=metadata.license_ref.type:%22CC-BY%22+AND+metadata.publisher:%22frontiers%20media%22
- Resources with at least one authorID: 10607
- Resources with at least one author who is affiliated with a German institution: 26328
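
Counts like these can also be fetched from the command line. The following is a sketch that assumes the index exposes Elasticsearch's standard `_count` endpoint under the same base URL; `es_query` and `count_hits` are made-up helper names.

```shell
ES="https://lobid.org/eslabs/deepgreen"

# Send a query-string query to an endpoint (_search or _count).
# curl -G with --data-urlencode takes care of the URL-encoding.
es_query() {
    local endpoint=$1 query=$2
    curl -s -G "$ES/$endpoint" --data-urlencode "q=$query"
}

# Print only the number of matching documents.
count_hits() {
    es_query _count "$1" | jq '.count'
}
```

Usage: `count_hits 'metadata.license_ref.type:"CC-BY"'`.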

However, it would be great to have aggregations. Do we have to configure something for this, or can I already view aggregations?

dr0i commented 4 years ago

Fixed Kibana by restarting it.

Re aggs: possible by defining fields as keywords. I did this for affiliation. Tell me which fields you want aggs for, and I will configure them. But using (huge) literals as keys is not a good idea, as shown by this:

curl -XGET -H 'Content-Type: application/json' 'https://lobid.org/eslabs/deepgreen/_search?q=metadata.author.affiliation:*germany*&pretty=true' -d '
{
  "aggs": {
    "aggs1": {
      "terms": {
        "field": "metadata.author.affiliation",
        "size": 50
      }
    }
  }
}
'

because, as can clearly be seen from the result of the query above, literals are seldom unique.

acka47 commented 4 years ago

I am OK with Kibana for now, so I don't need any aggregations configured.

> as can clearly be seen from the result of the query above, literals are seldom unique.

Do you mean they are all unique? However, I agree that aggregations don't make sense for metadata.author.affiliation.

acka47 commented 4 years ago

Closing.

acka47 commented 4 years ago

What is strange is that I cannot limit my search to the field metadata.author.affiliation, e.g. https://lobid.org/eslabs/deepgreen/_search?q=metadata.author.affiliation:k%C3%B6ln doesn't return any hits, although there are lots of records that should match. As it works with https://lobid.org/eslabs/deepgreen/_search?q=k%C3%B6ln, the field must be indexed, though.

dr0i commented 4 years ago

To enable aggregations, this field must be of a type that is not analyzed, i.e. you can only look it up by its complete value (i.e. the huge blob). Maybe we should dump the idea of having aggs of this field?
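
For reference, Elasticsearch multi-fields would allow both: an analyzed field for searching plus a keyword sub-field for aggregations. The sketch below only prints such a mapping. It assumes Elasticsearch 5 or 6 (mapping nested under the `Article` type used above) and is not the mapping that is actually deployed; `affiliation_mapping` is a made-up helper name.

```shell
# Print an index body in which metadata.author.affiliation is analyzed text
# (searchable by single words) with a non-analyzed "keyword" sub-field
# (usable in terms aggregations).
affiliation_mapping() {
    cat <<'EOF'
{
  "mappings": {
    "Article": {
      "properties": {
        "metadata": {
          "properties": {
            "author": {
              "properties": {
                "affiliation": {
                  "type": "text",
                  "fields": {
                    "keyword": { "type": "keyword", "ignore_above": 256 }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
EOF
}
```

With such a mapping, a search would use `metadata.author.affiliation` and an aggregation would use `metadata.author.affiliation.keyword`. Note that `ignore_above` leaves values longer than 256 characters out of the keyword sub-field, so very long affiliation strings would not show up in the aggregation at all.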

acka47 commented 4 years ago

> Maybe we should dump the idea of having aggs of this field?

+1

dr0i commented 4 years ago

Changed the mappings and reindexed again. (Note to self: scripts reside at @aither:~/oa-deepgreen.)

acka47 commented 4 years ago

Closing this, as we provided some analytics for management and there are no requests pending.

hbz / lobid

Analyze DeepGreen data #418