hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Analyze DeepGreen data #418

Closed acka47 closed 4 years ago

acka47 commented 4 years ago

We were asked to check out the DeepGreen data. As we don't have a repo for that (yet), I am posting the issue here. To get a good impression of the data, I suggest indexing it in Elasticsearch.

Here are the API basics:

Get resource by id:

Get resource by indexing data

Paging & page size

In the response, the resource descriptions are found in the notifications array.
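As a rough illustration of the harvesting step, the following Python sketch collects the resource descriptions from the `notifications` array of each response page and turns them into an Elasticsearch bulk request body. The helper names are hypothetical, the `id` field on each notification is an assumption, and the `deepgreen` index name is simply taken from the URLs later in this thread. Depending on the Elasticsearch version, the action line may also need a `_type`.

```python
import json
from typing import Iterable, Iterator


def iter_notifications(pages: Iterable[dict]) -> Iterator[dict]:
    """Yield every resource description from a sequence of API response pages."""
    for page in pages:
        # Resource descriptions live in the "notifications" array of each page.
        yield from page.get("notifications", [])


def to_bulk_ndjson(notifications: Iterable[dict], index: str = "deepgreen") -> str:
    """Build an Elasticsearch bulk request body (newline-delimited JSON)."""
    lines = []
    for doc in notifications:
        action = {"index": {"_index": index}}
        # Assumption: each notification carries an "id" usable as document id.
        if "id" in doc:
            action["index"]["_id"] = doc["id"]
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects a newline after the last line as well.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up pages, standing in for paged API responses.
    pages = [
        {"notifications": [{"id": "a1", "metadata": {"title": "Example one"}}]},
        {"notifications": [{"id": "b2", "metadata": {"title": "Example two"}}]},
    ]
    print(to_bulk_ndjson(iter_notifications(pages)), end="")
```

The resulting string could then be posted to the `_bulk` endpoint of the cluster; fetching the pages themselves is left out because the request details are not part of this sketch.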

Accordingly, we have to do the following:

dr0i commented 4 years ago

Harvested and indexed, see list above.

acka47 commented 4 years ago

OK, I can now check some basic information, e.g.:

However, it would be great to have aggregations. Do we have to configure something for this, or can I already view aggregations?

dr0i commented 4 years ago

Fixed kibana by restarting it.

Re aggs: possible by defining fields as keywords. I did this for affiliation. Tell me which fields you want aggs for and I will configure them. But using (huge) literals as keys is not a good idea, as shown by this:

curl -XGET 'https://lobid.org/eslabs/deepgreen/_search?q=metadata.author.affiliation:*germany*&pretty=true' -d '
{
  "aggs": {
    "aggs1": {
      "terms": {
        "field": "metadata.author.affiliation",
        "size": 50
      }
    }
  }
}
'

because, as can clearly be seen from the result of the query above, literals are seldom unique.
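To make the problem with long literals as aggregation keys concrete, here is a small Python sketch that mimics what a terms aggregation does on a non-analyzed field: it counts exact values. The affiliation strings are made up for illustration and are not taken from the DeepGreen data.

```python
from collections import Counter


def terms_aggregation(values, size=50):
    """Mimic a terms aggregation on a non-analyzed field: one bucket per exact value."""
    return [{"key": k, "doc_count": n} for k, n in Counter(values).most_common(size)]


# Made-up affiliation strings, for illustration only.
affiliations = [
    "Institute of Physics, University of Cologne, Cologne, Germany",
    "Inst. of Physics, Univ. of Cologne, 50937 Köln, Germany",
    "Department of Chemistry, Example University, Berlin, Germany",
]

buckets = terms_aggregation(affiliations)
for bucket in buckets:
    print(bucket["doc_count"], bucket["key"])
```

Because the two Cologne strings differ in spelling and detail, each value ends up in its own bucket with a count of 1, so the aggregation says little about how often an institution actually occurs.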

acka47 commented 4 years ago

I am ok with Kibana for now so don't need any aggregations configured.

as can clearly be seen from the result of the query above, literals are seldom unique.

Do you mean they are all unique? However, I agree that aggregations don't make sense for metadata.author.affiliation.

acka47 commented 4 years ago

Closing.

acka47 commented 4 years ago

What is strange is that I cannot limit my search to the field metadata.author.affiliation, e.g. https://lobid.org/eslabs/deepgreen/_search?q=metadata.author.affiliation:k%C3%B6ln doesn't return any hits although there are lots of cases that should match. Since it works with https://lobid.org/eslabs/deepgreen/_search?q=k%C3%B6ln, the field must be indexed, though.

dr0i commented 4 years ago

To enable aggregations, this field must be of a type that is not analyzed, which means you can only look it up with the complete value (i.e. the huge blob). Maybe we should dump the idea of having aggs of this field?
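The difference between the two field types can be sketched in a few lines of Python. The `analyze` function is only a rough stand-in for a real analyzer, the affiliation value is made up, and the mapping fragment at the end shows an Elasticsearch multi-field as one possible way to get both behaviours. It is not what was configured in this thread.

```python
import re


def analyze(text):
    """Rough stand-in for a text analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def matches_text_field(value, term):
    """An analyzed (text) field matches if the term equals one of its tokens."""
    return term.lower() in analyze(value)


def matches_keyword_field(value, term):
    """A non-analyzed (keyword) field matches only on the complete, exact value."""
    return value == term


# Made-up affiliation value, for illustration only.
affiliation = "Institut für Physik, Universität zu Köln, Germany"

print(matches_text_field(affiliation, "köln"))          # token match
print(matches_keyword_field(affiliation, "köln"))       # no exact match
print(matches_keyword_field(affiliation, affiliation))  # exact match

# Assumption, not taken from the thread: a multi-field mapping that keeps the
# field searchable as text and adds a keyword sub-field for aggregations.
MULTI_FIELD_MAPPING = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword"}},
}
```

With a keyword-only mapping, a query for `köln` on the field behaves like `matches_keyword_field` and finds nothing, which matches the behaviour reported above.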

acka47 commented 4 years ago

Maybe we should dump the idea of having aggs of this field?

+1

dr0i commented 4 years ago

Changed the mappings and reindexed again. (Note to self: scripts reside at @aither:~/oa-deepgreen.)

acka47 commented 4 years ago

Closing this, as we provided some analytics for management and there are no requests pending.