iDigBio / research-project-ideas

Project ideas and discussion for research using iDigBio data and resources
MIT License

Counts of terms summarized by field #11

Open mjcollin opened 7 years ago

mjcollin commented 7 years ago

Austin wants to know the number of times a term appears in each field. His use case is that he has terms of interest but not all terms are of interest in all fields. "lake" may be interesting in the locality field but not the collector field.

This is solvable with a term-document index. Another good data product to generate.
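As a rough illustration of the idea, here is a minimal in-memory sketch that counts, per field, how many records contain each term of interest. The sample records, the field names, and the simple `\w+` tokenizer are all assumptions made for illustration; a real index would use a proper analyzer and run over the full dataset.

```python
import re
from collections import Counter, defaultdict


def count_terms_by_field(records, terms):
    """Return {field: Counter({term: number of records containing term})}."""
    wanted = {t.lower() for t in terms}
    counts = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            # Naive tokenizer (assumption): lowercase, split on non-word characters.
            tokens = set(re.findall(r"\w+", str(value).lower()))
            for term in wanted & tokens:
                counts[field][term] += 1
    return counts


# Made-up records, for illustration only.
records = [
    {"dwc:locality": "North shore of Lake Jackson", "dwc:recordedBy": "A. Lake"},
    {"dwc:locality": "Dry lake bed", "dwc:recordedBy": "J. Doe"},
]
counts = count_terms_by_field(records, ["lake"])
print(dict(counts))
```

Because the counts are keyed by field first, "lake" in the locality field and "lake" in the collector field stay separate, which is the distinction Austin is after.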

godfoder commented 7 years ago

I was also talking to Annika Smith, and she expressed an interest in helping build/operate the pipeline for canonicalizing raw values to controlled vocabularies to ontologies. This is a more complicated use case than Austin's but it would be good to plan for in the implementation.

debpaul commented 7 years ago

Please add to this challenge that the raw verbatim data is what might (in the future) be matched against BHL literature. So we need a way to keep the raw values and also provide canonical data. And researchers are going to be a lot happier about this (adopting it, implementing it, adding to it) if they are involved in the process.

themerekat commented 7 years ago

Attached is Austin's and my specific use case of the terms (first column) we would like to search for and the fields (second row) in which we want to do so. The information we are hoping to get is how many times each term shows up in each field (which is why the datasheet is formatted as an empty matrix). SearchTermsAndFields.txt

godfoder commented 7 years ago

Here is the matrix filled in from the current iDigBio dataset. The API currently doesn't support the needed query type, but adding it wouldn't be that hard either.

matrix.txt

themerekat commented 7 years ago

@godfoder this is awesome, thanks! I would appreciate the ability to query all fields using the API, since not all of the fields I explore are currently supported.

godfoder commented 7 years ago

The API supports the fields; it's just the match query subtype, which lets you run multi-word queries against parsed fields, that isn't supported.

For single word queries you can just use "data.dwc:fieldName": "value" and it will work.

Ex. https://search.idigbio.org/v2/summary/count/records?rq={"data.dwc:habitat":"introduced"}
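If you want to generate these URLs for many field/value pairs, a small helper along these lines should work. It only builds the URL in the same form as the example above, with the `rq` JSON percent-encoded; the helper name `count_url` is made up for illustration, and fetching and parsing the response is left out.

```python
import json
from urllib.parse import urlencode

# Count endpoint taken from the example URL above.
BASE = "https://search.idigbio.org/v2/summary/count/records"


def count_url(field, value):
    """Build a single-word count query URL for a raw Darwin Core field."""
    rq = json.dumps({f"data.dwc:{field}": value}, separators=(",", ":"))
    # urlencode percent-encodes the JSON so it is safe to paste or fetch.
    return f"{BASE}?{urlencode({'rq': rq})}"


url = count_url("habitat", "introduced")
print(url)
```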

themerekat commented 7 years ago

Huh, interesting! I was under the impression that not all datafields were searchable, just the ones listed on the index: https://github.com/iDigBio/idigbio-search-api/wiki/Index-Fields

mjcollin commented 7 years ago

You can look at the fields metadata endpoint linked there to see all fields that are available and the syntax for referring to them in API queries. I recommend installing a JSON plugin in your browser, like JSONView, to make it pretty:

http://search.idigbio.org/v2/meta/fields/records

godfoder commented 7 years ago

@mjcollin It's a simple loop over terms and values into a count query, but I didn't post the code because I needed to use Elasticsearch directly rather than the API, since we don't currently support match queries. The meat of it is:

es.count(index="idigbio", doc_type="records", body={
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            f: v
          }
        }
      ]
    }
  }
})

where f is the field name, and v is the value
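The loop around that call could look roughly like the sketch below. This is not the script that produced matrix.txt; the function names are made up, and the counter is passed in as a function so the example runs without a cluster. The fake counts and field names are illustrative only.

```python
def match_body(field, value):
    """Build the count request body shown above for one field/value pair."""
    return {"query": {"bool": {"must": [{"match": {field: value}}]}}}


def build_matrix(terms, fields, count_fn):
    """Return {term: {field: count}}, calling count_fn once per cell."""
    return {
        term: {field: count_fn(match_body(field, term)) for field in fields}
        for term in terms
    }


# Stand-in counter (assumption) so the sketch runs offline. With a real client
# it would be something like:
#   lambda body: es.count(index="idigbio", doc_type="records", body=body)["count"]
FAKE_COUNTS = {("data.dwc:locality", "lake"): 3, ("data.dwc:recordedBy", "lake"): 1}


def fake_count(body):
    [(field, value)] = body["query"]["bool"]["must"][0]["match"].items()
    return FAKE_COUNTS.get((field, value), 0)


matrix = build_matrix(
    ["lake", "pond"], ["data.dwc:locality", "data.dwc:recordedBy"], fake_count
)
print(matrix)
```

Each cell is one count query, so the cost grows as terms times fields.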

themerekat commented 7 years ago

Is it possible to use elasticsearch to find phrases (i.e. words separated by spaces)? We're really interested in how many results are produced when searching for not only words, but phrases. Also, is it possible to get the results from ALL fields? I'm thinking the result would be a matrix like before, with counts of all of the words (and phrases) for every field available on iDigBio. How hard would it be to run this type of query?
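For reference, Elasticsearch does have a `match_phrase` query type, which requires the analyzed words to appear next to each other and in order. A request body for it could be built as in the sketch below. Whether a given iDigBio field is indexed in a way that supports phrase matching is an assumption here, and the helper name and field name are only examples.

```python
def phrase_body(field, phrase):
    """Build a count/search body that matches the phrase as adjacent words."""
    return {"query": {"match_phrase": {field: phrase}}}


# Example field and phrase, for illustration only.
body = phrase_body("data.dwc:locality", "dry lake")
print(body)
```

Covering ALL fields would mean repeating a query like this once per field name, for example using the names listed by the fields metadata endpoint.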

debpaul commented 7 years ago

Oh yes - phrases ;-) Now we're talking Carrot-squared-like functionality. Matt, Alex, you've heard this before :-D

Can we, as a separate step - get Katie and Austin up and running an instance of Carrot squared? Maybe they can find someone up here at FSU that could help them do this? Is it the right tool for the job? (at least the phrases job)?

Deb

themerekat commented 7 years ago

@debpaul are you opening up cans of worms for us...? ;)