inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

Invenio-classifier using keywords as key in dictionary #2456

Closed kaplun closed 7 years ago

kaplun commented 7 years ago

Current Behavior

invenio-classifier outputs a lot of information as a Python structure. Unfortunately, the structure used is:

{'categories': {'ATLAS': 'HEP',
  'CERN LHC Coll': 'HEP',
  [...]
  'wave function': 'HEP',
  'weak coupling': 'HEP'},
 'complete_output': {'acronyms': {},
  'author_keywords': [],
  'composite_keywords': {'axion: decay constant': {'details': [18, 2],
    'numbers': 1},
   'coupling: Yukawa': {'details': [8, 1], 'numbers': 1},
   'dimension: 2': {'details': [5, 58], 'numbers': 1},
   'electroweak interaction: standard model': {'details': [10, 2],
    'numbers': 1},
   'energy: ground state': {'details': [26, 2], 'numbers': 2},
   [...]
   'transformation: unitarity': {'details': [12, 3], 'numbers': 1}},
  'core_keywords': {'ATLAS': 1,
   'CERN LHC Coll': 6,
   [...],
   'string model': 2,
   'supersymmetry': 3},
  'field_codes': {'g': 'cosmological constant'},
  'single_keywords': {'ATLAS': 1,
   'Chern-Simons number': 14,
   [...]
   'toy model': 1,
   'translation': 3,
   'wave function': 1}}}

The problem is that each keyword is used as a key in a dictionary. That dictionary is then passed as-is to ES, which tries to guess a mapping for every single keyword.

Expected Behavior

Keywords should be passed as a list of tuples, possibly sorted by importance.
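A minimal sketch of that reshaping, assuming the occurrence count can stand in for "importance". It emits small dicts with fixed field names rather than bare tuples, which is a variant of the proposal and not what invenio-classifier itself does. The helper name `keywords_to_list` and the `number` field are hypothetical; the `keyword` field name mirrors the query path quoted further down in this thread.

```python
def keywords_to_list(keyword_counts):
    """Turn {'keyword': count} into a list of dicts, highest count first.

    Ties are broken alphabetically so the output is deterministic.
    """
    ordered = sorted(keyword_counts.items(), key=lambda item: (-item[1], item[0]))
    # Fixed field names ("keyword", "number") mean ES only has to map two
    # fields, instead of one field per distinct keyword.
    return [{"keyword": keyword, "number": count} for keyword, count in ordered]


# Sample values taken from the 'single_keywords' dump above.
single_keywords = {"ATLAS": 1, "Chern-Simons number": 14, "translation": 3}
as_list = keywords_to_list(single_keywords)
print(as_list)
```

The point of the design is that the set of keys no longer grows with the vocabulary: every entry has the same two keys, whatever the keyword is.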

Note: this was partially addressed already in: https://github.com/inveniosoftware-contrib/invenio-classifier/pull/25

jacquerie commented 7 years ago

Or we just disable indexing of this part of the record, since nobody needs this information to be searchable... the same actually goes for all the output of ML algorithms in extra_data.

Note that this is afforded to us by the slogan "data in the DB, stuff to search on in ES", the thing we were discussing during standup.
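For illustration, disabling indexing for this part of the record could look roughly like the mapping fragment below. This is only a sketch: it relies on the Elasticsearch `enabled` mapping parameter for object fields, and the field path is assumed from the query URL quoted later in this thread. The actual INSPIRE mapping files may be laid out differently.

```python
import json

# Hypothetical mapping fragment. With "enabled": False, Elasticsearch keeps
# the object in _source but does not parse or index its contents, so no
# per-keyword fields are created.
mapping_fragment = {
    "properties": {
        "_extra_data": {
            "properties": {
                "classifier_results": {
                    "type": "object",
                    "enabled": False,
                }
            }
        }
    }
}

print(json.dumps(mapping_fragment, indent=2))
```

The trade-off is the one raised in this thread: the data stays retrievable with the record, but nothing under that field can be searched.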

kaplun commented 7 years ago

I think this could be a good interim solution. However @ksachs mentioned in https://github.com/inspirehep/inspire-next/issues/2328#issuecomment-300735429 that she is searching for keywords in the Holding pen.

@ksachs what is the use case? Why do you actually need to search using keywords in the holding pen at all?

ksachs commented 7 years ago

I just find it weird to have metadata stored in a way that is not (really) searchable. I have no idea yet how we will use the new HP and what we will want to search for.

jacquerie commented 7 years ago

Ah, we didn't relay this back to you, but searching for keywords has already improved. Now you can do https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=_extra_data.classifier_results.complete_output.single_keywords.keyword:%22unified%20field%20theory%22 instead of what you said you were doing in the linked issue.
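As a rough illustration, a search URL of that shape can be built with the standard library. The helper name `holdingpen_keyword_url` is hypothetical; the base URL, parameter names and field path are copied from the link above. Note that `urlencode` encodes spaces as `+` rather than `%20`, which is equivalent in a query string.

```python
from urllib.parse import parse_qs, urlencode

BASE_URL = "https://labs.inspirehep.net/holdingpen/list/"
KEYWORD_FIELD = "_extra_data.classifier_results.complete_output.single_keywords.keyword"


def holdingpen_keyword_url(keyword, page=1, size=10):
    """Build a holding pen search URL for an exact keyword match."""
    # Quote the keyword so multi-word keywords are searched as a phrase.
    query = '{}:"{}"'.format(KEYWORD_FIELD, keyword)
    return BASE_URL + "?" + urlencode({"page": page, "size": size, "q": query})


url = holdingpen_keyword_url("unified field theory")
# Decode the query string again to show the search expression that is sent.
print(parse_qs(url.split("?", 1)[1])["q"][0])
```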