inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

workflows: invenio-classifier almost never fires #2413

Closed jacquerie closed 7 years ago

jacquerie commented 7 years ago

Per user request, the logic to automatically reject papers defers the decision to the curator when invenio-classifier did not fire: https://github.com/inspirehep/inspire-next/blob/9eed5e3d61ea7e04ff9d7192480d9e0ff1c62a46/inspirehep/modules/workflows/tasks/actions.py#L133-L134

The problem is that invenio-classifier almost never fires, so too much stuff is put in front of curators. For example, https://labs.inspirehep.net/api/holdingpen/653398 went through classify_paper but no classifier_results was added to it.

This needs to be fixed before we declare https://github.com/inspirehep/inspire-next/issues/2309 to be fixed.

jacquerie commented 7 years ago

CC: @kaplun

kaplun commented 7 years ago

I don't understand. https://labs.inspirehep.net/api/holdingpen/653398 has classifier_results:

"classifier_results": {
      "categories": {
        "Boltzmann equation": "HEP", 
        "Grid computing": "HEP", 
        "S-matrix": "HEP", 
        "algebra": "HEP", 
        "color": "HEP", 
        "computer": "HEP", 
        "conservation law": "HEP", 
        "costs": "HEP", 
        "critical phenomena": "HEP", 
        "defect": "HEP", 
        "distribution function": "HEP", 
        "engineering": "HEP", 
        "entropy": "HEP", 
        "flow": "HEP", 
        "fragmentation": "HEP", 
        "kinematics": "HEP", 
        "kinetic": "HEP", 
        "nonlinear": "HEP", 
        "nonlocal": "HEP", 
        "phase space": "HEP", 
        "scalar particle": "HEP", 
        "simplex": "HEP", 
        "site": "HEP", 
        "solids": "HEP", 
        "statistical mechanics": "HEP", 
        "symmetry breaking": "HEP", 
        "turbulence": "HEP", 
        "viscosity": "HEP"
      }, 
      "complete_output": {
        "acronyms": {}, 
        "author_keywords": [], 
        "composite_keywords": {
          "density, scalar": {
            "details": [
              7, 
              6
            ], 
            "numbers": 1
          }, 
          "dimension, 2": {
            "details": [
              0, 
              36
            ], 
            "numbers": 3
          }, 
          "effect, higher-order": {
            "details": [
              8, 
              4
            ], 
            "numbers": 1
          }, 
          "energy, cascade": {
            "details": [
              3, 
              3
            ], 
            "numbers": 3
          }, 
          "fluid, coupling": {
            "details": [
              6, 
              2
            ], 
            "numbers": 1
          }, 
          "fluid, magnetic": {
            "details": [
              6, 
              14
            ], 
            "numbers": 1
          }, 
          "fluid, velocity": {
            "details": [
              6, 
              22
            ], 
            "numbers": 2
          }, 
          "hydrodynamics, magnetic": {
            "details": [
              0, 
              14
            ], 
            "numbers": 2
          }, 
          "lattice, dependence": {
            "details": [
              20, 
              2
            ], 
            "numbers": 1
          }, 
          "magnetic field, axial": {
            "details": [
              35, 
              1
            ], 
            "numbers": 1
          }, 
          "magnetic field, effect": {
            "details": [
              35, 
              8
            ], 
            "numbers": 1
          }, 
          "magnetic field, low": {
            "details": [
              35, 
              0
            ], 
            "numbers": 1
          }, 
          "moment, higher-order": {
            "details": [
              20, 
              4
            ], 
            "numbers": 1
          }, 
          "radiation, effect": {
            "details": [
              1, 
              8
            ], 
            "numbers": 1
          }, 
          "stability, magnetic": {
            "details": [
              27, 
              14
            ], 
            "numbers": 1
          }, 
          "tensor, energy-momentum": {
            "details": [
              6, 
              0
            ], 
            "numbers": 1
          }, 
          "vortex, model": {
            "details": [
              11, 
              12
            ], 
            "numbers": 1
          }
        }, 
        "core_keywords": {
          "scalar particle": 2
        }, 
        "field_codes": {}, 
        "filtered_core_keywords": {}, 
        "single_keywords": {
          "Boltzmann equation": 1, 
          "S-matrix": 3, 
          "algebra": 5, 
          "color": 1, 
          "computer": 1, 
          "conservation law": 8, 
          "costs": 3, 
          "critical phenomena": 1, 
          "defect": 1, 
          "distribution function": 15, 
          "engineering": 1, 
          "entropy": 16, 
          "kinematics": 1, 
          "kinetic": 7, 
          "nonlinear": 2, 
          "nonlocal": 2, 
          "scalar particle": 2, 
          "statistical mechanics": 3, 
          "symmetry breaking": 1, 
          "viscosity": 4
        }
      }, 
      "fast_mode": false

Or maybe someone manually fixed this entry in the meantime?

kaplun commented 7 years ago

Actually it's the second part of this code that returns False:

    score = relevance_prediction.get('max_score')
    decision = relevance_prediction.get('decision')
    all_class_results = classification_results.get('complete_output')
    core_keywords = all_class_results.get('core_keywords')

    return (
        decision.lower() == 'rejected' and
        score > 0 and
        len(core_keywords) == 0
    )

For this record, max_score is actually negative, so the score > 0 condition fails. In fact, this is not even a rejected record. :-1: We should find a different example...
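
To make the reasoning above concrete, here is a minimal, self-contained sketch of the predicate. The function name is_auto_rejectable, the early "classifier did not fire" guard, and the sample values (-0.5, 'Non-CORE') are assumptions for illustration only; just the last three conditions are taken from the snippet quoted above.

```python
def is_auto_rejectable(relevance_prediction, classification_results):
    """Hypothetical wrapper around the conditions quoted in this thread."""
    # Assumed guard: when the classifier produced nothing, do not
    # auto-reject and leave the decision to the curator.
    if not classification_results:
        return False

    score = relevance_prediction.get('max_score')
    decision = relevance_prediction.get('decision')
    all_class_results = classification_results.get('complete_output')
    core_keywords = all_class_results.get('core_keywords')

    # All three conditions must hold for an automatic rejection.
    return (
        decision.lower() == 'rejected' and
        score > 0 and
        len(core_keywords) == 0
    )


# Illustrative values only: a negative score, a non-rejected decision and
# one core keyword, roughly mirroring what is described for record 653398.
prediction = {'max_score': -0.5, 'decision': 'Non-CORE'}
classifier = {'complete_output': {'core_keywords': {'scalar particle': 2}}}
result_for_example = is_auto_rejectable(prediction, classifier)

# A made-up record that would satisfy every condition.
result_for_rejectable = is_auto_rejectable(
    {'max_score': 0.9, 'decision': 'Rejected'},
    {'complete_output': {'core_keywords': {}}},
)
```

With these inputs the first call fails on every condition, so such a record says nothing about whether the classifier fired; a useful example needs a rejected decision and a positive score.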

jacquerie commented 7 years ago

Uh, probably I made a mistake while copying and pasting. Well, here are a few more examples from the first two pages of halted records:

https://labs.inspirehep.net/api/holdingpen/654418
https://labs.inspirehep.net/api/holdingpen/654413
https://labs.inspirehep.net/api/holdingpen/654282
https://labs.inspirehep.net/api/holdingpen/654276
https://labs.inspirehep.net/api/holdingpen/654275

kaplun commented 7 years ago

OK, the first one in the list failed because the downloaded PDF was compressed, due to the bug fixed in #2411. So it makes sense that nothing was extracted from it. I suspect the same misbehavior affects all PDFs downloaded between the time I refactored the workflow to centralize the PDF download and the time #2411 gets deployed.
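
A quick way to spot affected files might be to look at their leading bytes. The sketch below assumes, purely for illustration, that "compressed" means gzip; the thread does not say which compression was involved, and the helper names are hypothetical rather than part of inspire-next.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'   # first two bytes of any gzip stream
PDF_MAGIC = b'%PDF-'       # header that a plain PDF file starts with


def looks_gzipped(data):
    """Return True if the bytes start with the gzip magic number."""
    return data.startswith(GZIP_MAGIC)


def looks_like_pdf(data):
    """Return True if the bytes start with the PDF header."""
    return data.startswith(PDF_MAGIC)


def normalize_pdf_bytes(data):
    """Decompress gzip-wrapped content; return other content unchanged."""
    if looks_gzipped(data):
        return gzip.decompress(data)
    return data


# Fake payload standing in for a downloaded file (not a real PDF).
plain = b'%PDF-1.4 fake body'
wrapped = gzip.compress(plain)
restored = normalize_pdf_bytes(wrapped)
```

A file for which looks_gzipped is true and looks_like_pdf is false would be a candidate for the misbehavior described above, under the gzip assumption.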