inveniosoftware-contrib / invenio-classifier

Invenio module for record classification.
GNU General Public License v2.0

workflows: classify_paper adds puzzling keywords #31

Closed jacquerie closed 6 years ago

jacquerie commented 7 years ago

From @jacquerie on June 7, 2017 10:10

For example: https://labs.inspirehep.net/holdingpen/648620 has the keyword gravitational radiation, which doesn't appear anywhere in the PDF, since this is a fluid dynamics paper. They both relate in some way to the wave equation, but that's it...

We might want to fix this before declaring https://github.com/inspirehep/inspire-next/issues/2309 as fixed.

Copied from original issue: inspirehep/inspire-next#2415

jacquerie commented 7 years ago

CC: @kaplun

jacquerie commented 7 years ago

From @kaplun on June 7, 2017 11:44

I'll look at it ASAP.

jacquerie commented 7 years ago

Uh, wait a second, the paper actually contains 4 instances of "gravity waves", so invenio-classifier might not be completely wrong here. But shouldn't it output keywords that are present verbatim in the paper? Or does it try to be smart?

There's still a problem if it tries to be smart, because it looks like it's mixing https://en.wikipedia.org/wiki/Gravity_wave and https://en.wikipedia.org/wiki/Gravitational_wave.

jacquerie commented 7 years ago

From @kaplun on June 7, 2017 14:34

It does try to be smart. I think it does some fuzzification. The inner spaghetti code is quite large indeed.

jacquerie commented 7 years ago

Well, then this issue should probably be moved to invenio-classifier. I was thinking that something more sinister was at play here, like classify_paper being called on the wrong PDF or something like that.

BTW I'd say that the real problem for #2309 is #2413, not this issue or #2414.

ksachs commented 6 years ago

BibClassify has to be smart. Physicists are not nice to us and don't use our standard keywords. Both communities use the phrase 'gravity wave'. It's an acronym in the taxonomy and SHOULD be translated to 'gravitational radiation'. It's not a bug but a feature we can not avoid. Do not change this behavior. Sorry I didn't chime in earlier, but I was not aware of this issue.
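
To make the behavior described above concrete, here is a minimal sketch of how a taxonomy with alternative labels can yield a keyword that never appears verbatim in the text. This is not invenio-classifier's actual code or API: the `TAXONOMY` dictionary, its entries, and the `extract_keywords` function are all hypothetical, and the matching (case-insensitive, optional plural "s") is a simplifying assumption.

```python
import re

# Hypothetical toy taxonomy: preferred keyword -> alternative labels.
# Entries are illustrative only, not copied from the real taxonomy.
TAXONOMY = {
    "gravitational radiation": ["gravity wave", "gravitational wave"],
    "turbulence": ["turbulent flow"],
}


def extract_keywords(text):
    """Return {preferred keyword: match count}, counting alternative labels too."""
    counts = {}
    lowered = text.lower()
    for preferred, alt_labels in TAXONOMY.items():
        total = 0
        for label in [preferred] + alt_labels:
            # Whole-word match with an optional plural "s" (an assumption).
            pattern = r"\b" + re.escape(label) + r"s?\b"
            total += len(re.findall(pattern, lowered))
        if total:
            counts[preferred] = total
    return counts


sample = "Surface gravity waves in a fluid. Gravity waves obey the wave equation."
print(extract_keywords(sample))  # {'gravitational radiation': 2}
```

In this sketch the sample text never contains "gravitational radiation", yet that keyword is reported because "gravity wave" is listed as one of its alternative labels. The fluid dynamics paper in this issue presumably hit the same kind of mapping.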

jacquerie commented 6 years ago

> It's not a bug but a feature we can not avoid. Do not change this behavior.

Ok! Then we can close this.