inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

workflows: classify_paper adds puzzling keywords #2415

Closed by jacquerie 7 years ago

jacquerie commented 7 years ago

For example: https://labs.inspirehep.net/holdingpen/648620 has the keyword "gravitational radiation", which doesn't appear anywhere in the PDF, as this is a fluid dynamics paper. Both topics relate in some way to the wave equation, but that's it...

We might want to fix this before declaring https://github.com/inspirehep/inspire-next/issues/2309 as fixed.

jacquerie commented 7 years ago

CC: @kaplun

kaplun commented 7 years ago

I'll look at it ASAP.

jacquerie commented 7 years ago

Uh, wait a second, the paper actually contains 4 instances of "gravity waves", so invenio-classifier might not be completely wrong here. But shouldn't it output keywords that are present verbatim in the paper? Or does it try to be smart?

There's still a problem if it tries to be smart, because it looks like it's mixing https://en.wikipedia.org/wiki/Gravity_wave and https://en.wikipedia.org/wiki/Gravitational_wave.

kaplun commented 7 years ago

It does try to be smart. I think it does some fuzzification. The inner spaghetti code is quite large indeed.
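
To illustrate how a thesaurus-style matcher can emit a keyword that never appears verbatim in the text, here is a minimal sketch. It is not invenio-classifier's actual code: the `TAXONOMY` dictionary, its entries, and the `classify` function are hypothetical and exist only to show the effect. If the taxonomy lists "gravity wave" as a variant of "gravitational radiation", a fluid dynamics text is tagged with the latter.

```python
import re

# Hypothetical toy taxonomy: preferred keyword -> surface forms that count as a hit.
# These entries are illustrative assumptions, not taken from any real taxonomy file.
TAXONOMY = {
    "gravitational radiation": [
        "gravitational radiation",
        "gravitational wave",
        "gravity wave",
    ],
    "fluid dynamics": ["fluid dynamics", "hydrodynamics"],
}


def classify(text):
    """Return {preferred keyword: hit count} for keywords with at least one hit."""
    counts = {}
    for keyword, variants in TAXONOMY.items():
        # Match any variant on word boundaries, allowing an optional plural "s".
        alternatives = "|".join(re.escape(v) for v in variants)
        pattern = r"\b(?:%s)s?\b" % alternatives
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[keyword] = hits
    return counts


sample = (
    "Internal gravity waves in a stratified fluid. "
    "Gravity waves transport momentum."
)
print(classify(sample))
```

With a mapping like this, the output keyword is the preferred label rather than the matched phrase, so "gravity waves" in the text shows up as "gravitational radiation" in the results.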

jacquerie commented 7 years ago

Well, then this issue should probably be moved to invenio-classifier. I was thinking that something more sinister was at play here, like classify_paper being called on the wrong PDF or something like that.

BTW I'd say that the real problem for #2309 is #2413, not this issue or #2414.

jacquerie commented 7 years ago

This issue was moved to inveniosoftware-contrib/invenio-classifier#31