inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

HoldingPen: good records automatically rejected #2524

Open ksachs opened 7 years ago

ksachs commented 7 years ago

Two examples where no keyword (CORE or non-CORE) was extracted on labs: there is no decision from the guesser, and good records are rejected automatically. The first record in particular is a clear case. Something went really wrong in the workflow.

https://labs.inspirehep.net/holdingpen/675064
Neutrino Mass Sum-rule and Neutrinoless Double Beta Decay
should have 6 CORE KWs
CP, violation; (0neutrino); neutrino; electroweak interaction; neutrino, mass; Gran Sasso; double-beta decay

https://labs.inspirehep.net/holdingpen/675482
The isolated, uniformly moving electron
should have 3 CORE KWs
Yang-Mills-Higgs theory; Yang-Mills; magnetic monopole; caloron

@fschwenn

ksachs commented 7 years ago

Another example:
https://labs.inspirehep.net/holdingpen/694229
Lensing Bias to CMB Polarization Measurements of Compensated Isocurvature Perturbations
should have 5 CORE KWs
dark matter; curvaton; neutrino; baryon number; inflaton

jacquerie commented 7 years ago

> should have 3 CORE KWs

Sorry @ksachs, but how are you determining that? In particular, are we using the same ontologies/knowledge bases? Because I see that https://github.com/inspirehep/inspire-next/pull/2282 has not yet been merged...

ksachs commented 7 years ago

We were comparing keywords while searching for another bug (BibClassify used only the title/abstract instead of the fulltext). Yes, we are using the same ontology and BibClassify parameters, and where there are keywords on labs they are identical to what we have at DESY. At least as far as I noticed: I don't compare every single KW, and I can't systematically search the holdingpen since there is no API I am capable of using. #2282 is an update of the taxonomy with some additional KWs that are not relevant for this issue.

ksachs commented 6 years ago

2 new examples: https://labs.inspirehep.net/holdingpen/756285 https://labs.inspirehep.net/holdingpen/756286

ksachs commented 6 years ago

https://labs.inspirehep.net/holdingpen/760574

ksachs commented 6 years ago

https://labs.inspirehep.net/api/holdingpen/762556

    "doc": "Mark the workflow object with already-ingested:True.",

Why? arXiv:1710.09270 is not in INSPIRE and 762556 is the only record in the holdingpen.

same for arXiv:1710.09271 https://labs.inspirehep.net/holdingpen/762556

david-caro commented 6 years ago

Just about that last message (762556) (checking one by one):

They were rejected because they are too old (more than 5 days):

(Submitted on 20 Oct 2017)

and were thus considered updates, and discarded, as we don't yet support updates on labs.

The actual function that checks that is:

    "doc": "IF: args(<function previously_rejected at 0x7dfbd70>, [<function mark at 0x7dfbe60>, <function mark at 0x7dfbf50>]); kwargs().",

I know the names are awful, just bear with me for now, we are working on that.
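For illustration, an age cut-off of this kind could look roughly like the sketch below. It is not the inspire-next code: `is_too_old`, `flag_record` and `MAX_AGE_DAYS` are hypothetical names, and returning a `too-many-days` marker is an assumption based on the marker names seen in the holdingpen messages.

```python
from datetime import date, timedelta

# Hypothetical names throughout; this is NOT the inspire-next implementation.
MAX_AGE_DAYS = 5  # the cut-off mentioned above ("more than 5 days")


def is_too_old(submitted_on, today, max_age_days=MAX_AGE_DAYS):
    """Return True when the submission is older than the allowed window."""
    return (today - submitted_on) > timedelta(days=max_age_days)


def flag_record(submitted_on, today, max_age_days=MAX_AGE_DAYS):
    """Return the marker such a check might set, or None if the record passes."""
    if is_too_old(submitted_on, today, max_age_days):
        # Assumed marker name, modelled on the messages seen in the holdingpen.
        return {"too-many-days": True}
    return None
```

With the default window, a paper submitted on 20 Oct 2017 and harvested on, say, 31 Oct 2017 (a made-up harvest date) would be flagged; passing a larger `max_age_days` lets it through, which is why taking the window as a parameter seems preferable to hard-coding it.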

ksachs commented 6 years ago

Thanks David. I thought this time cut-off was disabled. We discussed it last week with Sam. Is it a lot of work to take that out or increase the time window?

david-caro commented 6 years ago

It's not yet removed; we are working on it. Increasing the window should be easy, though. @kaplun, can you take care of it?

kaplun commented 6 years ago

Sure, it's not configurable though. But I can do a quick deployment. I'll set it to 30 days...

kaplun commented 6 years ago

OK! Set to 30 days. Let's see.

ksachs commented 6 years ago

@fschwenn and I searched for missing arXiv articles of 2017. Of those we accepted for INSPIRE, there are

wrong 'already-ingested' or 'too-many-days' (11): only 1 from November
arXiv:1702.05629 "Mark the workflow object with already-ingested:True."
arXiv:1703.05574 "Mark the workflow object with already-ingested:True."
arXiv:1708.07444 "Mark the workflow object with already-ingested:True."
arXiv:1709.04483 "Mark the workflow object with too-many-days:True."
arXiv:1709.06876 "Mark the workflow object with already-ingested:True."
arXiv:1709.10399 "Mark the workflow object with already-ingested:True."
arXiv:1710.04496 "Mark the workflow object with already-ingested:True."
arXiv:1710.04703 "Mark the workflow object with already-ingested:True."
arXiv:1710.00618 "Mark the workflow object with already-ingested:True."
arXiv:1710.07616 "Mark the workflow object with already-ingested:True."
arXiv:1711.06093 "Mark the workflow object with too-many-days:True."

no trace in the HP at all (13) (#2528)
arXiv:1701.01022, arXiv:1701.07062, arXiv:1702.08285, arXiv:1703.05573, arXiv:1708.04550, arXiv:1708.06728, arXiv:1708.07361, arXiv:1708.08897, arXiv:1711.06094, arXiv:1711.06044, arXiv:1711.06547, arXiv:1711.06674, arXiv:1711.09009

and 2 with problems getting keywords, which should result in an error (#2528)
arXiv:1702.04175, arXiv:1710.10630
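The markers quoted in the first group above can be tallied with a short script. This is only a minimal sketch, assuming the "doc" strings have already been collected by hand into a dictionary; `marker_of` and `tally` are hypothetical helper names, and nothing here queries the holdingpen API.

```python
import re
from collections import Counter

# Matches the marker name in messages such as
# "Mark the workflow object with already-ingested:True."
MARKER_RE = re.compile(r"Mark the workflow object with ([\w-]+):True")

# The 11 wrongly rejected records listed above, copied by hand.
DOCS = {
    "arXiv:1702.05629": "Mark the workflow object with already-ingested:True.",
    "arXiv:1703.05574": "Mark the workflow object with already-ingested:True.",
    "arXiv:1708.07444": "Mark the workflow object with already-ingested:True.",
    "arXiv:1709.04483": "Mark the workflow object with too-many-days:True.",
    "arXiv:1709.06876": "Mark the workflow object with already-ingested:True.",
    "arXiv:1709.10399": "Mark the workflow object with already-ingested:True.",
    "arXiv:1710.04496": "Mark the workflow object with already-ingested:True.",
    "arXiv:1710.04703": "Mark the workflow object with already-ingested:True.",
    "arXiv:1710.00618": "Mark the workflow object with already-ingested:True.",
    "arXiv:1710.07616": "Mark the workflow object with already-ingested:True.",
    "arXiv:1711.06093": "Mark the workflow object with too-many-days:True.",
}


def marker_of(doc):
    """Extract the marker name from a "doc" message, or None if absent."""
    match = MARKER_RE.search(doc)
    return match.group(1) if match else None


def tally(docs):
    """Count how many records carry each marker."""
    return Counter(marker_of(doc) for doc in docs.values())


counts = tally(DOCS)
for marker, number in counts.most_common():
    print(marker, number)
```

Grouping by marker makes it easier to see which rejection path is responsible for most of the missing records.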

StellaCh commented 6 years ago

Is this fixed, @michamos?