Open kaplun opened 8 years ago
cc: @david-caro
So we could split it into `INSPIRE_categories` and `external_categories`. `INSPIRE_categories` will not need any scheme or source, so just a list of strings.
I'm always confused about caps and lowercase, isn't it better to keep everything in lowercase? (As if they were variable names in python)
Sure!
@inspirehep/inspire-content @inspirehep/inspire-dir
Given the following facts:
Given the following thoughts:
~It is proposed here to simplify the data model and hence the curation interface in the following way:~
~* store only INSPIRE categories~
~* upon ingestion external categories (e.g. arXiv) are mapped to INSPIRE ones~
~* if there is no external category we use magpie to guess one based on the abstract~
~* upon migration from legacy we preserve INSPIRE categories and we map existing arXiv ones into INSPIRE.~
~* Present to users a facet based on INSPIRE categories.~
Edit: See below
So we only use magpie if there are no external categories?
If we don't have any INSPIRE category (they might come through a different mapping in hepcrawl, for example)
I strongly disagree.
Our user community understands what hep-th means, for instance, and ought to be able to search accordingly
Possibly we could have a 'topic' search, which covers all of INSPIRE's present 'categories', and an arXiv-category search, which is meant to work only on arXiv content and not on a second-guessed INSPIRE attribution.
I think @michamos can contribute his idea here.
OK: New proposal:
- we store arXiv categories (originated from arXiv) into arXiv categories
- we store INSPIRE categories (or topics or subjects) into INSPIRE categories:
  - upon migration from legacy: copying them if they already exist
  - upon migration from legacy: generating them from arXiv categories when available
  - using magpie to guess them from the abstract (or title) in all the other cases.
- We propose to present to users 2 separate facets (initially folded, in order not to waste too much space):
  - INSPIRE categories/topics/subjects (easy): human-friendly names
  - arXiv categories: possibly presented in a foldable tree structure to represent the complex hierarchy of arXiv. Anyway, only categories for which we have papers will be displayed. (in principl
Notes: in principle all papers/records of INSPIRE should belong to at least one INSPIRE category. On the contrary, only records harvested from arXiv will belong to a given arXiv category.
I agree that we should definitely store the arXiv categories. I'm not so sure whether they should appear as facets. As Sam already pointed out, only a subset of records has an arXiv category. I assume it wouldn't be evident to all users that focusing on e.g. hep-th would give only arXiv papers. And why should we single out arXiv papers anyway?
- Annette
I agree. But there might be some non-misleading way to expose such a feature with clever UI, etc.
might be some non-misleading way
Sort of a hierarchical facet comes to mind. First arXiv, then its categories.
BTW: Quite often one does not search for a term, but having the keyword on display gives additional clues about the content. hep-th is usually a different sort of paper than hep-ph or hep-ex. IOW it's not only computers looking at the data, thus IMHO bean counting alone doesn't suffice.
I think this is a perfect example where more data/evidence is needed, as well as more testing - at least for the user facing part. I would recommend consulting Stella for this.
OK, for now we all seem to agree on storing arXiv + INSPIRE categories, we are just not yet sure how to expose that info to the users, right? (So @rikirenz can keep working on the backend part.)
Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit. Usually these are cross-listings which on arXiv mean 'also of interest for...' whereas on INSPIRE they are equivalent to the main category.
Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?
Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit.
Who/what process has added the wrong INSPIRE category in the first place? We currently have a mapping that unambiguously maps an arXiv category to an INSPIRE one:
https://github.com/inspirehep/inspire-next/blob/master/inspirehep/config.py#L1445-L1604
Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?
In principle both searches are possible, but we have stats demonstrating that users are not able to search with them, via search syntax. On the other hand facets are very easy to manipulate. Indeed there is no hurry on this, so we can indeed see the results of user testing the alternatives.
Not wrong in the sense of 'wrong mapping', but wrong in the sense of 'this is not about Experiment-HEP'. At arXiv articles get cross-listed for better visibility 'also of interest for...', at INSPIRE the subject always means 'is about', we don't have the notion of secondary category.
In the new model a record can have more than one INSPIRE subject, so crosslisting is preserved.
IMHO users will not be harmed by having a record belong to multiple categories. It's just a way for a paper to be reached. Once it is reached, the user will be able to make their own judgement. But indeed we are making several assumptions here, so it is worth doing a user check.
Side effect: everything with an INSPIRE category *-HEP will (automatically) get CORE. I don't need / want user-checking to know that 'The Escaramujo Project: instrumentation courses during a road trip across the Americas' is not CORE. And I want to be in control of INSPIRE subjects even for arXiv papers. If you keep arXiv and INSPIRE categories in sync, we completely depend on arXiv. So: deriving the INSPIRE category automatically from arXiv on ingestion: perfect. But later on they should be independent.
@ksachs I think that is the idea, the INSPIRE categories are only guessed/automatically added if they are not there already, and only on ingestion. After that, they can be manually changed if needed.
The point that @kaplun is trying to make, is that we can have multiple INSPIRE ones, just as we have multiple arXiv ones, with the same treatment to main category and multiple secondary ones.
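A minimal sketch of the ingestion rule as described in this comment: keep INSPIRE categories that are already there, otherwise derive them from the arXiv categories, otherwise fall back to a guess. All names here (`assign_inspire_categories`, `ARXIV_TO_INSPIRE`, `guess_with_magpie`) are hypothetical, the mapping is only a small excerpt, and the magpie function is a stand-in rather than the real magpie API.

```python
# Hypothetical sketch: names and the magpie stand-in are assumptions,
# not the actual inspire-next / hepcrawl code.

# Small excerpt of an arXiv -> INSPIRE mapping, for illustration only.
ARXIV_TO_INSPIRE = {
    "hep-th": "Theory-HEP",
    "hep-ph": "Phenomenology-HEP",
    "hep-ex": "Experiment-HEP",
    "gr-qc": "Gravitation and Cosmology",
}


def guess_with_magpie(abstract):
    """Stand-in for a classifier call; always returns a fixed guess here."""
    return ["Other"]


def assign_inspire_categories(record):
    """Return INSPIRE categories for a record at ingestion time.

    1. Keep existing INSPIRE categories untouched.
    2. Otherwise derive them from the arXiv categories via the mapping.
    3. Otherwise fall back to a guess based on the abstract.
    """
    existing = record.get("inspire_categories")
    if existing:
        return list(existing)

    derived = []
    for arxiv_cat in record.get("arxiv_categories", []):
        term = ARXIV_TO_INSPIRE.get(arxiv_cat)
        if term and term not in derived:
            derived.append(term)
    if derived:
        return derived

    return guess_with_magpie(record.get("abstract", ""))
```

Because the function only runs at ingestion and never overwrites existing values, later manual changes by a cataloger would stay independent of arXiv.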
Yeah, plus I was really curious to know if there was maybe something we could correct, since you were mentioning that you were actually removing INSPIRE categories. Fully understood now.
- In the last 6 months only ~0.01% users queried INSPIRE using arXiv categories (or any category at all)
Correcting myself: the ratio of
SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction AND
name LIKE '% fc %'
);
and
SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction
);
is 0.07%. Still low, but mostly I wanted to write down how to compute this kind of number if someone else is interested.
Also: this is the ratio of queries containing `fc`, not users. Computing the ratio of users can be done, but is slightly harder.
It's actually pretty easy. The ratio of:
SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction AND
name LIKE '% fc %'
);
and
SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
type=8 AND
idaction_name=idaction
);
is 0.1%.
That is, of all distinct users that made at least one query on INSPIRE in the last six months, 0.1% of them used the `fc` keyword.
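To make the difference between the two numbers concrete, here is a small sketch that computes both ratios from an in-memory list of (visitor, query) pairs. The function name `fc_usage` and the input shape are assumptions made for illustration; the real numbers above come from the Piwik tables, not from this code.

```python
def fc_usage(log):
    """Compute fc usage from (visitor_id, query) pairs.

    Returns a pair of fractions:
    - share of queries containing ' fc '
    - share of distinct visitors who issued at least one such query
    """
    log = list(log)
    if not log:
        return 0.0, 0.0
    # Mirrors the LIKE '% fc %' filter, but case-sensitively.
    fc_rows = [(visitor, query) for visitor, query in log if " fc " in query]
    query_ratio = len(fc_rows) / len(log)
    visitors = {visitor for visitor, _ in log}
    fc_visitors = {visitor for visitor, _ in fc_rows}
    return query_ratio, len(fc_visitors) / len(visitors)


# Toy data, invented for the example.
sample = [
    ("a", "find fc t"),
    ("a", "find a witten"),
    ("b", "find t susy"),
    ("c", "find fc p and a x"),
]
query_ratio, user_ratio = fc_usage(sample)
```

The two ratios differ whenever heavy users issue many queries, which is why the per-query figure (0.07%) and the per-user figure (0.1%) do not coincide.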
@annetteholtkamp rightly points out, of course, that it does make quite a difference to know if an INSPIRE category had been:
- guessed automatically (`magpie`)
- derived from arXiv (`arxiv`)
- assigned by a cataloger (`cataloger`).
So finally it seems that in our data structure we still need to capture the source for INSPIRE categories.
If the INSPIRE category guess is done only on ingestion, and only if no category is there already, I don't think it adds much value to track the source of the categories. Let me explain (it's very possible that I just don't get your point ;), I don't see @annetteholtkamp's rationale here).
The arXiv keywords come from outside, and will never change, so letting them have their own section already expresses the source (passive info).
For the INSPIRE ones, we just care whether they are right or not; if they are not right, we just change them, no matter whether they came from magpie, from arXiv or from a cataloger, no?
I only see a couple of use cases where we would want that info, to train better magpie by comparing the result it gave with the one from the cataloger, or something like getting a list of all the records with automated categories... though I don't see that being very useful, as we can just train magpie with the whole set of data just expecting it to be correct (as it passed validation from a cataloger).
How do you envisage a workflow where a cataloger checks a record only 2 weeks after it arrives on INSPIRE? At that point we have already ingested it, and the INSPIRE category has already been guessed from arXiv/magpie. How can the cataloger know whether the INSPIRE category can be trusted?
Beside this small point I agree with you that in general we don't care between arXiv Vs. magpie.
Well, it does not matter who put it there, no? Just whether it is correct or not, and for that you don't care about the origin, right? Unless there's some extra work associated with checking the correctness itself. In that case it makes sense to add that 'already verified' flag.
It's about the trust level. Up to now the only sources for subject categories were catalogers (Annette, Florian and me) or arXiv (obvious to spot). If we have more sources (magpie, journal category, ...) it would be nice to know whether it is worth a second look.
Excellent point, @ksachs. This calls for the notion of provenance in the record (for later audit, should the need arise). Didn't Invenio offer something on that?
@salmele nothing out of the box. It's up to us to decide what provenance to store for what and where, and to implement support for it.
Invenio supports basic history information, such as 'this version of record X was touched by curator Y', but we don't know which field is actually involved.
OK. So all in all, we don't simplify the data model WRT INSPIRE subjects, but we keep them as a list of objects with two attributes: `source` (enum with `curator`, `magpie`, `arxiv`) and `value` (enum among the current list of subjects).
BTW what shall we call it? `inspire_subjects` or `_categories` or `_topics`?
@salmele Quick question, what should we do when:
This (and choosing the name) is currently blocking @rikirenz on #47.
My concern is to preserve arXiv categories for arXiv-ingested preprints so that users can search for them (with facets), as they are of course aware of what arXiv categories are.
I am neutral about what to do with INSPIRE subjects and I recommend @ksachs or @annetteholtkamp decide how they want to define the ontology of sources and their priority
I concur.
@david-caro IMHO there's not going to be arXiv+magpie (since magpie is used only when there is no arXiv category), and cataloger wins over the two. So in case there is the same term for arXiv and cataloger, we preserve the cataloger one.
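A minimal sketch of the precedence rule stated in this comment, assuming categories are held as (term, source) pairs. The function name `dedupe_by_source` and the `SOURCE_PRIORITY` table are hypothetical; the relative order of `arxiv` and `magpie` is an assumption, since the comment only says that `cataloger` wins over both.

```python
# Lower number = higher priority. Only "cataloger first" comes from the
# thread; the arxiv/magpie order is an assumption for this sketch.
SOURCE_PRIORITY = {"cataloger": 0, "arxiv": 1, "magpie": 2}


def dedupe_by_source(categories):
    """Keep one (term, source) pair per term, preferring the best source.

    The order of first appearance of each term is preserved.
    """
    best = {}
    for term, source in categories:
        if term not in best or SOURCE_PRIORITY[source] < SOURCE_PRIORITY[best[term]]:
            best[term] = source
    return list(best.items())


merged = dedupe_by_source([
    ("Theory-HEP", "arxiv"),
    ("Lattice", "magpie"),
    ("Theory-HEP", "cataloger"),
])
```

With this rule a term never appears twice, which is the simpler of the two options under discussion; keeping the same term under several sources would need a different uniqueness check.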
Actually, we could run magpie also on arXiv papers and keep magpie’s suggestion in case of difference to arXiv. This could be helpful for the content people as a hint to carefully check the subject.
For the avoidance of doubt: using magpie or human classification on arXiv would create a new INSPIRE subject and NOT modify the arXiv category (and cross-listing) as chosen by the submitter (which we'd show somewhere else)... right?
@salmele right: arXiv categories are never going to be touched.
So I propose then having this:
- `arxiv_categories` with the categories list from arxiv, that are just populated on ingestion.
- `inspire_categories` with a list of term-source tuples that must be unique, in the sense that there's no repeated tuple, though it might be that we have the same term with different sources:
  - `term` is the actual name of the category it represents.
  - `source` is one of (this list might get extended in time):
    - `cataloger` (for manually set categories)
    - `arxiv` (for categories derived from the arxiv ones)
    - `magpie` (for automatically guessed categories)
    - `undefined` (for migrated records that have no source on their categories, not sure if there are any though, but just in case).
Do we want to go for the `undefined`? In general so far, when something is `undefined` we simply haven't specified a value. In Invenio 1 this had some implications because it wasn't possible to search for those things having no value. Not sure for Elasticsearch. @jacquerie ?
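The proposed layout could be sketched as follows. This is only an illustration under assumptions: the `Source` enum and `validate_inspire_categories` are hypothetical names, and the real schema would presumably live in JSON Schema rather than in Python code.

```python
from enum import Enum


class Source(Enum):
    """Allowed sources, as listed in the proposal (may be extended later)."""
    CATALOGER = "cataloger"
    ARXIV = "arxiv"
    MAGPIE = "magpie"
    UNDEFINED = "undefined"


def validate_inspire_categories(categories):
    """Check a list of {"term", "source"} dicts.

    Raises ValueError on an unknown source or on a repeated
    (term, source) pair; the same term with different sources is allowed.
    """
    seen = set()
    for cat in categories:
        source = Source(cat["source"])  # raises ValueError for unknown values
        key = (cat["term"], source)
        if key in seen:
            raise ValueError(
                "duplicate category: %s from %s" % (cat["term"], source.value)
            )
        seen.add(key)
    return True


# Example record: same term from two sources is valid under the proposal.
record = {
    "arxiv_categories": ["hep-th", "cond-mat.stat-mech"],
    "inspire_categories": [
        {"term": "Theory-HEP", "source": "arxiv"},
        {"term": "Theory-HEP", "source": "cataloger"},
    ],
}
```

If `undefined` were dropped in favour of simply omitting the source, the enum would lose one member and the `source` key would become optional.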
A field arxiv_categories with the categories list from arxiv, that are just populated on ingestion
...preserving the concept of primary category in arXiv (current PRIMARC in SPIRES syntax and dedicated MARC field)
Yup. This is something that can be implemented at search time. The first arXiv category in the list of arXiv categories will be searchable also with primarc.
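A minimal sketch of that convention, assuming `arxiv_categories` is an ordered list whose first element is the primary category. The helper names are hypothetical and this is not the actual search implementation; the example categories are those of record 1501963 as discussed later in this thread.

```python
def split_arxiv_categories(arxiv_categories):
    """Return (primary, secondaries) from an ordered list of arXiv categories."""
    if not arxiv_categories:
        return None, []
    return arxiv_categories[0], list(arxiv_categories[1:])


def matches_primarc(record, category):
    """True if `category` is the record's primary (first-listed) arXiv category."""
    primary, _ = split_arxiv_categories(record.get("arxiv_categories", []))
    return primary == category


record = {"arxiv_categories": ["hep-th", "cond-mat.stat-mech", "physics.flu-dyn"]}
primary, secondaries = split_arxiv_categories(record["arxiv_categories"])
```

The design relies on ingestion preserving the order given by arXiv; if the list were ever sorted or deduplicated carelessly, the primary category would be lost.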
(Ticketized in inspirehep/inspire-next#1791 so we don't forget)
TL;DR: adopt arXiv categories instead of INSPIRE categories
My opinion is rather different from those expressed before. So let me summarize some of the different viewpoints as far as I understand them:
One point of view is largely missing, the user point of view, which is probably the most important. I was a simple user until not too long ago, and my point of view is that INSPIRE categories are terrible, and I would suggest getting rid of them in their current form, for the following reasons.
as @jacquerie showed, ~0.1% of our users ever tried to perform a search with the 'fc' keyword. Of those, I am sure a large fraction is doing it wrong, e.g. `fc hep-th` instead of `fc t`. On the contrary, arXiv categories have been user-visible for 25 years and everyone knows what `hep-th` means. This goes to the extent that even inside INSPIRE, we use arXiv categories for HEPNAMES and jobs instead of the INSPIRE ones. So we don't have any legacy to preserve here, and we can use the migration as an opportunity to rethink the whole concept. I would suggest sticking as closely as possible to the arXiv categories that people are familiar with. (For anecdotal evidence, I had never heard of INSPIRE categories as a user, and am still struggling to remember the less common ones.)
there are three types of categories:
See the list on the bottom of the INSPIRE categories and the arXiv categories that map to it. They are inconsistent, as General Physics is NOT the same as physics.gen-ph (which is a euphemism for garbage on the arXiv), and Data Analysis and Statistics does not contain the stat.* statistics categories. And who knows what Other means?
the purpose of categories is to organize the records in such a way that the user can easily filter them and focus on what she is interested in. Either she is interested in a small category, in which case she is conceptually looking for a single arXiv category under another name, or she is not. In the latter case, the category is probably useless anyway. For example, someone who wants to know more about entanglement entropy (which is a hot subject right now in hep-th, but was born in quant-ph and is used quite a lot in cond-mat) is not interested in whether the paper is General Physics, but in the distinction between quant-ph and cond-mat.*. More importantly, our coverage will probably be quite poor in those areas, but the user is not necessarily aware of this fact.
So I would suggest overhauling INSPIRE categories to match arXiv categories. For arXiv papers, we adopt the arXiv category as the INSPIRE category (keeping the distinction primary/secondary also on INSPIRE, and allowing the user to search/facet based on primary only or both primary and secondary). For records that do not come from arXiv, we put them in a specific category (e.g. physics.acc-ph) if we care about it, or we put them in a top-level category (e.g. astro-ph or math) if we don't want to zoom in.
If we keep the primary/secondary distinction, I don't think there would be any need for overriding categories (maybe @ksachs still sees a usecase), as we can trust the arXiv moderators which vastly outnumber the handful of people assigning categories in INSPIRE and are active researchers in their fields. But we could still have the provenance field just to be sure, and for the case in which we add a non-arXiv record that is later added to arXiv. And assigning articles to top-level categories should not be too difficult, and the number of them still manageable for magpie (with the added benefit that we could train it on arXiv data).
{'Accelerators': ['physics.acc-ph'],
'Astrophysics': ['physics.space-ph',
'astro-ph.EP',
'astro-ph.GA',
'astro-ph.SR',
'astro-ph.CO',
'astro-ph',
'astro-ph.HE'],
'Computing': ['cs.ET',
'cs.RO',
'cs.CY',
'cs.NE',
'cs.DB',
'cs.DC',
'cs.SI',
'cs.DL',
'cs.CV',
'cs.FL',
'cs.SD',
'cs.PF',
'cs.LG',
'cs.DS',
'cs.OH',
'cs.OS',
'cs.LO',
'cs.MM',
'cs.AI',
'physics.comp-ph',
'cs.GT',
'cs.IR',
'cs.NA',
'cs.SE',
'cs.CL',
'cs.CG',
'cs.DM',
'cs.SY',
'cs.GL',
'cs.IT',
'cs.CR',
'cs.MS',
'cs.SC',
'cs.CC',
'cs.AR',
'cs.GR',
'cs.NI',
'cs.MA',
'cs.PL',
'cs.CE',
'cs.HC',
'cs'],
'Data Analysis and Statistics': ['physics.data-an'],
'Experiment-HEP': ['hep-ex'],
'Experiment-Nucl': ['nucl-ex'],
'General Physics': ['quant-ph',
'cond-mat.stat-mech',
'cond-mat.mes-hall',
'cond-mat.supr-con',
'physics.plasm-ph',
'cond-mat',
'cond-mat.other',
'physics.class-ph',
'nlin',
'cond-mat.quant-gas',
'cond-mat.dis-nn',
'nlin.CD',
'cond-mat.soft',
'nlin.CG',
'cond-mat.str-el',
'physics.ao-ph',
'cond-mat.mtrl-sci',
'physics.gen-ph',
'nlin.AO',
'physics.atm-clus',
'physics.flu-dyn',
'physics.atom-ph',
'physics.optics',
'physics',
'physics.geo-ph'],
'Gravitation and Cosmology': ['gr-qc'],
'Instrumentation': ['astro-ph.IM', 'physics.ins-det'],
'Lattice': ['hep-lat'],
'Math and Math Physics': ['patt-sol',
'math.GT',
'math.CV',
'math.MP',
'math.GM',
'math.PR',
'math.GR',
'math.DG',
'math.NA',
'math.AP',
'math.CA',
'math.LO',
'math.NT',
'math.AG',
'math.KT',
'q-alg',
'math.ST',
'math.CT',
'math.QA',
'alg-geom',
'math',
'math.DS',
'math.FA',
'math.CO',
'math.SP',
'math.MG',
'math.GN',
'math.AT',
'nlin.PS',
'math.OC',
'math.SG',
'math.HO',
'math.RT',
'math.IT',
'math.RA',
'math.OA',
'math-ph',
'dg-ga',
'math.AC',
'solv-int',
'nlin.SI'],
'Other': ['q-fin.TR',
'q-bio.PE',
'q-bio.CB',
'q-bio.BM',
'q-fin.GN',
'q-fin.PR',
'stat.AP',
'physics.chem-ph',
'physics.pop-ph',
'q-bio.MN',
'stat.CO',
'stat.ML',
'physics.hist-ph',
'q-fin.CP',
'stat.OT',
'q-bio.TO',
'q-fin.EC',
'q-bio.GN',
'q-fin.PM',
'physics.med-ph',
'stat.TH',
'physics.bio-ph',
'q-bio.SC',
'physics.soc-ph',
'physics.ed-ph',
'q-bio.OT',
'q-bio.QM',
'q-fin.ST',
'q-bio.NC',
'q-fin.RM',
'q-fin.MF',
'stat.ME'],
'Phenomenology-HEP': ['hep-ph'],
'Theory-HEP': ['hep-th'],
'Theory-Nucl': ['nucl-th']}
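For reference, a table like the one above can be inverted to get the arXiv -> INSPIRE direction and to check claims such as where the stat.* categories end up. The sketch below uses only a small excerpt of the table; `INSPIRE_TO_ARXIV`, `invert` and `stat_targets` are names made up for this example.

```python
# Excerpt of the INSPIRE -> arXiv table above (not the full mapping).
INSPIRE_TO_ARXIV = {
    "Data Analysis and Statistics": ["physics.data-an"],
    "General Physics": ["quant-ph", "physics.gen-ph", "cond-mat.str-el"],
    "Other": ["stat.AP", "stat.ML", "physics.soc-ph"],
    "Theory-HEP": ["hep-th"],
}


def invert(mapping):
    """Build arXiv category -> INSPIRE category, refusing ambiguous entries."""
    inverse = {}
    for inspire_cat, arxiv_cats in mapping.items():
        for arxiv_cat in arxiv_cats:
            if arxiv_cat in inverse:
                raise ValueError("%s maps to two INSPIRE categories" % arxiv_cat)
            inverse[arxiv_cat] = inspire_cat
    return inverse


ARXIV_TO_INSPIRE = invert(INSPIRE_TO_ARXIV)

# Which INSPIRE categories receive the stat.* arXiv categories?
stat_targets = {
    inspire_cat
    for arxiv_cat, inspire_cat in ARXIV_TO_INSPIRE.items()
    if arxiv_cat.startswith("stat.")
}
```

On this excerpt the stat.* categories land in Other rather than in Data Analysis and Statistics, which is the inconsistency pointed out in the comment.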
IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data, I'd try to get some from the users before doing any big effort either way (that might require some effort too, but having the machinery to easily get that kind of things will help on deciding on other features too).
@michamos for us developers, can you exactly define or point us to the definition of what is:
E.g. with some concrete example to reason on.
@michamos for us developers, can you exactly define or point us to the definition of what is:
- primary arXiv category: the thing we put into 037, e.g. for https://inspirehep.net/record/1501963 it is `hep-th`
- secondary arXiv category: for the same record, they are `cond-mat.stat-mech` and `physics.flu-dyn`
- top-level arXiv category: a category in bold on https://arxiv.org/, which might have subcategories, so `hep-th` or `physics` (probably using `math` instead of `math.*` though, to take also `math-ph`).
E.g. with some concrete example to reason on.
IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data
the fact that they are hidden and completely unknown is an opportunity to rethink INSPIRE categories IMHO
Hi Micha,
thanks for your detailed input from the user side. Point taken - visibility is an issue. A pity that we don't have usage statistics from SPIRES - that's our legacy. Both field-codes (= subject, category, ...) and keywords are neglected on INSPIRE.
However you are talking about something you barely know (as you say yourself). The main use-case was that people would browse (we even had printed lists) through the daily inputs in their category. And from the arXiv listings I believe this is still the case.
First some statistics - records added the last 2 years:
year 2016 / 2015 / 2015core
all 76795 / 51412 / 28802
arxiv 22558 / 25454 / 17228
a 10595 / 10957 / 4982
b 6667 / 6007 / 1494
c 912 / 610 / 351
e 5313 / 5311 / 5087
g 5941 / 6463 / 4479
i 5143 / 5086 / 3171
l 1063 / 1205 / 1202
m 2718 / 3160 / 1548
n 19345* / 4451 / 2064
o 422 / 503 / 169
p 9304 / 9008 / 8750
q 3894 / 4194 / 1692
t 6819 / 7422 / 7304
x 17809* / 3717 / 1568
* reharvest of nucl journals
Btw: other = non-physics
We have one big category (with a lot of non-core stuff we don't want to waste time on): astro; two small categories: lattice and computing; all the rest is around 5k/y. This shows that the balance is not too bad.
You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv. And we have only a very selective part of arXiv. Where arXiv categories are split in subcategories we barely have records. For core records the big categories are p (9k), t (7k) and e (5k) which we might want to break down into smaller sub-categories; which SPIRES used to have, but that was before my time. Actually SPIRES had categories before arXiv existed.
The assignment of categories has to be feasible which it is not if the categories are too finegrained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators.
Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject. 10% I dumped right away, for about 20% I went through and made corrections, about 70% were useful. Which means even if we run magpie over everything that doesn't come from arXiv we still have ~7k/y to assign manually.
@ksachs thanks for your comments. I agree that having categories is very useful, I just think those categories should be closer to the arXiv ones that people know. In this way we can avoid having our users learn and remember what, from their point of view, is a new classification scheme which is very similar yet subtly different in some areas.
The main use-case was that people would browse (we even had printed lists) through the daily inputs in their category. And from the arXiv listings I believe this is still the case.
People browse the new arXiv listings (e.g. hep-th/new), it would actually be very nice if one could do the same on INSPIRE, with all new papers in a given category. So by adopting arXiv categories on INSPIRE, one could get, say, arXiv hep-th + non-arXiv hep-th additions.
Btw: other = non-physics
I know that Other = None of the rest, but in order to know what it means precisely one has to know all the other categories that we have
You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv. And we have only a very selective part of arXiv. Where arXiv categories are split in subcategories we barely have records.
That's why it would be good to facet in a hierarchical way, so that we can have the full math category collapsed into math at first, but with possibility to expand it if one is interested in math.RT but not math.AP.
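A minimal sketch of such a hierarchical facet, assuming each record carries an `arxiv_categories` list and that the top level is simply the part before the first dot. The function names are hypothetical, and this naive rule keeps `math-ph` as its own top level rather than folding it into `math`.

```python
from collections import Counter, defaultdict


def top_level(category):
    """'math.RT' -> 'math'; categories without a dot are their own top level."""
    return category.split(".", 1)[0]


def hierarchical_facet(records):
    """Count records per top-level arXiv category and per subcategory.

    Returns {top_level: {"count": n, "children": {subcategory: n}}}.
    A record counts once per top level, even if it has several
    subcategories under it.
    """
    facet = defaultdict(lambda: {"count": 0, "children": Counter()})
    for record in records:
        cats = set(record.get("arxiv_categories", []))
        for top in {top_level(cat) for cat in cats}:
            facet[top]["count"] += 1
        for cat in cats:
            if "." in cat:
                facet[top_level(cat)]["children"][cat] += 1
    return {
        top: {"count": entry["count"], "children": dict(entry["children"])}
        for top, entry in facet.items()
    }


facet = hierarchical_facet([
    {"arxiv_categories": ["math.RT", "math.AP"]},
    {"arxiv_categories": ["math.RT", "hep-th"]},
])
```

The collapsed view would show only the top-level counts, and expanding `math` would reveal the per-subcategory counts under `children`.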
The assignment of categories has to be feasible which it is not if the categories are too finegrained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators.
I was mentioning moderators for the arXiv content. For non-arXiv, we could do it ourselves, but based on the arXiv top-level categories if it's something we don't care about (so physics or math or cs or ...).
Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject. 30% I dumped right away, for about 20% I went through and made corrections, about 50% were useful. Which means even if we run magpie over everything that doesn't come from arXiv we still have ~10k/y to assign manually.
What I have seen of magpie was very impressive. I would guess that magpie didn't have good training data in this case. By adopting arXiv categories, we could very easily train magpie on the arXiv corpus.
Currently INSPIRE categories are stored at the same level of arXiv categories and other externally provided categories.
However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged place.
It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).