inspirehep / inspire-schemas

Inspire JSON schemas and utilities to use them.
GNU General Public License v2.0

Privilege INSPIRE categories #41

Open kaplun opened 7 years ago

kaplun commented 7 years ago

Currently INSPIRE categories are stored at the same level of arXiv categories and other externally provided categories.

However, while catalogers are not supposed to touch externally provided categories, they are expected to curate INSPIRE categories. This puts INSPIRE categories in a privileged place.

It is proposed to move INSPIRE categories to a dedicated field, so that they are easier to edit (e.g. autocomplete can be enforced on the exact INSPIRE categories).

rikirenz commented 7 years ago

cc: @david-caro

kaplun commented 7 years ago

So we could split it into INSPIRE_categories and external_categories. INSPIRE_categories will not need any scheme or source, so it is just a list of strings.

david-caro commented 7 years ago

I'm always confused about caps and lowercase, isn't it better to keep everything in lowercase? (As if they were variable names in python)

kaplun commented 7 years ago

Sure!

kaplun commented 7 years ago

@inspirehep/inspire-content @inspirehep/inspire-dir

Given the following facts:

Given the following thoughts:

~It is proposed here to simplify the data model and hence the curation interface in the following way:~

  • ~store only INSPIRE categories~
  • ~upon ingestion external categories (e.g. arXiv) are mapped to INSPIRE ones~
  • ~if no external category we use magpie to guess one based on abstract~
  • ~upon migration from legacy we preserve INSPIRE categories and we map existing arXiv ones into INSPIRE.~
  • ~Present to users a facet based on INSPIRE categories.~

Edit: See below

david-caro commented 7 years ago

So we only use magpie if there are no external categories?

kaplun commented 7 years ago

If we don't have any INSPIRE category (they might come through a different mapping in hepcrawl, for example)

salmele commented 7 years ago

I strongly disagree.

Our user community understands what hep-th means, for instance, and ought to be able to search accordingly

Possibly we could have a 'topic' search, which covers all of INSPIRE's present 'categories', and an arXiv-category search, which is meant to work only on arXiv content and not on a second-guessed INSPIRE attribution.

jacquerie commented 7 years ago

I think @michamos can contribute his idea here.

kaplun commented 7 years ago

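OK: New proposal:

  • we store arXiv categories (originated from arXiv) into arXiv categories
  • we store INSPIRE categories (or topics or subjects) into INSPIRE categories:
      • upon migration from legacy: copying them if they already exist
      • upon migration from legacy: generating them from arXiv categories when available
      • using magpie to guess them from the abstract (or title) in all the other cases.
  • We propose to present to user 2 separate facets (initially folded, in order not to waste too much space):
      • INSPIRE categories/topics/subjects (easy): human-friendly names
      • arXiv categories: possibly presented into a foldable tree structure to represent the complex hierarchy of arXiv. Anyway only categories for which we have papers will be displayed.

Notes: in principle all papers/records of INSPIRE should belong to at least one INSPIRE category. On the contrary only records harvested from arXiv will belong to a given arXiv categories.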

annetteholtkamp commented 7 years ago

I agree that we should definitely store the arxiv categories. I'm not so sure whether they should appear as facets. As Sam already pointed out only a subset of records has an arXiv category. I assume it wouldn't be evident to all users that focusing on hep-th e.g. would give only arXiv papers. And why should we single out arXiv papers anyway?


bing13 commented 7 years ago

I agree. But there might be some non-misleading way to expose such a feature with clever UI, etc.

On Fri, Dec 2, 2016 at 10:52 PM, annetteholtkamp notifications@github.com wrote:

I agree that we should definitely store the arxiv categories. I'm not so sure whether they should appear as facets. As Sam already pointed out only a subset of records has an arXiv category. I assume it wouldn't be evident to all users that focusing on hep-th e.g. would give only arXiv papers. And why should we single out arXiv papers anyway?

  • Annette

On Dec 2, 2016, at 4:48 PM, Samuele Kaplun <notifications@github.com> wrote:

OK: New proposal:

  • we store arXiv categories (originated from arXiv) into arXiv categories
  • we store INSPIRE categories (or topics or subjects) into INSPIRE categories:
      • upon migration from legacy: copying them if they already exist
      • upon migration from legacy: generating them from arXiv categories when available
      • using magpie to guess them from the abstract (or title) in all the other cases.
  • We propose to present to user 2 separate facets (initially folded, in order not to waste too much space):
      • INSPIRE categories/topics/subjects (easy): human-friendly names
      • arXiv categories: possibly presented into a foldable tree structure to represent the complex hierarchy of arXiv. Anyway only categories for which we have papers will be displayed. (in principl

Notes: in principle all papers/records of INSPIRE should belong to at least one INSPIRE category. On the contrary only records harvested from arXiv will belong to a given arXiv categories.


aw-bib commented 7 years ago

might be some non-misleading way

Sort of a hierarchical facet comes to mind. First arXiv, then its categories.

BTW: Quite often one does not search for a term, but having the keyword on display gives additional clues about the content. hep-th is usually a different sort of paper than hep-ph or hep-ex. IOW it's not only computers looking at the data, thus IMHO bean counting alone doesn't suffice.

suenjedt commented 7 years ago

I think this is a perfect example where more data/evidence is needed, as well as more testing - at least for the user facing part. I would recommend consulting Stella for this.

david-caro commented 7 years ago

Okok, for now, we all seem to agree on storing arXiv + inspire categories, just not yet sure how to expose that info to the users, right? (so @rikirenz can keep working on the backend part)

ksachs commented 7 years ago

Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit. Usually these are cross-listings which on arXiv mean 'also of interest for...' whereas on INSPIRE they are equivalent to the main category.

Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?

kaplun commented 7 years ago

Florian and I delete INSPIRE categories derived from arXiv categories if they don't fit.

Who/what process has added the wrong INSPIRE category in the first place? We currently have a mapping that unambiguously maps an arXiv category to an INSPIRE one:

https://github.com/inspirehep/inspire-next/blob/master/inspirehep/config.py#L1445-L1604

Concerning facets: I would use only INSPIRE categories (gut feeling, no strong opinion). If people really want to search for arXiv categories they can still do that explicitly - right?

In principle both searches are possible, but we have stats demonstrating that users are not able to search with them, via search syntax. On the other hand facets are very easy to manipulate. Indeed there is no hurry on this, so we can indeed see the results of user testing the alternatives.

ksachs commented 7 years ago

Not wrong in the sense of 'wrong mapping', but wrong in the sense of 'this is not about Experiment-HEP'. At arXiv articles get cross-listed for better visibility 'also of interest for...', at INSPIRE the subject always means 'is about', we don't have the notion of secondary category.

kaplun commented 7 years ago

In the new model a record can have more than one INSPIRE subject, so crosslisting is preserved.

kaplun commented 7 years ago

IMHO users will not be harmed by having a record belong to multiple categories. It's just a way for a paper to be reached. Once reached, the user will be able to make their own judgement. But indeed we are making several assumptions here, so it is worth doing some user checking.

ksachs commented 7 years ago

side effect: everything with an INSPIRE category *-HEP will (automatically) get CORE. I don't need / want user-checking to know that 'The Escaramujo Project: instrumentation courses during a road trip across the Americas' is not CORE. And I want to be in control of INSPIRE subjects even for arXiv papers. If you keep arXiv and INSPIRE categories in sync we completely depend on arXiv. So: derive INSPIRE category automatically from arXiv on ingestion: perfect. But later on they should be independent.

david-caro commented 7 years ago

@ksachs I think that is the idea, the INSPIRE categories are only guessed/automatically added if they are not there already, and only on ingestion. After that, they can be manually changed if needed.

The point that @kaplun is trying to make, is that we can have multiple INSPIRE ones, just as we have multiple arXiv ones, with the same treatment to main category and multiple secondary ones.
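
The ingestion rule discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the field names `inspire_categories`, `arxiv_categories` and `abstract`, the function name, and the three-entry mapping excerpt are hypothetical, and `guess_from_text` stands in for whatever magpie-based guesser ends up being used.

```python
# Hypothetical excerpt of an arXiv -> INSPIRE mapping; the real table is the
# one in the inspire-next config referenced earlier in this thread.
ARXIV_TO_INSPIRE = {
    "hep-th": "Theory-HEP",
    "hep-ex": "Experiment-HEP",
    "physics.flu-dyn": "General Physics",
}


def derive_inspire_categories(record, guess_from_text):
    """Return the INSPIRE categories for a record at ingestion time.

    Existing INSPIRE categories are never overwritten, arXiv categories are
    mapped when present, and the text-based guesser is only a last resort.
    """
    existing = record.get("inspire_categories")
    if existing:
        return list(existing)

    mapped = []
    for arxiv_category in record.get("arxiv_categories", []):
        inspire_category = ARXIV_TO_INSPIRE.get(arxiv_category)
        if inspire_category and inspire_category not in mapped:
            mapped.append(inspire_category)
    if mapped:
        return mapped

    # Assumed interface: a callable taking the abstract and returning a list.
    return guess_from_text(record.get("abstract", ""))
```

Because the function is meant to run only once, at ingestion, later manual edits by a cataloger would not be overwritten.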

kaplun commented 7 years ago

Yeah, plus I was really curious to know if there was maybe something we could correct, since you were mentioning that you were actually removing INSPIRE categories. Fully understood now.

jacquerie commented 7 years ago
  • In the last 6 months only ~0.01% users queried INSPIRE using arXiv categories (or any category at all)

Correcting myself: the ratio of

SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction AND
  name LIKE '% fc %'
);

and

SELECT count(*)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction
);

is 0.07%. Still low, but mostly I wanted to write down how to compute this kind of number if someone else is interested.

Also: this is the ratio of queries containing fc, not users. Computing the ratio of users can be done, but is slightly harder.

jacquerie commented 7 years ago

It's actually pretty easy. The ratio of:

SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction AND
  name LIKE '% fc %'
);

and

SELECT COUNT(DISTINCT idvisitor)
FROM piwik_log_action, piwik_log_link_visit_action
WHERE (
  type=8 AND
  idaction_name=idaction
);

is 0.1%.

That is, of all distinct users that made at least a query on INSPIRE in the last six months, 0.1% of them used the fc keyword.
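
For anyone who wants to try the computation without access to the Piwik database, here is a minimal sketch using Python's `sqlite3` on a toy in-memory copy of the two tables. The toy tables contain only the columns used by the queries above, the rows are invented, and the visitor ids are plain strings purely for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE piwik_log_action (idaction INTEGER, name TEXT, type INTEGER);
    CREATE TABLE piwik_log_link_visit_action (idvisitor TEXT, idaction_name INTEGER);
    -- Invented sample data: three distinct queries, four distinct visitors.
    INSERT INTO piwik_log_action VALUES
        (1, 'find fc t and a witten', 8),
        (2, 'find a witten', 8),
        (3, 'find t entropy', 8);
    INSERT INTO piwik_log_link_visit_action VALUES
        ('alice', 1), ('alice', 2), ('bob', 2), ('carol', 3), ('dave', 3);
    """
)

BASE_QUERY = """
    SELECT COUNT(DISTINCT idvisitor)
    FROM piwik_log_action, piwik_log_link_visit_action
    WHERE type = 8 AND idaction_name = idaction
"""


def count_visitors(extra_condition=""):
    """Run the base query, optionally narrowed by one extra SQL condition."""
    (count,) = conn.execute(BASE_QUERY + extra_condition).fetchone()
    return count


fc_visitors = count_visitors(" AND name LIKE '% fc %'")
all_visitors = count_visitors()
ratio = fc_visitors / all_visitors
print(f"{fc_visitors}/{all_visitors} visitors used fc ({ratio:.1%})")
```

Replacing `COUNT(DISTINCT idvisitor)` with `count(*)` in `BASE_QUERY` gives the per-query ratio instead of the per-user one.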

kaplun commented 7 years ago

@annetteholtkamp, rightly points out, of course, that it does make quite a difference to know if an INSPIRE category had been:

So finally in our data structure seems like we still need to capture the source for INSPIRE categories.

david-caro commented 7 years ago

If the inspire categories guess is done only on ingestion and only if not there already, I don't think that it adds much value to track the source of the categories, let me explain (it's very possible that I just don't get your point ;), I don't see @annetteholtkamp rationale here).

The arxiv keywords come from outside, and will never change, so letting them have their own section already expresses the source (passive info).

The inspire ones, we just care if they are right or not, if they are not right, we just change them, no matter if they come from magpie, if they came from arxiv or from a cataloger no?

I only see a couple of use cases where we would want that info, to train better magpie by comparing the result it gave with the one from the cataloger, or something like getting a list of all the records with automated categories... though I don't see that being very useful, as we can just train magpie with the whole set of data just expecting it to be correct (as it passed validation from a cataloger).

kaplun commented 7 years ago

How do you envisage a workflow where a cataloger checks a record only 2 weeks after it arrives on INSPIRE? At that point we have already ingested it, and the INSPIRE category has already been guessed via arXiv/magpie. How can the cataloger know whether to trust the INSPIRE category?

Besides this small point I agree with you that in general we don't care about arXiv vs. magpie.

david-caro commented 7 years ago

Well, it does not matter who put it no? Just if it is correct or not, and for that you don't care of the origin right? Unless, there's some extra work associated with checking the correctness itself. In that case it makes sense to add that 'already verified' flag.

ksachs commented 7 years ago

It's about the trust level. Up to now the only sources for subject categories were catalogers (Annette, Florian and me) or arXiv (obvious to spot). If we have more sources (magpie, journal category, ...) it would be nice to know whether it is worth a second look.

salmele commented 7 years ago

Excellent point, @ksachs. This calls for the notion of provenance in the record (for later audit, should the need arise). Didn't Invenio offer something on that?

kaplun commented 7 years ago

@salmele nothing out of the box. It's up to us to decide what provenance to store for what and where, and to implement support for it.

Invenio supports basic history information, such as 'this version of record X was touched by curator Y', but we don't know which field is actually involved.

OK. So all in all, we don't simplify the data model WRT INSPIRE subjects, but we keep them as a list of objects with two attributes: source (enum with curator, magpie, arxiv) and value (enum among the current list of subjects).

BTW how shall we call it? inspire_subjects or _categories or _topics?
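
A minimal sketch of the list-of-objects structure described above, written as a JSON-Schema-style Python dict plus a dependency-free check. The names `INSPIRE_SUBJECTS_SCHEMA` and `is_valid_subject` are hypothetical, and the subject enum is only an excerpt for illustration.

```python
SOURCES = ("curator", "magpie", "arxiv")
# Hypothetical excerpt; the real enum would list every INSPIRE subject.
SUBJECTS = ("Theory-HEP", "Phenomenology-HEP", "Experiment-HEP", "Lattice")

# JSON-Schema-style description of the field (the field name is still open).
INSPIRE_SUBJECTS_SCHEMA = {
    "type": "array",
    "uniqueItems": True,
    "items": {
        "type": "object",
        "additionalProperties": False,
        "required": ["source", "value"],
        "properties": {
            "source": {"enum": list(SOURCES)},
            "value": {"enum": list(SUBJECTS)},
        },
    },
}


def is_valid_subject(entry):
    """Check one entry against the two enums without external libraries."""
    return (
        isinstance(entry, dict)
        and set(entry) == {"source", "value"}
        and entry["source"] in SOURCES
        and entry["value"] in SUBJECTS
    )
```

With this shape, `{"source": "magpie", "value": "Lattice"}` would be accepted, while an entry with an unknown source such as `"journal"` would be rejected until the enum is extended.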

david-caro commented 7 years ago

@salmele Quick question, what should we do when:

This (and choosing the name) is currently blocking @rikirenz on #47.

salmele commented 7 years ago

I am concerned to preserve arXiv categories for arXiv-ingested preprints for user to (faceted) search for them, as they of course are aware of what arXiv categories are.

I am neutral about what to do with INSPIRE subjects and I recommend @ksachs or @annetteholtkamp decide how they want to define the ontology of sources and their priority

bing13 commented 7 years ago

i concur.


kaplun commented 7 years ago

@david-caro IMHO there's not going to be arXiv+magpie (since magpie is used only in case of no arXiv). and cataloger wins over the two. So in case there is same term for arXiv and cataloger we preserve the cataloger one.
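
The precedence rule in this comment might look like the sketch below. The numeric priorities and the function name are assumptions made for illustration; the only thing taken from the discussion is that the cataloger (the `curator` source) wins when the same term appears more than once.

```python
# Higher number wins. arXiv and magpie are given equal weight here because,
# per the comment above, they are not expected to occur together.
PRIORITY = {"curator": 2, "arxiv": 1, "magpie": 1}


def merge_subjects(entries):
    """Keep one entry per subject value, preferring the most trusted source."""
    best = {}
    for entry in entries:
        current = best.get(entry["value"])
        if current is None or PRIORITY[entry["source"]] > PRIORITY[current["source"]]:
            best[entry["value"]] = entry
    return list(best.values())
```

For example, an arXiv-derived and a curator-assigned `Theory-HEP` collapse into the single curator entry, while unrelated subjects are left untouched.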

annetteholtkamp commented 7 years ago

Actually, we could run magpie also on arXiv papers and keep magpie’s suggestion in case of difference to arXiv. This could be helpful for the content people as a hint to carefully check the subject.
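
As a rough sketch of that hint, assuming both the arXiv-derived subjects and magpie's suggestions are available as plain lists of subject names (the function name is hypothetical):

```python
def needs_second_look(arxiv_derived, magpie_suggested):
    """Return magpie's suggestions that the arXiv mapping did not produce."""
    return sorted(set(magpie_suggested) - set(arxiv_derived))
```

An empty result would mean magpie agrees with the arXiv-derived subjects; anything else could be shown to the content team as a prompt to check the record.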


salmele commented 7 years ago

For the avoidance of doubt: using magpie or human classification on arXiv would create a new INSPIRE subject and NOT modify the arXiv category (and cross-listing) as chosen by the submitter (which we'd show somewhere else)... right?

kaplun commented 7 years ago

@salmele right: arXiv categories are never going to be touched.

david-caro commented 7 years ago

So I propose then having this:

kaplun commented 7 years ago

Do we want to go for the undefined? In general so far, when something is undefined we simply haven't specified a value. In Invenio 1 this had some implications because it wasn't possible to search for those things having no-value. Not sure for Elasticsearch. @jacquerie ?

salmele commented 7 years ago

A field arxiv_categories with the categories list from arxiv, that are just populated on ingestion

...preserving the concept of primary category in arXiv (current PRIMARC in SPIRES syntax and dedicated MARC field)

kaplun commented 7 years ago

Yup. This is something that can be implemented at search time. The first arXiv category in the list of arXiv categories will be searchable also with primarc.
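
A minimal sketch of that convention, assuming the arXiv categories are stored as an ordered list with the primary one first (the function name is hypothetical):

```python
def split_arxiv_categories(arxiv_categories):
    """Return (primary, secondaries) from an ordered list of arXiv categories."""
    if not arxiv_categories:
        return None, []
    return arxiv_categories[0], list(arxiv_categories[1:])
```

The search layer could then index the first element under primarc and the whole list under the general arXiv-category field.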

(Ticketized in inspirehep/inspire-next#1791 so we don't forget)

michamos commented 7 years ago

TL;DR: adopt arXiv categories instead of INSPIRE categories

My opinion is rather different from those expressed before. So let me summarize some of the different viewpoints as far as I understand them:

One point of view is largely missing, the user point of view, which is probably the most important. I was a simple user until not too long ago, and my point of view is that INSPIRE categories are terrible, and I would suggest getting rid of them in their current form, for the following reasons.

nobody knows about them

as @jacquerie showed, ~0.1% of our users ever tried to perform a search with the 'fc' keyword. Of those, I am sure a large fraction is doing it wrong, e.g. fc hep-th instead of fc t. To the contrary, arXiv categories have been user-visible for 25 years and everyone knows what hep-th means. This goes to the extent that even inside INSPIRE, we use arXiv categories for HEPNAMES and jobs instead of the INSPIRE ones. So we don't have any legacy to preserve here, and we can use the migration as an opportunity to rethink the whole concept. I would suggest sticking as closely as possible to the arXiv categories that the people are familiar with. (For anecdotal evidence, I had never heard of INSPIRE categories as a user, and am still struggling to remember the less common ones.)

they are confusing

there are three types of categories:

See the list on the bottom of the INSPIRE categories and the arXiv categories that map to it. They are inconsistent, as General Physics is NOT the same as physics.gen-ph (which is a euphemism for garbage on the arXiv), and Data Analysis and Statistics does not contain the stat.* statistics categories. And who knows what Other means?

they are largely useless

the purpose of categories is to organize the records in such a way that the user can easily filter them out and focus on what she is interested in. Either she is interested in a small category, in which case she is looking conceptually for a single arXiv category, but with another name, or she is not. In the latter case, the category is probably useless anyway. For example, if someone wants to know more about entanglement entropy (which is a hot subject right now in hep-th, but was born in quant-ph and is used quite a lot in cond-mat), she is not interested in whether the paper is General Physics, but in the distinction between quant-ph and cond-mat.*. More importantly, our coverage will probably be quite poor in those areas, but the user is not necessarily aware of this fact.

solution: adopt arXiv categories

So I would suggest overhauling INSPIRE categories to match arXiv categories. For arXiv papers, we adopt the arXiv category as the INSPIRE category (keeping the distinction primary/secondary also on INSPIRE, and allowing the user to search/facet based on primary only or both primary and secondary). For records that do not come from arXiv, we put them in a specific category (e.g. physics.acc-ph) if we care about it, or we put them in a top-level category (e.g. astro-ph or math) if we don't want to zoom in.

If we keep the primary/secondary distinction, I don't think there would be any need for overriding categories (maybe @ksachs still sees a usecase), as we can trust the arXiv moderators which vastly outnumber the handful of people assigning categories in INSPIRE and are active researchers in their fields. But we could still have the provenance field just to be sure, and for the case in which we add a non-arXiv record that is later added to arXiv. And assigning articles to top-level categories should not be too difficult, and the number of them still manageable for magpie (with the added benefit that we could train it on arXiv data).


{'Accelerators': ['physics.acc-ph'],
 'Astrophysics': ['physics.space-ph',
  'astro-ph.EP',
  'astro-ph.GA',
  'astro-ph.SR',
  'astro-ph.CO',
  'astro-ph',
  'astro-ph.HE'],
 'Computing': ['cs.ET',
  'cs.RO',
  'cs.CY',
  'cs.NE',
  'cs.DB',
  'cs.DC',
  'cs.SI',
  'cs.DL',
  'cs.CV',
  'cs.FL',
  'cs.SD',
  'cs.PF',
  'cs.LG',
  'cs.DS',
  'cs.OH',
  'cs.OS',
  'cs.LO',
  'cs.MM',
  'cs.AI',
  'physics.comp-ph',
  'cs.GT',
  'cs.IR',
  'cs.NA',
  'cs.SE',
  'cs.CL',
  'cs.CG',
  'cs.DM',
  'cs.SY',
  'cs.GL',
  'cs.IT',
  'cs.CR',
  'cs.MS',
  'cs.SC',
  'cs.CC',
  'cs.AR',
  'cs.GR',
  'cs.NI',
  'cs.MA',
  'cs.PL',
  'cs.CE',
  'cs.HC',
  'cs'],
 'Data Analysis and Statistics': ['physics.data-an'],
 'Experiment-HEP': ['hep-ex'],
 'Experiment-Nucl': ['nucl-ex'],
 'General Physics': ['quant-ph',
  'cond-mat.stat-mech',
  'cond-mat.mes-hall',
  'cond-mat.supr-con',
  'physics.plasm-ph',
  'cond-mat',
  'cond-mat.other',
  'physics.class-ph',
  'nlin',
  'cond-mat.quant-gas',
  'cond-mat.dis-nn',
  'nlin.CD',
  'cond-mat.soft',
  'nlin.CG',
  'cond-mat.str-el',
  'physics.ao-ph',
  'cond-mat.mtrl-sci',
  'physics.gen-ph',
  'nlin.AO',
  'physics.atm-clus',
  'physics.flu-dyn',
  'physics.atom-ph',
  'physics.optics',
  'physics',
  'physics.geo-ph'],
 'Gravitation and Cosmology': ['gr-qc'],
 'Instrumentation': ['astro-ph.IM', 'physics.ins-det'],
 'Lattice': ['hep-lat'],
 'Math and Math Physics': ['patt-sol',
  'math.GT',
  'math.CV',
  'math.MP',
  'math.GM',
  'math.PR',
  'math.GR',
  'math.DG',
  'math.NA',
  'math.AP',
  'math.CA',
  'math.LO',
  'math.NT',
  'math.AG',
  'math.KT',
  'q-alg',
  'math.ST',
  'math.CT',
  'math.QA',
  'alg-geom',
  'math',
  'math.DS',
  'math.FA',
  'math.CO',
  'math.SP',
  'math.MG',
  'math.GN',
  'math.AT',
  'nlin.PS',
  'math.OC',
  'math.SG',
  'math.HO',
  'math.RT',
  'math.IT',
  'math.RA',
  'math.OA',
  'math-ph',
  'dg-ga',
  'math.AC',
  'solv-int',
  'nlin.SI'],
 'Other': ['q-fin.TR',
  'q-bio.PE',
  'q-bio.CB',
  'q-bio.BM',
  'q-fin.GN',
  'q-fin.PR',
  'stat.AP',
  'physics.chem-ph',
  'physics.pop-ph',
  'q-bio.MN',
  'stat.CO',
  'stat.ML',
  'physics.hist-ph',
  'q-fin.CP',
  'stat.OT',
  'q-bio.TO',
  'q-fin.EC',
  'q-bio.GN',
  'q-fin.PM',
  'physics.med-ph',
  'stat.TH',
  'physics.bio-ph',
  'q-bio.SC',
  'physics.soc-ph',
  'physics.ed-ph',
  'q-bio.OT',
  'q-bio.QM',
  'q-fin.ST',
  'q-bio.NC',
  'q-fin.RM',
  'q-fin.MF',
  'stat.ME'],
 'Phenomenology-HEP': ['hep-ph'],
 'Theory-HEP': ['hep-th'],
 'Theory-Nucl': ['nucl-th']}
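
To look things up in the other direction (arXiv category to INSPIRE category), the listing above can be inverted. The sketch below uses only a four-entry excerpt of it, and the names `INSPIRE_TO_ARXIV`, `invert` and `ARXIV_TO_INSPIRE` are hypothetical.

```python
# Excerpt of the listing above (INSPIRE category -> arXiv categories).
INSPIRE_TO_ARXIV = {
    "Data Analysis and Statistics": ["physics.data-an"],
    "General Physics": ["quant-ph", "physics.gen-ph"],
    "Other": ["stat.AP", "stat.ML"],
    "Theory-HEP": ["hep-th"],
}


def invert(mapping):
    """Build arXiv -> INSPIRE, failing loudly if the mapping is ambiguous."""
    inverted = {}
    for inspire_category, arxiv_categories in mapping.items():
        for arxiv_category in arxiv_categories:
            if arxiv_category in inverted:
                raise ValueError(
                    f"{arxiv_category} maps to more than one INSPIRE category"
                )
            inverted[arxiv_category] = inspire_category
    return inverted


ARXIV_TO_INSPIRE = invert(INSPIRE_TO_ARXIV)
```

With this excerpt, `stat.ML` resolves to `Other` rather than to `Data Analysis and Statistics`, which is the kind of mismatch pointed out in the comment above.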
david-caro commented 7 years ago

IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data, I'd try to get some from the users before doing any big effort either way (that might require some effort too, but having the machinery to easily get that kind of things will help on deciding on other features too).

kaplun commented 7 years ago

@michamos for us developers, can you exactly define or point us to the definition of what is:

  • primary arXiv category
  • secondary arXiv category
  • top-level arXiv category

E.g. with some concrete example to reason on.

michamos commented 7 years ago

@michamos for us developers, can you exactly define or point us to the definition of what is:

primary arXiv category

the thing we put into 037, e.g. for https://inspirehep.net/record/1501963 it is hep-th

secondary arXiv category

for the same record, they are cond-mat.stat-mech and physics.flu-dyn

top-level arXiv category

a category in bold on https://arxiv.org/, which might have subcategories, so hep-th or physics (probably using math instead of math.* though, to also take math-ph).

E.g. with some concrete example to reasoning on.

michamos commented 7 years ago

IMO we are guessing too much, the inspire categories have been hidden so far so we don't have usage data

the fact they are hidden and completely unknown is an opportunity to rethink INSPIRE categories IMHO

ksachs commented 7 years ago

Hi Micha,

thanks for your detailed input from the user side. Point taken - visibility is an issue. A pity that we don't have usage statistics from SPIRES - that's our legacy. Both field-codes (= subject, category, ...) and keywords are neglected on INSPIRE.

However you are talking about something you barely know (as you say yourself). The main use-case was that people would browse (we even had printed lists) through the daily inputs in their category. And from the arXiv listings I believe this is still the case.

First some statistics - records added the last 2 years:

  category     2016     2015   2015 core
  all         76795    51412       28802
  arxiv       22558    25454       17228
  a           10595    10957        4982
  b            6667     6007        1494
  c             912      610         351
  e            5313     5311        5087
  g            5941     6463        4479
  i            5143     5086        3171
  l            1063     1205        1202
  m            2718     3160        1548
  n          19345*     4451        2064
  o             422      503         169
  p            9304     9008        8750
  q            3894     4194        1692
  t            6819     7422        7304
  x          17809*     3717        1568
  * reharvest of nucl journals

Btw: other = non-physics

We have one big category (with a lot of non-core stuff we don't want to waste time on): astro, two small categories: lattice and computing, all the rest is around 5k/y. Which shows that the balance is not too bad.

You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv. And we have only a very selective part of arXiv. Where arXiv categories are split in subcategories we barely have records. For core records the big categories are p (9k), t (7k) and e (5k) which we might want to break down into smaller sub-categories; which SPIRES used to have, but that was before my time. Actually SPIRES had categories before arXiv existed.

The assignment of categories has to be feasible, which it is not if the categories are too fine-grained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators.

Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject. 10% I dumped right away, for about 20% I went through and made corrections, about 70% were useful. Which means even if we run magpie over everything that doesn't come from arXiv we still have ~7k/y to assign manually.

michamos commented 7 years ago

@ksachs thanks for your comments. I agree that having categories is very useful, I just think those categories should be closer to the arXiv ones that people know. In this way we can avoid having our users learn and remember what from their point of view is a new classification scheme which is very similar yet subtly different in some areas.

The main use-case was that people would browse (we even had printed lists) through the daily inputs in their category. And from the arXiv listings I believe this is still the case.

People browse the new arXiv listings (e.g. hep-th/new), it would actually be very nice if one could do the same on INSPIRE, with all new papers in a given category. So by adopting arXiv categories on INSPIRE, one could get, say, arXiv hep-th + non-arXiv hep-th additions.

Btw: other = non-physics

I know that Other = None of the rest, but in order to know what it means precisely one has to know all the other categories that we have

You forget that the INSPIRE content is not the same as the arXiv content. Only about half of our records come from arXiv. And we have only a very selective part of arXiv. Where arXiv categories are split in subcategories we barely have records.

That's why it would be good to facet in a hierarchical way, so that we can have the full math category collapsed into math at first, but with possibility to expand it if one is interested in math.RT but not math.AP.
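
A rough sketch of such a hierarchical facet, assuming each record is represented simply by its list of arXiv categories. The function names are hypothetical, and the `math-ph` alias is an assumption that encodes the earlier remark about folding math-ph into math.

```python
from collections import Counter, defaultdict

# Assumed alias so that math-ph is grouped under math, as suggested earlier.
TOP_LEVEL_ALIASES = {"math-ph": "math"}


def top_level(category):
    """'math.RT' -> 'math'; categories without a dot are already top-level."""
    base = category.split(".", 1)[0]
    return TOP_LEVEL_ALIASES.get(base, base)


def hierarchical_facet(records):
    """Count records per top-level category and per category underneath it."""
    totals = Counter()
    children = defaultdict(Counter)
    for categories in records:
        distinct = set(categories)
        # A record counts once per top-level category, however many
        # subcategories of it are listed.
        totals.update({top_level(category) for category in distinct})
        for category in distinct:
            children[top_level(category)][category] += 1
    return {
        top: {"count": totals[top], "children": dict(children[top])}
        for top in sorted(totals)
    }
```

The collapsed view would show only the top-level counts, and expanding `math` would reveal the counts for `math.RT`, `math.AP` and so on.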

The assignment of categories has to be feasible which it is not if the categories are too finegrained. It's mainly Florian and me who assign categories to the non-arXiv stuff, not a bunch of moderators.

I was mentioning moderators for the arXiv content. For non-arXiv, we could do it ourselves, but based on the arXiv top-level categories if it's something we don't care about (so physics or math or cs or ...).

Lesson learned from the reharvest of nucl journals, where we used magpie to guess the subject. 30% I dumped right away, for about 20% I went through and made corrections, about 50% were useful. Which means even if we run magpie over everything that doesn't come from arXiv we still have ~10k/y to assign manually.

What I have seen of magpie was very impressive. I would guess that magpie didn't have good training data in this case. By adopting arXiv categories, we could very easily train magpie on the arXiv corpus.