inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0
59 stars, 69 forks

RFC Citations 101 #549

Open kaplun opened 8 years ago

kaplun commented 8 years ago

Problem

Records are linked to each other in several ways:

We then have 3 possible combinations:

It is proposed that the total number of citations to A corresponds to the size of the union of all the documents citing either A or B. Hence len(C, D, E) -> 3

At display time, we might want to distinguish records explicitly linking to A from those linking to it only indirectly. (E would be printed in the first group.)

It is proposed that the total number of citations to B corresponds to the explicit number of citations to B only. Hence len(D, E) -> 2

At display time, we might still want to present the records explicitly citing B and also the records that cite A but not B.
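A minimal sketch of the counting proposed above for the supersedes relation. The toy data is an assumption inferred from the counts in the text (C cites only A, D cites only B, E cites both), and `cited_by`, `supersedes` and `citation_summary` are hypothetical names, not part of inspire-next. Only direct supersede links are followed.

```python
# Hypothetical toy data, inferred from the counts quoted above.
cited_by = {
    "A": {"C", "E"},  # records explicitly citing A
    "B": {"D", "E"},  # records explicitly citing B
}
supersedes = {"A": {"B"}}  # A supersedes B


def citation_summary(record):
    """Return (total, explicit, inherited) sets of citing records."""
    explicit = set(cited_by.get(record, set()))
    inherited = set()
    for old in supersedes.get(record, set()):
        inherited |= cited_by.get(old, set())
    inherited -= explicit  # a record citing both is shown as explicit
    return explicit | inherited, explicit, inherited


total_a, explicit_a, inherited_a = citation_summary("A")
total_b, explicit_b, inherited_b = citation_summary("B")
print(len(total_a), sorted(explicit_a), sorted(inherited_a))  # 3 ['C', 'E'] ['D']
print(len(total_b), sorted(explicit_b), sorted(inherited_b))  # 2 ['D', 'E'] []
```

The raw links stay untouched; the explicit/inherited split is computed only when the record is displayed.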

What counts as citations for a group G of documents including A & B?

It is proposed that this is defined as the size of the union of all the documents citing at least one of the documents in G.
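As an illustration, the group definition reduces to the size of a set union. The mapping `cited_by` and the function `group_citation_count` are hypothetical names, and the data is the same assumed toy example (C cites A, D cites B, E cites both).

```python
def group_citation_count(group, cited_by):
    """Count distinct documents citing at least one member of `group`."""
    citing = set()
    for doc in group:
        citing |= cited_by.get(doc, set())
    return len(citing)


# Assumed toy data: E cites both A and B but is counted once.
cited_by = {"A": {"C", "E"}, "B": {"D", "E"}}
print(group_citation_count({"A", "B"}, cited_by))  # 3
```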

How to count outgoing citations from A & B

In the case of A superseding B, if we are counting the citations outgoing from documents A & B, we should actually consider only those coming from A and ignore those from B. That is, citations in general should be considered only if they come from documents that are not superseded. (What happens if the citations from A have not been computed, though?)

Note: for book chapters and conference contributions, we could still keep counting outgoing citations, since in general we don't capture citations from books or proceedings.
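A rough sketch of the outgoing rule for the supersedes case: references are collected only from documents that have not been superseded. The names `references`, `superseded_by` and `outgoing_citations`, and the cited records X, Y, Z, are made up for illustration. The sketch does not address the open question of what to do when the references of A have not been computed yet.

```python
# Hypothetical toy data: what each document cites.
references = {"A": {"X", "Y"}, "B": {"X", "Z"}}
superseded_by = {"B": "A"}  # B has been superseded by A


def outgoing_citations(docs, references, superseded_by):
    """Collect references from `docs`, skipping superseded documents."""
    counted = set()
    for doc in docs:
        if doc in superseded_by:
            continue  # ignore citations coming from a superseded document
        counted |= references.get(doc, set())
    return counted


print(sorted(outgoing_citations(["A", "B"], references, superseded_by)))  # ['X', 'Y']
```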

CC: @eamonnmag, @jmartinm, @tomaszgy, @annetteholtkamp

salmele commented 8 years ago

I am not sure I understand the examples, and I can think of counter-examples that argue for the opposite.

1) You suggest that if A is a chapter of book B, and D cites book B, A inherits in all cases this citation.

The counter-example is that chapter A can be on foo, while another, totally unrelated chapter in book B is on bar. Then in the text of article D we have the string "As demonstrated in book B [citation] concerning bar...". Attributing the citation to A is of course wrong.

Another counter-example: it is as if you had a journal with several articles and, every time someone referred to the name of the journal, you gave a +1 to each article, which is clearly not correct.

2) You suggest that if A is a conference contribution in proceeding B, and D cites the proceeding B, A inherits in all cases this citation.

The counter-example is very similar. Proceedings B includes a contribution A in which Alice has discovered the Citationino, and a contribution from Bob, who has demonstrated that the Citationino does not exist. D cites "conference B where the absence of Citationino was demonstrated", and you cannot have Alice's paper credited for the opposite concept.

3) You suggest that if A supersedes B, and D cites old paper B, then A inherits in all cases this citation.

This seems to break the scholarly record and the consecutio temporis. Assume that D is written in 2016. If A is written in 2017, superseding B, which was written in 2015, it is bizarre that D would cite a 2017 paper... What would happen to @eamonnmag's graph, BTW?

eamonnmag commented 8 years ago

@salmele the graph would still plot the reference, but you would see a forward reference in the last case. We already have some examples where this happens.

In the proceedings example, this is a bizarre use case for me. When would one cite a proceedings or journal in full? Is it a physics thing?

In my opinion, which isn't so versed in the ways of citations, I see each version of a paper or object in general as being a citable product, and this is how DOIs (should) work. You have different DOIs for each version of a paper, so you can track who cited which version of an artefact. Translating that to citation counts, even if B supersedes A, they may be quite different, and therefore should be treated as separate objects with individual citation counts. On the front end, you can show the different versions and either aggregated or version specific counts, but by storing them individually from the outset, you avoid any problems.

For the books, this is tricky. When people cite books, they rarely cite individual chapters even though the primary source comes from 2 chapters out of say 10. So 20% of the book was useful to them and used in the research, but the rest wasn't. So saying the other 8 chapters are equivalent could be wrong. However because there is often ambiguity in book citations, there is often no way to know this. Again though, it would be better when storing all counts, they are stored individually for the books and the chapters. On the front end, these can either be again shown as aggregates or separately. But at least the raw information is as correct as it can be given the unfortunate incompleteness and uncertainty around such data.

As a technical aside, in graph databases, we can easily store different node and relationship types so we can capture relationships between articles (versions), book chapters (to a book), or proceedings to a collection of articles. But by adding complexity to the graph, we'd be taking a performance hit.
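To make the idea of typed relationships concrete without assuming any particular graph database, here is a small in-memory sketch. The relation names (`cites`, `supersedes`, `is_part_of`), the record identifiers and the `incoming` helper are all hypothetical.

```python
from collections import defaultdict

# Each edge is (source, relation type, target); the data is made up.
edges = [
    ("D", "cites", "B"),
    ("E", "cites", "A"),
    ("A", "supersedes", "B"),
    ("Ch1", "is_part_of", "Book"),
]

# Index edges by relation type, then by target, for quick lookups.
index = defaultdict(lambda: defaultdict(set))
for source, relation, target in edges:
    index[relation][target].add(source)


def incoming(relation, target):
    """Return the sources linked to `target` by the given relation type."""
    # .get() avoids creating empty entries in the defaultdicts.
    return set(index.get(relation, {}).get(target, set()))


print(sorted(incoming("cites", "B")))       # ['D']
print(sorted(incoming("supersedes", "B")))  # ['A']
```

Keeping the relation type on each edge lets the raw links be stored once and aggregated differently per relation at display time.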

kaplun commented 8 years ago

1) You suggest that if A is a chapter of book B, and D cites book B, A inherits in all cases this citation.

Correctly spotted. Indeed, this is proposed only for the relation "A supersedes B". For "is part of" it does not make sense. I'll amend the original description. Ditto for the conference proceedings use case.

3) You suggest that if A supersedes B, and D cites old paper B, then A inherits in all cases this citation.

Good point. Note that we need to distinguish here where things would be presented:

kaplun commented 8 years ago

Again though, it would be better when storing all counts, they are stored individually for the books and the chapters. On the front end, these can either be again shown as aggregates or separately. But at least the raw information is as correct as it can be given the unfortunate incompleteness and uncertainty around such data.

:+1: Indeed that was maybe not explicit in my 101, but yes, the raw data should store the exact links, while upon displaying we can aggregate according to what makes more sense and deliver the most service.

eamonnmag commented 8 years ago

:+1:

salmele commented 8 years ago

in A: total number of citations would be the union of the citations. However when displaying citation in A, I would distinguish between explicit citations, and inherited citations.

I am not sure that this behavior is the one expected by the community (of the readers, not of the authors of the entities receiving a larger number of citations). @annetteholtkamp @tsgit (and the rest of the team not on GitHub) this should be discussed in the context of the AB.

kaplun commented 8 years ago

(Just for reference: these stem from discussions with @annetteholtkamp in order to properly classify papers that superseded notes - e.g. ATLAS/CMS)

annetteholtkamp commented 8 years ago

Indeed we have to clearly distinguish between superseded papers on one hand and proceedings/books on the other.

The discussion about superseded papers was started by ATLAS, to eliminate overcounting of citations. At their request we don’t take citations to the superseded paper into account at all. And to avoid setting up something specific just for the LHC collaborations, we currently have the policy of discarding all citations to superseded papers, which is not very transparent to the users. So maybe we can come up with something smarter.

For books and proceedings it makes sense to me to show the aggregation of citations. Extremely useful for RPP - it would allow us to finally create separate records for all reviews. Which we currently can’t do because the PDG people need the aggregated citation count.

On 30 Nov 2015, at 15:12, Salvatore Mele notifications@github.com wrote:

in A: total number of citations would be the union of the citations. However when displaying citation in A, I would distinguish between explicit citations, and inherited citations.

I am not sure that this behavior is the one expected by the community (of the readers, not of the authors of the entities receiving a larger number of citations). @annetteholtkamp @tsgit (and the rest of the team not on GitHub) this should be discussed in the context of the AB.


salmele commented 8 years ago

For books and proceedings it makes sense to me to show the aggregation of citations. Extremely useful for RPP - it would allow us to finally create separate records for all reviews. Which we currently can’t do because the PDG people need the aggregated citation count.

But this does not seem to be what @kaplun is suggesting in the RFC.

If book/proceedings A has chapters B, C, D, ..., Z, then it is a great idea that the individual record of book/proceedings A can show the sum of the citations which went to B, C, D, ..., Z, in addition to those of A standalone.

But giving B all the citations of A, when A was cited standalone, seems a non sequitur.
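A sketch of this one-directional aggregation, under stated assumptions: the record names, the `parts` mapping and `display_citations` are hypothetical, and citing records are combined as a set union, so a paper citing two chapters is counted once (a plain sum would count it twice).

```python
# Hypothetical toy data: explicit citations per record.
cited_by = {
    "Book": {"P1"},       # cited standalone
    "Ch1": {"P2", "P3"},
    "Ch2": {"P3"},
}
parts = {"Book": ["Ch1", "Ch2"]}  # container -> its chapters


def display_citations(record):
    """Citing records to show: own citations plus those of any parts."""
    citing = set(cited_by.get(record, set()))
    for part in parts.get(record, []):
        citing |= cited_by.get(part, set())
    return citing


print(sorted(display_citations("Book")))  # ['P1', 'P2', 'P3']
print(sorted(display_citations("Ch1")))   # ['P2', 'P3']
```

The chapter does not inherit P1: aggregation only flows from the parts up to the container.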

kaplun commented 8 years ago

@salmele I have added the specification that the counting uses the union only in the case of the "supersedes" relation, not in the case of the "is part of" relation.

kaplun commented 8 years ago

Actually, I realize the mistake in my RFC: I swapped the roles of A and B for books and proceedings. I think that this way everything is correct.

kaplun commented 8 years ago

Note to @pherterich and @suenjedt: a similar relation could be devised for Data records that relate to a paper. This is similar to the discussion of the book chapter vs. the book, although we might decide that a citation to the data does not propagate to the corresponding paper.

bing13 commented 8 years ago

@eamonnmag said:

You have different DOIs for each version of a paper, so you can track who cited which version of an artefact ...

Just to be clear, whether a new article or book version receives a new DOI is up to the publisher, etc., and there is certainly variability in practices. FWIW.

StellaCh commented 6 years ago

Has this been addressed @kaplun @jacquerie ?

kaplun commented 6 years ago

Nope. We have for the time being a very naïve citation counting algorithm.