ltalirz / atomistic-software

Tracking citations of atomistic simulation engines
https://atomistic.software
GNU Affero General Public License v3.0

Improving accuracy of citation counts #114

Closed godotalgorithm closed 2 years ago

godotalgorithm commented 2 years ago

I am maintaining MOPAC, and as part of this activity, I've been exploring how to track its citations more accurately. My best attempt at this using Google Scholar data is: https://openmopac.github.io/_images/plot.pdf . There are two things to note: my estimate of citation counts for MOPAC is about twice what is in your database, and I have error bars.

My basic methodology has been to make the search criteria as broad as possible and then to estimate a false positive rate by sampling the data by hand. The premise is that false positives of an overly generous search are easier to quantify than false negatives of an overly narrow search. In MOPAC's case, I estimated false positive rates for two different, separated periods of time, and found them to be very similar. Thus, I suspect that it is safe to use a constant false positive rate for each code, but each code probably has a different false positive rate because some codes are more successful at inducing standardized citations than others.
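The rescaling described above can be sketched as follows. This is a minimal illustration, not MOPAC's actual pipeline: the function names and the sample numbers are hypothetical, and a Wilson score interval is one reasonable choice for turning a hand-sampled false-positive count into error bars.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion,
    given k false positives found in n hand-checked samples."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def corrected_count(raw_count, fp_sampled, n_sampled):
    """Rescale a raw Google Scholar hit count by the estimated
    false-positive rate, returning (estimate, lower, upper)."""
    p = fp_sampled / n_sampled
    lo, hi = wilson_interval(fp_sampled, n_sampled)
    # error bars come from the uncertainty of the sampled false-positive rate
    return raw_count * (1 - p), raw_count * (1 - hi), raw_count * (1 - lo)

# e.g. 2000 raw hits, 6 false positives found among 150 checked entries
est, low, high = corrected_count(raw_count=2000, fp_sampled=6, n_sampled=150)
```

Under the assumption of a constant false-positive rate, the same correction factor and relative error bars would then be applied to every year's raw count.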

Would you be interested in adapting this sort of methodology for atomistic.software? It would require an entry in codes.json for the false positive citation rate for each code, and you may need to adjust query_string somehow to accommodate more complicated nested logical expressions for the Google Scholar search strings. If you are willing to make the appropriate backend changes, then I'd be willing to go through the database entries to expand the search strings and estimate the false positive rates.
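For concreteness, a codes.json entry extended in this way might look something like the following. This is purely a hypothetical sketch: the false_positive_rate field and the nested query shown here are not part of the current schema.

```json
{
  "MOPAC": {
    "query_string": "(MOPAC OR \"MOPAC2016\") -expressway -\"Missouri Pacific\"",
    "false_positive_rate": 0.04
  }
}
```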

ltalirz commented 2 years ago

Dear Jonathan,

thank you very much for this thoughtful suggestion, as well as your offer to help - I really appreciate it! I think your suggestion is certainly worth considering, let me start with some explanation of the status quo, and then turn to your suggestion.

You rightfully point out that the current approach on atomistic.software of searching for both the name of a key author (or authors) and the name of the code excludes certain mentions of the code. This can happen, for example, when the name of the code author was not mentioned (no traditional "citation"), or when the reference section of the article was not indexed (e.g. code mentioned in title or abstract but full text index missing). For some codes this difference can indeed become quite substantial, e.g. I've seen a factor of 1.5 for LAMMPS [1].

This approach was inherited from the first static incarnation of the list and, I presume, was a natural evolution from wanting to count traditional citations. On the plus side, this puts the codes on almost [2] the same footing and one can almost [3] exclude false positives. On the negative side, you are correct that there is a real danger of false negatives, e.g. when authors start citing a new article/version of the software that does not match the current search query.

Turning to your suggestion.

My basic methodology has been to make the search criteria as broad as possible and then to estimate a false positive rate by sampling the data by hand. The premise is that false positives of an overly generous search are easier to quantify than false negatives of an overly narrow search. In MOPAC's case, I estimated false positive rates for two different, separated periods of time, and found them to be very similar. Thus, I suspect that it is safe to use a constant false positive rate for each code

Could you perhaps elaborate a bit on what makes you confident of the premise of a constant false positive rate?

While your two data points for the case of MOPAC seem to match it, I can't come up with a fundamental reason why the growth rate of citations to code "X" and the growth rate of the term "X" in the rest of Google Scholar should be assumed to correlate more generally.

I don't know what the origin of the false positives in your MOPAC search is, but for codes named after words from day-to-day language (WEST, exciting, ...), or perhaps even more so for those named after somewhat rare words (fleur, Amber, ORCA), I can certainly imagine a sudden uptick of false positives because the word appears in some other context that is becoming a hot topic of research. Or, conversely, a strong relative reduction of false positives because the code is young and its user base is growing much faster than the use of the search term in Google Scholar (citations of MOPAC have remained relatively stable over the last decade, while for, say, ORCA, annual citations have grown by an order of magnitude).

This would result in an incorrect representation of the citation trend for that code, which I consider a key aspect of atomistic.software (perhaps more so than highly accurate absolute citation numbers/ranking). My first thought was that such a drift would also be difficult to detect, while mitigating the issue of false negatives with the current approach seems more straightforward (by regularly checking for a more recent review paper / version release of a code). On second thought, I believe you are right that it would be easier to detect, but one may have to be very careful: for each update, one would have to sift through a large portion of the search results for each code and carry a time-dependent false positive rate (which complicates the approach).

There is also the question of what to do with popular packages with >1000 citations, where Google only provides links to the first 1000 results.

Secondly, one of the big advantages of using Google scholar is that it allows atomistic.software to provide direct links to the search results (no paywall), so that users can browse the current research done with a code. I would consider it a significant downside to have those results littered with false positives resulting from a more general query.

Given this analysis, I am initially skeptical of implementing this idea - but I certainly remain open to discussion.

At the same time, I am very interested in reducing false negatives in the current approach - a factor of 2 seems high to me, and not very satisfactory. One thing I've started doing is to include, besides the "main author of a code" (which is anyhow not well defined), first authors of the review articles via "OR", since the main author can hide behind an "et al." in references. In my experience so far, this improves the agreement between this way of citation counting and other, independent methods - perhaps this would help in the case of MOPAC as well?

So far, this has been done only for a couple of codes in the list on an "as needed" basis (e.g. for MOLCAS), and I would very much appreciate help in rolling it out more broadly.

[1] https://github.com/ltalirz/atomistic-software/issues/29

[2] Some codes need exceptions, e.g. for a code like WEST the only way may be to directly count the citations of a reference, or for Gaussian, where people will just cite the version of the code since this is the preferred citation prescribed by the authors.

[3] I try, with varying degrees of success.

godotalgorithm commented 2 years ago

I appreciate the thoughtful reply, and I will reply in kind.

You acknowledge an obvious problem - that with the tools at hand, we cannot track the citations of all atomistic simulation codes with equal accuracy (for various reasons). While perhaps an oversimplification, I've attempted to quantify accuracy in my own citation-tracking efforts by taking a statistical perspective and assigning a false-positive rate (which I could estimate reasonably accurately) to the data I had available. I should also clarify that while I used a time-invariant Google Scholar search string and assigned a time-invariant false-positive rate, there were time-specific factors that led to those choices. For example, MOPAC, like Gaussian and a few other codes, is very often cited with a year in the name (e.g. MOPAC2016), which is a complicating factor that doesn't affect much older citations to MOPAC. Certainly, the citation-gathering/quantifying process will need to change with time to maintain accuracy and relevance - if new sources of either false positives or false negatives emerge, then search criteria certainly need to be adjusted accordingly and past citation counts may even change somewhat.

I understand your desire to keep the false-positive rate relatively low, as people might want to browse through lists of papers citing these codes (and you provide a link to do so). I tolerated a somewhat high false-positive rate in my own efforts because I was rescaling the data accordingly, but it is similarly reasonable to report raw numbers and cap the tolerable false-positive rate to a relatively small amount (although this removes some flexibility from citation gathering).

Whether or not you decide to track or utilize a false-positive rate (to rescale the raw numbers) on this website, I think developers should be taking more responsibility in how their codes are perceived and more actively protect their work and scientific interests. Thus, while I do have some interest in better citation tracking of other codes, it is a much higher priority for me that MOPAC's citations are not severely undercounted. To that end, I've adjusted my own citation tracking of MOPAC to reduce the false-positive rate to a more acceptable number. I've adjusted my Google Scholar search string to:

```
(MOPAC OR "MOPAC93" OR "MOPAC97" OR "MOPAC2000" OR "MOPAC2007" OR "MOPAC2009" OR "MOPAC2012" OR "MOPAC2016") -expressway -policing -"Missouri Pacific" -"motif finding" -"primed amplification" -"MOPAC Blvd" -"MOPAC Boulevard" -"North MOPAC" -"Mopac Expy"
```

This was the best I could do because there is a limit to the length of a Google Scholar search string. Also, you should add the Google Scholar URL command &as_vis=1 to searches for MOPAC and all other codes to remove citations that don't correspond to actual papers. MOPAC has a lot of these from the days when QCPE software was often cited directly in the literature. "Stewart" is not useful in this search string because MOPAC is very frequently cited by just the program name or the name and the website (openmopac.net), and I suspect that MOPAC correspondingly has one of the highest false-negative rates on your list.
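As a sketch, the as_vis flag can simply be added when constructing the search URL. The parameter names below (q, hl, as_ylo, as_yhi, as_vis) are taken from the Scholar URLs appearing in this thread; this is not an official or documented API.

```python
from urllib.parse import urlencode

def scholar_url(query, year_lo=None, year_hi=None):
    """Build a Google Scholar search URL; as_vis=1 excludes 'citation-only'
    entries, i.e. records known only from bibliographies with no indexed
    document behind them."""
    params = {"q": query, "hl": "en", "as_vis": "1"}
    if year_lo is not None:
        params["as_ylo"] = year_lo
    if year_hi is not None:
        params["as_yhi"] = year_hi
    return "https://scholar.google.com/scholar?" + urlencode(params)

url = scholar_url('MOPAC OR "MOPAC2016" -expressway', year_lo=1993, year_hi=1993)
```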

My own estimate of a false-positive rate specifically targeted papers with DOIs, and this rate will still be nontrivial, even with this optimized search string, since Google Scholar indexes books, theses, conference proceedings, preprints, technical reports, and other research products that aren't scientific papers. However, with the more generous target of Google Scholar entries that legitimately refer to the computer program MOPAC, my estimate of the false-positive rate is now 4% (estimated from 5 random samples per year between 1990 and 2020).

I think a simple yet fair course of action is to just keep things as they are and accept revisions to your Google Scholar search strings that increase citation counts (thus apparently reducing false-negative rates) without causing a substantial false-positive rate. Developers should know how their codes are cited better than other people, and their expertise should be a welcome contribution to this website.

ltalirz commented 2 years ago

Thank you Jonathan for the detailed reply!

I think a simple yet fair course of action is to just keep things as they are and accept revisions to your Google Scholar search strings that increase citation counts [...] without causing a substantial false-positive rate. Developers should know how their codes are cited better than other people, and their expertise should be a welcome contribution to this website.

I agree.

The main downside I see is the increasing complexity of the search strings (I hope this will not create more maintenance work down the line), but I fully understand that absolute citation counts are important to the code developers, and I guess that is the price to be paid.

In the process, we may temporarily be introducing some imbalance in the citation counts between those codes whose search strings have already been updated and those that haven't been updated yet, but I guess we can live with that as well.

On the topic of requiring the author names to be part of the search string, I agree that the search string should mirror how the code is usually cited, and so there can be valid exceptions to this rule of thumb as long as the false positive rate remains small - say, below 10%.

There may be some edge cases where one needs to be careful, like the codes with >1000 citations (there certainly is some "order" to search results, i.e. a false positive rate is not necessarily constant across results), but overall I am confident that the search strings can be improved in this way across the board.

Eventually, we may want to actively solicit input on the search strings from developers, e.g. by contacting them directly (not all at once, but one after the other). So far I was hesitant to do this since this can be a sensitive topic and I know how busy developers are, but as atomistic.software becomes more widely used, the cost/benefit analysis may also change here.

Also, you should add the Google Scholar URL command &as_vis=1 to searches for MOPAC and all other codes to remove citations that don't correspond to actual papers.

Is that what this field does? From the Google Scholar FAQ I understood that these entries refer to citations Google Scholar is aware of but has no online resource for, which could be content from journals or not.

In a few spot checks I did, there was either no difference in the citation count or a very minor difference, so I guess the decision of whether to leave those in or not is not very consequential (unless you have made observations to the contrary).

Since we have come to agree on how to move forward, I will close this issue. Anyone wanting to reopen this issue can feel free to do so.

godotalgorithm commented 2 years ago

The &as_vis=1 Google Scholar search modifier has a substantial effect for windows of time further in the past. In MOPAC's case, it removes a ~30% false-positive rate in the late 1980s and early 1990s, stemming from the many citations of MOPAC as a software product distributed by the Quantum Chemistry Program Exchange (QCPE). These citations appeared in the bibliographies of papers, were formatted in a very non-uniform way, and didn't correspond to other papers, so Google Scholar resolves them as a large number of erroneous citation entries from that period. While anything presented as a "citation" by Google Scholar is unlikely to be a meaningful contribution to any code's citation count, I'm not aware of any modern practices that cause this to be a problem for citations in the last decade or so. Thus, as long as you don't examine citations earlier than 2010, this isn't of major concern to you.

ltalirz commented 2 years ago

Thanks for the clarification!

These citations were in the bibliography of papers, they were done in a very non-uniform way, and they didn't correspond to other papers, so Google Scholar resolves them as a large number of erroneous citation entries from that period of time.

I still didn't quite get what you mean by "they didn't correspond to other papers", could you just elaborate on this last bit? I would like to understand this better, e.g. in case we ever feel the need to extend the citation period further back into the past.

From the Google Scholar FAQ I would have concluded that Google Scholar came across these citations while processing the full-text archives of some journals, for which no corresponding online resources exist. That would make these citations difficult to check (we kind of have to take Google Scholar's word for it), but it seems to me you are convinced that these citations are erroneous. Perhaps an example would help me understand?

godotalgorithm commented 2 years ago

Well, an example with a large number of these would be citations to MOPAC in 1993, to its final early open-source release (MOPAC 7) and its first commercial release (MOPAC93):

https://scholar.google.com/scholar?q=MOPAC+OR+%22MOPAC93%22+OR+%22MOPAC97%22+OR+%22MOPAC2000%22+OR+%22MOPAC2007%22+OR+%22MOPAC2009%22+OR+%22MOPAC2012%22+OR+%22MOPAC2016%22+-expressway+-policing+-%22Missouri+Pacific%22+-%22motif+finding%22+-%22primed+amplification%22+-%22MOPAC+Blvd%22+-%22MOPAC+Boulevard%22+-%22North+MOPAC%22+-%22Mopac+Expy%22&hl=en&as_sdt=0%2C47&as_ylo=1993&as_yhi=1993

There are clearly numerous, very similar citations to MOPAC and/or its manual that are sprawled out as distinct "citation" entries in Google Scholar. My rough guess is that when Google Scholar ingests the bibliographies of papers, it assumes that each entry will be a citation to a research product (i.e. a paper, book, or conference proceeding) or a footnote. If it sees a year in an entry, it just assumes that it is a research product, and either connects it to a known research product or citation entry from that year or adds an empty citation entry for that year if there are no sufficiently close matches. The "cited by" links for each citation entry can let you see the papers that caused these erroneous entries to appear. For example, the top entry from that year with the most citations, "MOPAC 93.00 Manual", comes from bibliography entries such as:

(a) J. J. P. Stewart, MOPAC program package (MOPAC7/MOPAC93), QCPE-NO. 455 (1993); (b) J. J. P. Stewart, MOPAC93.00 Manual, Fujitsu Ltd., Tokyo, Japan (1993).

There is an ongoing trend of moving to direct bibliographic citations of software rather than the more common citations to software release papers, and this might cause these sorts of problems to reemerge.

ltalirz commented 2 years ago

Thanks, I think I finally get it now.

ltalirz commented 2 years ago

There is an ongoing trend of moving to direct bibliographic citations of software rather than the more common citations to software release papers, and this might cause these sorts of problems to reemerge.

Yes, although if we manage to get the community to adopt the FORCE11 software citation principles, in particular the use of unique, persistent identifiers, I think it can also be an improvement upon the status quo. E.g. code developers may be getting statistics about which versions of their software are being used.

I like e.g. the implementation by Zenodo, where a software has both a "concept DOI" and a separate DOI for each version (Figure 9 in the accompanying article to atomistic.software).