pybliometrics-dev / pybliometrics

Python-based API-Wrapper to access Scopus
https://pybliometrics.readthedocs.io/en/stable/

AbstractRetrieval - startref and refcount parameters do not work #225

Closed LukasWallrich closed 2 years ago

LukasWallrich commented 2 years ago

Thanks for this wonderful package. I am trying to extract full reference lists, but AbstractRetrieval only returns the first 40 references. This should have been solved by #154, but at least for me the startref and refcount parameters don't work.

For instance, the following code still gives me 40 references, starting from the first.

from pybliometrics.scopus import AbstractRetrieval
ac = AbstractRetrieval("10.1111/joms.12670", view="REF", startref=5, refcount=10)

When I use the API directly, it works as expected: https://api.elsevier.com/content/abstract/doi/10.1111/joms.12670?apiKey={KEY}&view=REF&startref=5&refcount=10
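For reference, the direct request above can be assembled with the standard library alone. This is only a sketch: `build_ref_url` is a hypothetical helper, not part of pybliometrics, and "KEY" is a placeholder for a real API key.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted above
BASE = "https://api.elsevier.com/content/abstract/doi/"

def build_ref_url(doi, api_key, startref, refcount):
    """Build the Abstract Retrieval URL for one window of the reference list."""
    params = {
        "apiKey": api_key,
        "view": "REF",
        "startref": startref,
        "refcount": refcount,
    }
    return BASE + doi + "?" + urlencode(params)

print(build_ref_url("10.1111/joms.12670", "KEY", 5, 10))
```

Printing the URL this way makes it easy to compare what you expect to be sent with what the API returns in a browser.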

(Is there any way to see the search that pybliometrics actually sends?)

Michael-E-Rose commented 2 years ago

I get 10, as intended.

I think it's because you previously made a request with the same DOI and view, which jointly determine the location where pybliometrics saves the response. If a saved response already exists at that location, pybliometrics reuses it and does not download anew. To force a new download, pass refresh=True.

You can find the location using ac._cache_file_path.
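The behaviour described here can be illustrated with a toy model. This is not pybliometrics code: `cache`, `fake_api` and `retrieve` are made-up names, and the dictionary merely stands in for the files saved on disk. The point is that the cache key contains only the identifier and the view, so changing startref or refcount alone does not trigger a new download.

```python
cache = {}  # stands in for the response files written to disk

def fake_api(identifier, view, startref, refcount):
    """Pretend remote call: return reference positions for the requested window."""
    return list(range(startref, startref + refcount))

def retrieve(identifier, view, startref=1, refcount=40, refresh=False):
    # Only identifier and view form the key; startref and refcount do not
    key = (identifier, view)
    if refresh or key not in cache:
        cache[key] = fake_api(identifier, view, startref, refcount)
    return cache[key]

first = retrieve("10.1111/joms.12670", "REF")
stale = retrieve("10.1111/joms.12670", "REF", startref=5, refcount=10)
fresh = retrieve("10.1111/joms.12670", "REF", startref=5, refcount=10, refresh=True)

print(len(first), len(stale), len(fresh))  # 40 40 10
```

The second call silently returns the 40 cached entries; only the call with refresh=True reflects the new window.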

LukasWallrich commented 2 years ago

Thanks for getting back - that works perfectly. Would it make sense to refresh automatically when the query params change? As far as I understand the API, that should almost always indicate that the user expects different results?

Michael-E-Rose commented 2 years ago

Most parameters (in this and other classes) become part of the filename, to distinguish different versions of the same query or retrieval.

In this case I didn't do this because I could not think of a use case that warrants such a change. If we change it retrospectively, all the previously downloaded documents can no longer be used.

What is your use case for retrieving only 10 references?

LukasWallrich commented 2 years ago

Thanks for clarifying. I am not actually interested in limiting the number of references but in going beyond 40, the maximum returned by the first request. As I understand the API, I then need to iterate with startref and set refcount so that each request asks for no more references than are still available.
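The iteration described here could look roughly like the sketch below. It only computes the (startref, refcount) pairs and makes no request. `reference_windows` is a hypothetical helper, and it rests on two assumptions that the thread does not confirm: that startref is 1-based, and that the total number of references is known in advance.

```python
def reference_windows(total, page_size=40):
    """Yield (startref, refcount) pairs that together cover `total` references."""
    start = 1  # assumption: the first reference has position 1
    while start <= total:
        # Trim the last window so it never asks for more than is left
        count = min(page_size, total - start + 1)
        yield start, count
        start += count

print(list(reference_windows(95)))
# [(1, 40), (41, 40), (81, 15)]
```

Each pair would then go into a separate AbstractRetrieval call with view="REF". Because of the caching explained above, every such call would also need refresh enabled, otherwise the first saved response is reused.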

Michael-E-Rose commented 2 years ago

I actually never realized it's only the first 40 references. Thanks for bringing this up! Will have to think about a solution.

Michael-E-Rose commented 2 years ago

At least the truncation only affects the REF view:

>>> from pybliometrics.scopus import AbstractRetrieval
>>> ab = AbstractRetrieval('2-s2.0-84986260127', view='FULL')
>>> len(ab.references)
42
>>> ab = AbstractRetrieval('2-s2.0-84986260127', view='REF')
>>> len(ab.references)
40

The REF view provides some information that the FULL view does not: author IDs, the citation count of the referenced article, its volume/issue/pages information, and whether the referenced article is part of Scopus (field type) or not.

Michael-E-Rose commented 2 years ago

I got to talk to a Scopus developer. They might look into this cap of 40 references for the REF view. Probably this was introduced to limit traffic as the REF view pulls information from somewhere else. Therefore I don't plan changes to this class yet, but I will allow for arbitrary keywords as in the other classes.

Michael-E-Rose commented 2 years ago

Actually, arbitrary keyword arguments (kwds) are already allowed, so I am closing this for now. Hopefully Scopus comes up with a clever solution for the references this summer.