Keyword arguments in AuthorRetrieval.get_documents() are ignored

RSHum23 commented 2 months ago

pybliometrics version: 4.0

Code to reproduce the bug: len(AuthorRetrieval(36766719200).get_documents(subtypes=None, kwargs="PUBYEAR = 2024")). This returns the total number of documents associated with the specified author ID, regardless of publication year. Note that the string "PUBYEAR = 2024" may be badly formatted, as the code doesn't recognise it.

Expected behavior: It should return the number of documents associated with the specified author ID that were published in the specified year.

I'm not sure how I should enter the kwargs in this case to specify the additional parameter of the query. I tried to go through the classes and superclasses etc. to understand how the process works, and I believe that at some point (maybe in superclass Search.__init__ lines 45-51) the built-in query specifying the author ID should be put together with the additional kwargs, but something must get lost in the process. Sorry if I totally misunderstood how the method works.

Michael-E-Rose commented 2 months ago

At the end of https://pybliometrics.readthedocs.io/en/stable/classes/AuthorRetrieval.html#pybliometrics.scopus.AuthorRetrieval.get_documents it says

Note: To update these results, use refresh; the class’ refresh parameter is not used here.

Michael-E-Rose commented 2 months ago

The parameter kwargs does not take a string like "PUBYEAR = 2024"; Scopus is not SQL. These keyword arguments refer to the parameters documented at https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl.

In general, to subset your results to a specific year, simply iterate over the results and apply a filter. Straightforward and efficient with pandas.
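A minimal sketch of that filter, using stand-in records instead of a live Scopus call. It assumes each result is a namedtuple with a `coverDate` string in `YYYY-MM-DD` form; the `Doc` type, the sample values, and the helper `filter_by_year` are hypothetical and only for illustration.

```python
from collections import namedtuple

# Stand-in for the namedtuples returned by get_documents(); real records
# carry many more fields. All values below are made up.
Doc = namedtuple("Doc", ["eid", "title", "coverDate"])

docs = [
    Doc("eid-1", "Paper A", "2023-11-02"),
    Doc("eid-2", "Paper B", "2024-03-15"),
    Doc("eid-3", "Paper C", "2024-12-01"),
]


def filter_by_year(documents, year):
    """Keep documents whose coverDate falls in the given year."""
    prefix = f"{year}-"
    return [d for d in documents
            if d.coverDate and d.coverDate.startswith(prefix)]


docs_2024 = filter_by_year(docs, 2024)
print(len(docs_2024))
```

With real data, `docs` would be the list returned by `AuthorRetrieval(...).get_documents()`. With pandas, building a DataFrame from that list and filtering on the `coverDate` column should do the same job.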

RSHum23 commented 1 month ago

Thanks for your reply and suggestions.

> The parameter kwargs does not take a string like "PUBYEAR = 2024"; Scopus is not SQL. These keyword arguments refer to the parameters documented at https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl.

In principle, as the documentation says "kwds (str) – Parameters to be passed on to ScopusSearch()", I should be able to specify some additional parameters for the query, like "date=2002-2007" (field "date"), so maybe the right way to do this is not "PUBYEAR=2024" but "date=2024". But this would be ignored anyway by .get_documents().

> In general, to subset your results to a specific year, simply iterate over the results and apply a filter. Straightforward and efficient with pandas.

I'm already doing this, but I wanted to reduce the amount of data that I'm extracting from Scopus (as I'm interested in a very specific time window), in order to also reduce the time needed to extract the data. The rationale should be that I can select and filter the data before downloading them from Scopus, shouldn't it?

Michael-E-Rose commented 1 month ago

> I should be able to specify some additional parameters for the query, like "date=2002-2007" (field "date"), so maybe the right way to do this is not "PUBYEAR=2024" but "date=2024". But this would be ignored anyway by .get_documents().

It's not ignored when you also use refresh=True. The kwds apply only when there is no cached result, or when the cached result is refreshed.
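To illustrate why extra parameters appear to be ignored, here is a toy model of a query cache. This is not pybliometrics' code: the names `search`, `fake_download`, `_cache`, and `download_log` are hypothetical, and the sketch assumes the cache is looked up by the query string alone.

```python
# Toy model of a query cache; NOT pybliometrics' actual implementation.
_cache = {}        # maps query string -> stored results
download_log = []  # records the parameters of every simulated download


def fake_download(query, **kwds):
    """Pretend to call the API; filter by the optional 'date' parameter."""
    download_log.append((query, dict(kwds)))
    all_docs = [("doc-2023", 2023), ("doc-2024", 2024)]
    year = kwds.get("date")
    return [d for d in all_docs if year is None or d[1] == year]


def search(query, refresh=False, **kwds):
    """Download only when nothing is cached or a refresh is requested."""
    if refresh or query not in _cache:
        _cache[query] = fake_download(query, **kwds)
    return _cache[query]


first = search("AU-ID(1)")                           # no cache: full download
second = search("AU-ID(1)", date=2024)               # cache hit: date not applied
third = search("AU-ID(1)", refresh=True, date=2024)  # re-download with date
```

In this model the second call returns the cached, unfiltered result, and only the third call sends `date` along. With pybliometrics itself, the corresponding call would presumably be something like `AuthorRetrieval(36766719200).get_documents(refresh=True, date="2024")`, where `date` is taken from the Scopus Search API parameters mentioned earlier in the thread.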

> The rationale should be that I can select and filter the data before downloading them from Scopus, shouldn't it?

The rationale is correct, but the gain in time is minuscule at best. Since pybliometrics caches the result, subsequent analysis is much faster, and this offsets the time gained from a faster download. For instance, if you try another year range, the full data can be reused, whereas you would have to download them again if you relied only on the kwds.

RSHum23 commented 1 month ago

I see, thank you! All clear now

pybliometrics-dev / pybliometrics

Keyword arguments in AuthorRetrieval.get_documents() are ignored #343