scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Possible to add scholar_id array to Publication objects that get populated upon Author fill #405

Closed. mzhukovs closed this issue 2 years ago

mzhukovs commented 2 years ago

It would be very useful, if possible, for each Publication in the publications array attached to an Author to include an author_id (or scholar_id) array containing the Google Scholar IDs of the publication's author(s).

Currently, it seems the only way to get this is to take the search_author results, fill the Authors (including 'publications'), then search for each publication separately and use that result, because even calling fill directly on the returned publication object still does not provide the author_id array.

arunkannawadi commented 2 years ago

TL;DR: It's a limitation of Google Scholar, not just scholarly.

The way you mentioned is exactly how you can find the author_ids of a publication's authors in Google Scholar, and scholarly, as a scraper library, is implemented to do exactly that. Since Google Scholar doesn't link the profiles of all the authors to the publication object reachable from a Google Scholar profile, scholarly can't fetch them directly either. Even when it links Google Scholar profiles to publication search results, it does so only for the first few authors, so the result is not guaranteed to be a complete list of the author_ids of all coauthors.

Rather than searching for the publication (which is subject to being blocked), you can get the complete list of coauthors' names for a publication, search for each author's profile by name using the search_author API, verify that the publication is listed on their GS profile if you want to be sure, and then append that scholar's author_id to the list. This would be a much preferred way, since it does not trigger CAPTCHAs. But all of this must come from your (or your application's) end, not from scholarly itself.
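The workflow described above could be sketched roughly as follows. This is a minimal illustration, not part of scholarly: `resolve_author_ids`, `normalize_title`, and `scholarly_backend` are hypothetical names, and the search and fill steps are passed in as callables so the matching logic can be tried without network access. It assumes scholarly's `search_author` and `fill` calls, and that author dictionaries expose `scholar_id` and `publications` (each with a `bib['title']`).

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Optional


def normalize_title(title: str) -> str:
    """Lowercase and drop non-alphanumerics so small formatting differences are ignored."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def resolve_author_ids(
    pub_title: str,
    coauthor_names: Iterable[str],
    search_author: Callable[[str], Iterable[dict]],
    fill_publications: Callable[[dict], dict],
    max_candidates: int = 3,
) -> Dict[str, Optional[str]]:
    """Map each coauthor name to a scholar_id, or None if no candidate profile lists the publication.

    Hypothetical helper: only the first `max_candidates` search hits per name are checked.
    """
    target = normalize_title(pub_title)
    resolved: Dict[str, Optional[str]] = {}
    for name in coauthor_names:
        resolved[name] = None
        for candidate in islice(search_author(name), max_candidates):
            profile = fill_publications(candidate)
            titles = {
                normalize_title(pub.get("bib", {}).get("title", ""))
                for pub in profile.get("publications", [])
            }
            # Accept the profile only if it actually lists the publication.
            if target in titles:
                resolved[name] = profile.get("scholar_id")
                break
    return resolved


def scholarly_backend():
    """Return (search, fill) callables backed by scholarly; imported lazily on purpose."""
    from scholarly import scholarly  # assumes `pip install scholarly`

    def fill_publications(author: dict) -> dict:
        # Fetch only the publications section of the profile.
        return scholarly.fill(author, sections=["publications"])

    return scholarly.search_author, fill_publications
```

With scholarly installed, this might be used as `search, fill = scholarly_backend()` followed by `resolve_author_ids(title, names, search, fill)`. Names that resolve to `None` either have no public profile or have a profile that does not list the publication.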

arunkannawadi commented 2 years ago

I put together a notebook based on some code I had. This shows that it's simple for an application to do what you want, but complex for the library to do it for you.

https://colab.research.google.com/drive/1BIvZoe-mp_z9M-zuX__LCN3tNgO7Qsyt?usp=sharing

mzhukovs commented 2 years ago

Thanks for the prompt reply and explanation, @arunkannawadi (and of course for all the work on this great lib). Despite what was probably not the best description on my end, it looks like you knew exactly what I was referring to.

mzhukovs commented 2 years ago

@arunkannawadi sorry, one more question:

On the filled author objects with the coauthors section, is there any way around the limit of 20 coauthors? I see "Fetching only the top 20 coauthors" getting printed, and no parameter for changing it.

And is there any sort of sorting or selection mechanism for the list that does come through there? e.g. can we say those are their top 20 coauthors based on total # of citations across jointly authored papers, or is that far from the truth?

arunkannawadi commented 2 years ago

Please see the optional dependencies section in our README. You'll need to install geckodriver or chrome-driver to get around the limitation of getting only 20 coauthors.

The list you see without it is the list of (at most) 20 coauthors that shows up on one's Google Scholar profile without clicking the VIEW ALL button. I am not entirely sure about the ordering, but it's static as far as I know.

In my notebook, I could have followed the approach of going through the coauthor's Google Scholar profile instead of searching for profiles by names. I did not do that because adding co-authors to a profile needs manual intervention. Google Scholar only suggests adding coauthors, and does not add them automatically. This means many people (including me) have at best only a partial list of coauthors, even amongst those who have a Google Scholar profile.

mzhukovs commented 2 years ago

Very useful information, thanks again. I had devised a way to efficiently build a network graph using only the search_author_id and fill methods, without needing a proxy, but it seems the main limitation (even if I can pull more than 20 coauthors from a profile) is that the list is still incomplete and entirely a function of how well the author has kept it updated.
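A graph built this way might look roughly like the breadth-first sketch below. It is an illustration under assumptions, not the code referred to in this comment: `build_coauthor_graph` and `scholarly_fetcher` are hypothetical names, and the sketch assumes that a profile filled with the `coauthors` section carries a `coauthors` list whose entries have a `scholar_id`.

```python
from collections import deque
from typing import Callable, Dict, Set


def build_coauthor_graph(
    seed_id: str,
    fetch_author: Callable[[str], dict],
    max_depth: int = 1,
) -> Dict[str, Set[str]]:
    """Breadth-first walk over listed coauthors, starting from one scholar_id.

    Profiles up to `max_depth` hops from the seed are fetched; coauthors of the
    outermost profiles appear as neighbours but are not fetched themselves.
    """
    graph: Dict[str, Set[str]] = {}
    queue = deque([(seed_id, 0)])
    while queue:
        scholar_id, depth = queue.popleft()
        if scholar_id in graph:
            continue  # already fetched via another path
        author = fetch_author(scholar_id)
        neighbours = {
            coauthor["scholar_id"]
            for coauthor in author.get("coauthors", [])
            if coauthor.get("scholar_id")
        }
        graph[scholar_id] = neighbours
        if depth < max_depth:
            for neighbour in neighbours:
                if neighbour not in graph:
                    queue.append((neighbour, depth + 1))
    return graph


def scholarly_fetcher() -> Callable[[str], dict]:
    """Return a fetch function backed by scholarly; imported lazily on purpose."""
    from scholarly import scholarly  # assumes `pip install scholarly`

    def fetch(scholar_id: str) -> dict:
        author = scholarly.search_author_id(scholar_id)
        # Only the coauthors section is needed for the graph edges.
        return scholarly.fill(author, sections=["coauthors"])

    return fetch
```

The edges only reflect coauthors that each person has added to their own profile, so the graph inherits the incompleteness discussed in this thread.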

arunkannawadi commented 2 years ago

You should use the search_author method when you know there are coauthors who are not listed. The main limitation in that case will be authors who don't have a public Google Scholar profile.