scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Get 'pub_url' for each publication #496

Open hfchen20 opened 1 year ago

hfchen20 commented 1 year ago

Describe the bug

I need to get the 'pub_url' for each publication by a scholar. The 'pub_url' field seems to be available only when running with (free) proxies; it is missing when the proxy is turned off. How can I get 'pub_url' in both cases, with or without proxies?

To Reproduce

from pprint import pprint
from scholarly import scholarly, ProxyGenerator

# Toggle to compare behavior with and without free proxies.
use_proxy = True

if use_proxy:
    pg = ProxyGenerator()
    pg.FreeProxies()
    scholarly.use_proxy(pg)

# Fetch the author and the five most recent publications.
author = scholarly.search_author_id('4bahYMkAAAAJ')
scholar = scholarly.fill(author,
                         sections=['basics', 'publications'],
                         sortby="year",
                         publication_limit=5)

# Fill each publication and print it; 'pub_url' is the field in question.
for pub in scholar['publications']:
    pub_filled = scholarly.fill(pub)
    pprint(pub_filled)

arunkannawadi commented 1 year ago

There's no difference in behavior whether you use proxies or not except for the success rate of the requests. Please post the outputs (from the same version of the code of course) in both the cases if there is an issue.
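To make the two runs easy to compare, one option is to reduce each run to a short report of which publications carry a 'pub_url'. The following is only a minimal sketch: `pub_url_report` is a hypothetical helper (not part of scholarly), and it assumes each filled publication is a dict with 'pub_url' as a top-level key and the title under `pub['bib']['title']`.

```python
from typing import Any, Dict, List, Tuple


def pub_url_report(pubs: List[Dict[str, Any]]) -> List[Tuple[str, bool]]:
    """Return (title, has_pub_url) for each filled publication dict.

    Hypothetical helper; assumes 'pub_url' is a top-level key and the
    title is stored under pub['bib']['title'].
    """
    report = []
    for pub in pubs:
        title = pub.get("bib", {}).get("title", "<untitled>")
        # A missing key and an empty value both count as "no pub_url".
        report.append((title, bool(pub.get("pub_url"))))
    return report
```

Running this over the filled publications once with proxies and once without, then posting both reports, would show whether the field is really tied to the proxy setting.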

hfchen20 commented 1 year ago

@arunkannawadi Thank you for the clarification. After a few more runs (on Colab), the 'pub_url' issue disappeared regardless of whether proxies were used. However, the field was definitely absent when I reported the issue, which happened to coincide with switching the proxy option on and off. Issue closed.

hfchen20 commented 1 year ago

@arunkannawadi Hello, the absence of 'pub_url' occurred again. Could you please help find out the reason behind this issue?

I forked scholarly and made a few minor changes related to pub_info: 'author', 'pub_date'. You can see the changes here: https://github.com/hfchen20/scholarly.git

Here is a Colab test notebook. https://drive.google.com/file/d/1GbWrHEK7REOWYx7ZNffgOltRu7siYNk1/view?usp=share_link

arunkannawadi commented 1 year ago

This issue seems specific to Google Colab. I ran your notebook on Colab and did not get pub_url entries, but running it as a script locally gives me the pub_url. Very weird!

hfchen20 commented 1 year ago

@arunkannawadi Okay. Running the Colab notebook did return the pub_url sometimes, but not always. Thank you very much for the testing!
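Since the field shows up on some runs and not others, a possible workaround is to retry the fill until 'pub_url' appears. This is only a sketch under assumptions: `fill_until_field` is a hypothetical helper, the thread does not confirm that retrying helps, and each attempt is given a fresh copy of the unfilled publication in case the fill function mutates its argument.

```python
import copy
import time
from typing import Any, Callable, Dict


def fill_until_field(
    fill: Callable[[Dict[str, Any]], Dict[str, Any]],
    pub: Dict[str, Any],
    field: str = "pub_url",
    attempts: int = 3,
    delay: float = 1.0,
) -> Dict[str, Any]:
    """Call fill() up to `attempts` times until `field` is present.

    Hypothetical helper, not part of scholarly. Returns the last result
    even if the field never appeared, so the caller can inspect it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result: Dict[str, Any] = {}
    for attempt in range(attempts):
        # Pass a copy so every attempt starts from the unfilled entry.
        result = fill(copy.deepcopy(pub))
        if result.get(field):
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before the next attempt
    return result
```

In the reproduction script this could be called as `fill_until_field(scholarly.fill, pub)` in place of `scholarly.fill(pub)`. Each retry issues another request, so a small `attempts` value is the safer choice.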