scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Missing publication ID from search results #350

Closed mcopik closed 3 years ago

mcopik commented 3 years ago

Hi,

I've been trying to identify a unique publication ID that I could use to store and retrieve citations later. In my usage scenario, I look for the paper by its name. However, it seems that such an ID is not available when searching for a publication by title, even though the citedby_url is available.

Example:

>>> from scholarly import scholarly
>>> author = scholarly.search_author_id('JdXd8pQAAAAJ')
>>> scholarly.fill(author)
>>> pub = author['publications'][5]
>>> scholarly.pprint(pub)
{
 'author_pub_id': 'JdXd8pQAAAAJ:Tyk-4Ss8FVUC',
 'citedby_url': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=12885035189720623002',
 'cites_id': ['12885035189720623002'],
 'filled': False,
 'num_citations': 6,
 'source': 'AUTHOR_PUBLICATION_ENTRY'
}

However, the same publication obtained through a direct search has much more data, including a similar citedby_url with exactly the same ID, but it has neither the author_pub_id nor the cites_id field.

>>> search_query = scholarly.search_pubs('SeBS serverless')
>>> scholarly.pprint(next(search_query))
{
 'author_id': ['JdXd8pQAAAAJ', 'PyY2WfkAAAAJ', 'l3ZOsHkAAAAJ'],
 'citedby_url': '/scholar?cites=12885035189720623002&as_sdt=5,33&sciodt=0,33&hl=en',
 'eprint_url': 'https://arxiv.org/pdf/2012.14132',
 'filled': False,
 'gsrank': 1,
 'num_citations': 6,
 'pub_url': 'https://arxiv.org/abs/2012.14132',
 'source': 'PUBLICATION_SEARCH_SNIPPET',
 'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3DSeBS%2Bserverless%26hl%3Den%26as_sdt%3D0,33&citilm=1&update_op=library_add&info=mqe0uTzX0LIJ&ei=YERnYYmGC4iKmgGYwLj4Dg&json=',
 'url_related_articles': '/scholar?q=related:mqe0uTzX0LIJ:scholar.google.com/&scioq=SeBS+serverless&hl=en&as_sdt=0,33',
 'url_scholarbib': '/scholar?q=info:mqe0uTzX0LIJ:scholar.google.com/&output=cite&scirp=0&hl=en'
}

It looks like this particular feature is not implemented: https://github.com/scholarly-python-package/scholarly/blob/main/scholarly/publication_parser.py#L252

Is it an oversight or is there a fundamental problem and those IDs cannot be obtained in a consistent manner? And is there a way to obtain publication ID when it has not been cited yet? The only way seems to be to look for the first author with a Scholar ID and obtain the author_pub_id from their profile, which also breaks if none of the authors has a Scholar ID :-)

arunkannawadi commented 3 years ago

> And is there a way to obtain publication ID when it has not been cited yet?

It is possible to get an author_pub_id from an author's profile, but a citedby_url (and hence cites_id) is not assigned by Google Scholar if the publication hasn't been cited. Your application should be able to handle the absence of these fields.
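
A minimal sketch of handling those missing fields, assuming a publication entry behaves like a plain dict as in the output above. The helper name `citation_fields` and the `uncited` entry are hypothetical and not part of scholarly:

```python
def citation_fields(pub):
    """Return (cites_ids, citedby_url), tolerating entries that lack them.

    `pub` is assumed to be a dict-like publication entry; this helper is
    illustrative and not part of scholarly.
    """
    cites_ids = list(pub.get('cites_id') or [])  # [] when the field is absent
    citedby_url = pub.get('citedby_url')         # None when the field is absent
    return cites_ids, citedby_url


# Entry with citation fields, values copied from the example in this issue.
cited = {
    'cites_id': ['12885035189720623002'],
    'citedby_url': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=12885035189720623002',
}
# Made-up entry for a publication that has not been cited yet.
uncited = {'num_citations': 0}

print(citation_fields(cited)[0])
print(citation_fields(uncited))
```

Using `.get` with a fallback keeps the calling code free of `KeyError` handling for uncited publications.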

> Is it an oversight or is there a fundamental problem and those IDs cannot be obtained in a consistent manner?

I think it's mostly an oversight, but there's also an issue with obtaining it consistently. Different versions of a publication could exist with different cites_id values. When obtaining them through an author's profile, we can fetch them all. However, it is not clear whether we obtain them all when searching for a publication. It is perhaps easiest if you extract the cites_id yourself from citedby_url; we may include the cites_id field in a future version of scholarly.
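
A minimal sketch of that extraction using only the standard library. The function name `extract_cites_ids` is hypothetical, and splitting on commas is an assumption about how several version IDs might be packed into a single `cites` parameter:

```python
from urllib.parse import urlparse, parse_qs


def extract_cites_ids(citedby_url):
    """Return the IDs found in the `cites` query parameter of a citedby_url.

    Works for absolute and relative URLs. Splitting on commas is an
    assumption about how multiple IDs would appear in one parameter.
    """
    if not citedby_url:
        return []
    query = parse_qs(urlparse(citedby_url).query)
    ids = []
    for value in query.get('cites', []):
        ids.extend(part for part in value.split(',') if part)
    return ids


# The two URL shapes shown earlier in this issue.
profile_url = 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=12885035189720623002'
search_url = '/scholar?cites=12885035189720623002&as_sdt=5,33&sciodt=0,33&hl=en'

print(extract_cites_ids(profile_url))
print(extract_cites_ids(search_url))
```

Both URL shapes from the example should yield the same ID, which makes the result usable as a lookup key for publications that have been cited at least once.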

arunkannawadi commented 3 years ago

> I've been trying to identify a unique publication ID that I could use to store and retrieve citations later.

I've pondered this. There's no way to uniquely refer to a publication (apart from inconvenient metadata like title, author names etc.) when none of the coauthors has a public Google Scholar profile and the publication has not been cited :-)

mcopik commented 3 years ago

@arunkannawadi Thank you for the swift reply! After looking at the documentation, I got the impression that there's no unique ID. I guess I'll have to live with the imprecise metadata in that case :-)

Closing the issue as there's no need for a fix.

ipeirotis commented 3 years ago

Perhaps the cites=.... value can be extracted from the citedby_url from the publications that have source PUBLICATION_SEARCH_SNIPPET. I remember trying to decipher the Google Scholar logic around these ids, but I do not think we finished adding all the necessary logic in the parsers.

arunkannawadi commented 3 years ago

I forgot to mention that looking for the unique ID in url_related_articles is the best I had come up with so far. All articles have this without any exception AFAIK. There is also a cluster ID, which is how Google Scholar identifies the multiple versions of the same document. I don't think we scrape the cluster ID yet, but we should be able to get it in the next release.
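
A minimal sketch of pulling that ID out of url_related_articles, based only on the URL shapes in the search result above. The function name `extract_scholar_token` is hypothetical, and the `related:<id>:` / `info:<id>:` layout of the `q` parameter is inferred from that single example rather than from any documented format:

```python
from urllib.parse import urlparse, parse_qs


def extract_scholar_token(url, prefixes=('related', 'info')):
    """Return the token between the first two colons of the `q` parameter.

    Expects values such as 'related:<id>:scholar.google.com/'. Returns None
    when the URL is empty or does not follow that (assumed) layout.
    """
    if not url:
        return None
    for value in parse_qs(urlparse(url).query).get('q', []):
        parts = value.split(':')
        if len(parts) >= 2 and parts[0] in prefixes and parts[1]:
            return parts[1]
    return None


# URLs copied from the search result earlier in this issue.
related = '/scholar?q=related:mqe0uTzX0LIJ:scholar.google.com/&scioq=SeBS+serverless&hl=en&as_sdt=0,33'
scholarbib = '/scholar?q=info:mqe0uTzX0LIJ:scholar.google.com/&output=cite&scirp=0&hl=en'

print(extract_scholar_token(related))
print(extract_scholar_token(scholarbib))
```

In the example, url_related_articles and url_scholarbib carry the same token, so either field could serve as a fallback when the other is missing.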

arunkannawadi commented 3 years ago

Oh well, the third item in your search query 'SeBS serverless' appears to be a case where neither "Related articles" nor "All N versions" links are found, so there's no way to uniquely refer to it. It appears to be hosted on a University website, usually not the kind of article that we expect people to be interested in. But we should have a way to extract the cluster ID regardless whenever it is available.

mcopik commented 3 years ago

@arunkannawadi The third paper in the search query is a paper citing our arXiv paper; the name of our tool appears in the abstract. Indeed, this seems to be a user version pdf posted on the author's website and the conference proceedings are not yet online.

https://scholar.google.com/scholar?cluster=12885035189720623002&hl=en&as_sdt=0,5