Closed mcopik closed 3 years ago
And is there a way to obtain publication ID when it has not been cited yet?
It is possible to get an author_pub_id from an author's profile, but a citedby_url (and hence a cites_id) is not assigned by Google Scholar if the publication hasn't been cited. Your application should be able to handle the non-existence of these fields.
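A minimal sketch of handling the missing fields, assuming a publication is available as a plain dict whose keys are named after the fields in this thread (author_pub_id, cites_id). The helper name pick_publication_key is hypothetical and not part of scholarly, and the presence of a cites_id key is an assumption.

```python
def pick_publication_key(pub):
    """Return (field_name, value) for the first identifier present, else None.

    `pub` is assumed to be a plain dict; any of the fields may be missing.
    """
    for field in ("cites_id", "author_pub_id"):
        value = pub.get(field)
        if not value:
            continue  # field absent or empty, e.g. publication not cited yet
        if isinstance(value, (list, tuple)):
            # Several versions may carry several IDs; join them in a stable order.
            value = ",".join(sorted(str(v) for v in value))
        return field, str(value)
    return None  # caller has to fall back to metadata such as title and authors
```

A return value of None is the case described in this thread where only imprecise metadata is left to identify the publication.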
Is it an oversight or is there a fundamental problem and those IDs cannot be obtained in a consistent manner?
I think it's mostly an oversight, but there's also an issue with obtaining it consistently. Different versions of a publication could exist with different cites_id values. When obtaining them through an author's profile, we can fetch them all. However, it is not clear whether we obtain them all when searching for a publication. It is perhaps easiest if you extract the cites_id yourself from citedby_url, and we may include the cites_id field in a future version of scholarly.
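Extracting the ID yourself could look roughly like this. It is a sketch that assumes citedby_url carries the ID in a cites query parameter (as the cites=.... value mentioned in this thread suggests) and that IDs of multiple versions, if present, are comma-separated. The function name extract_cites_ids is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def extract_cites_ids(citedby_url):
    """Return the IDs found in the `cites` query parameter, or [] if there are none."""
    if not citedby_url:
        return []  # publication has no citedby_url, e.g. it has not been cited
    query = parse_qs(urlparse(citedby_url).query)
    ids = []
    for value in query.get("cites", []):
        # Assumption: several versions are joined with commas in one value.
        ids.extend(part for part in value.split(",") if part)
    return ids
```

It works on relative URLs such as "/scholar?cites=...&hl=en" as well as absolute ones, since only the query string is inspected.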
I've been trying to identify a unique publication ID that I could use to store and retrieve citations later.
I've pondered this. There's no way to uniquely refer to a publication (apart from inconvenient metadata like title, author names etc.) when none of the coauthors has a public Google Scholar profile and the publication has not been cited :-)
@arunkannawadi Thank you for the swift reply! After looking at the documentation, I got the impression that there's no unique ID. I guess I'll have to live with the imprecise metadata in such a case :-)
Closing the issue as there's no need for a fix.
Perhaps the cites=.... value can be extracted from the citedby_url of the publications that have source PUBLICATION_SEARCH_SNIPPET. I remember trying to decipher the Google Scholar logic around these IDs, but I do not think we finished adding all the necessary logic in the parsers.
I forgot to mention that looking for the unique ID in url_related_articles is the best I had come up with so far. All articles have this without any exception AFAIK. There is also a cluster ID, which is how Google Scholar identifies the multiple versions of the same document. I don't think we scrape the cluster ID, and we should be able to get it in the next release.
Oh well, the third item in your search query 'SeBS serverless' appears to be a case where neither "Related articles" nor "All N versions" links are found, so there's no way to uniquely refer to it. It appears to be hosted on a University website, usually not the kind of article that we expect people to be interested in. But we should have a way to extract the cluster ID regardless whenever it is available.
@arunkannawadi The third paper in the search query is a paper citing our arXiv paper; the name of our tool appears in the abstract. Indeed, this seems to be a user version pdf posted on the author's website and the conference proceedings are not yet online.
https://scholar.google.com/scholar?cluster=12885035189720623002&hl=en&as_sdt=0,5
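For a URL of this shape, the cluster ID can be read from the cluster query parameter. This is a small sketch using only the standard library; the function name extract_cluster_id is hypothetical and nothing here is part of scholarly.

```python
from urllib.parse import urlparse, parse_qs


def extract_cluster_id(url):
    """Return the value of the `cluster` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("cluster")
    return values[0] if values else None


url = "https://scholar.google.com/scholar?cluster=12885035189720623002&hl=en&as_sdt=0,5"
print(extract_cluster_id(url))  # prints 12885035189720623002
```

The ID is kept as a string, since it is only an opaque key for storage and lookup.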
Hi,
I've been trying to identify a unique publication ID that I could use to store and retrieve citations later. In my usage scenario, I look for the paper by its name. However, it seems that such an ID is not available when searching for a publication by title, even though the citedby_url is available. Example:
However, the same publication obtained through direct search has much more data, including a similar citedby_url with exactly the same ID, but it doesn't have the IDs. It looks like this particular feature is not implemented: https://github.com/scholarly-python-package/scholarly/blob/main/scholarly/publication_parser.py#L252
Is it an oversight or is there a fundamental problem and those IDs cannot be obtained in a consistent manner? And is there a way to obtain a publication ID when it has not been cited yet? The only way seems to be to look for the first author with a Scholar ID and obtain the author_pub_id from their profile, which also breaks if none of the authors have a Scholar ID :-)