scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Can't find working solution for search_author_custom_url() #152

Closed Nicholas-Lewis-USDA closed 4 years ago

Nicholas-Lewis-USDA commented 4 years ago
def search_author_custom_url(self, url: str):
    """Search by custom URL and return a generator of Author objects
    URL should be of the form '/citation?q=...'"""
    return self.__nav.search_authors(url)

I am unsure as to why this is the requested format, or even how to use it. Google Scholar author profiles are listed as https://scholar.google.com/citations?hl=en&user=C_mP-LwAAAAJ

In author.py, _CITATIONAUTH is a perfect match for this if I were to provide the user information.

When I try the requested format manually, Scholar returns only my own profile.

When I use it in code, it finds nothing.

author_search.append("C_mP-LwAAAAJ") #Jessica Hicks
author_search.append("/citations?hl=en&user=G2B4pOEAAAAJ") #Tod Stuber

I have tried these formats, as well as the full HTML, with no luck. This method would be great, as I have a few users with very common names. Right now I rely on a switch to try to pull the exact author without having to loop over each author and check whether it's the right one. However, if Google ever changes the order of the results, I am forced to loop through them one by one, and some names return 20+ results.

silvavn commented 4 years ago

You are correct, the current implementation of the function is broken. Author takes the author row to be initialized. To implement that you would need to expand the Author class to take an author page instead of an author row.

Could you propose a PR for that?

Nicholas-Lewis-USDA commented 4 years ago

I will have to learn how to do PRs.

However, for now, in author.py:

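import re

# _CITATIONAUTHRE is assumed to be the author-id regex defined elsewhere in author.py.


class Author:
    """Returns an object for a single author"""

    def __init__(self, nav, __data, the_id="none"):
        self.nav = nav
        self._filled = set()
        self._sections = {'basics',
                          'indices',
                          'counts',
                          'coauthors',
                          'publications'}

        if isinstance(__data, str):
            # A bare string is treated as the author id itself.
            self.id = __data
        else:
            if the_id != "none":
                # Id supplied by the caller (e.g. taken from a profile URL);
                # return early so the row-based parsing is skipped.
                self.id = the_id
                return
            else:
                # Otherwise extract the id from the first link in the author row.
                self.id = re.findall(_CITATIONAUTHRE, __data('a')[0]['href'])[0]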

_scholarly.py:

def search_author_custom_url(self, url: str):
    """Search by custom URL and return a generator of Author objects
    URL should be of the form '/citations?hl=en&user={user's 12 character code}'"""
    return self.__nav.search_author_url(url)

_navigator.py:

def search_author_url(self, url: str):
    """Generator that yields a single Author object"""
    soup = self._get_soup(url)
    row = soup.find_all('div', 'gs_scl')
    # Assumes the URL ends with the 12-character author id.
    yield Author(self, row, url[-12:])

Usage: scholarly.search_author_custom_url("/citations?hl=en&user={user's 12 character code}")

The result is still returned as an object.

The output is a little messed up because I didn't fully understand everything. .email isn't working at all because _find_tag_class_name didn't work, and I didn't have time to look into it.

Not the cleanest, but perhaps you can help fill in the gaps.
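
One gap worth noting: the slice in search_author_url assumes the author id is always the last 12 characters of the URL, which breaks as soon as the query parameters appear in a different order. A minimal sketch of a more tolerant approach, using only the standard library, is to read the `user` query parameter instead. The helper name `author_id_from_url` is hypothetical and not part of scholarly.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def author_id_from_url(url: str) -> Optional[str]:
    """Return the value of the `user` query parameter, or None if it is absent.

    Hypothetical helper for illustration; it is not part of scholarly.
    Works for full profile URLs and for relative '/citations?...' paths.
    """
    query = urlparse(url).query
    values = parse_qs(query).get("user")
    return values[0] if values else None


# Parameter order does not matter, unlike slicing the last 12 characters.
full_url = "https://scholar.google.com/citations?hl=en&user=C_mP-LwAAAAJ"
relative_url = "/citations?user=G2B4pOEAAAAJ&hl=en"
print(author_id_from_url(full_url))
print(author_id_from_url(relative_url))
```

The extracted id could then be passed as the third argument to Author in place of the slice.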

Nicholas-Lewis-USDA commented 4 years ago

I am not sure what happened, but the above example is no longer working. I will have to play with it when I get back.

TomBrien commented 4 years ago

I think #148, which is merged, provides the required functionality here.

ipeirotis commented 4 years ago

It is now possible to search for an author using the author_id.
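
For the original use case (a fixed list of known author ids), a rough sketch might look like the following. It assumes the id lookup is exposed as `scholarly.search_author_id`, as described in the project's documentation; the wrapper `fetch_authors` and its `search` parameter are hypothetical and exist only so the lookup can be swapped out when testing offline.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def fetch_authors(
    author_ids: Iterable[str],
    search: Optional[Callable[[str], Any]] = None,
) -> Dict[str, Any]:
    """Look up each author id and return a mapping of id -> result.

    Hypothetical wrapper, not part of scholarly. When `search` is omitted,
    it falls back to scholarly.search_author_id (assumed to exist), which
    requires the scholarly package and network access.
    """
    if search is None:
        from scholarly import scholarly  # third-party; imported lazily
        search = scholarly.search_author_id
    return {author_id: search(author_id) for author_id in author_ids}


# Offline demonstration with a stand-in lookup instead of a real request.
def fake_search(author_id: str) -> Dict[str, str]:
    return {"id": author_id}


authors = fetch_authors(["C_mP-LwAAAAJ", "G2B4pOEAAAAJ"], search=fake_search)
print(sorted(authors))
```

Calling `fetch_authors` without the `search` argument would go through scholarly itself, so there is no need to loop over name-based search results to find the right person.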