sotetsuk / goscholar

Google scholar scraper written in Go
MIT License

deal with BibTeX information #87

Closed sotetsuk closed 8 years ago

sotetsuk commented 8 years ago

Why

To acquire the author information, we must get the BibTeX information.

How

1. Naive solution

See release v0.0.1-alpha

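func (a *Article) crawlAndParseBibTeX() {
    // Build the URL of the "Cite" pop-up for this article.
    popURL, err := CitePopUpQuery(a.InfoId)
    if err != nil {
        log.Fatal(err)
    }
    // First request: fetch the pop-up, which links to the citation exports.
    popDoc, err := goquery.NewDocument(popURL)
    if err != nil {
        log.Fatal(err)
    }
    // The first link inside #gs_citi is expected to be the BibTeX export.
    bibURL, ok := popDoc.Find("#gs_citi > a:first-child").Attr("href")
    if !ok {
        log.Fatal("BibTeX link not found in cite pop-up")
    }
    // Second request: fetch the BibTeX entry itself.
    bibDoc, err := goquery.NewDocument(SCHOLAR_URL + bibURL)
    if err != nil {
        log.Fatal(err)
    }
    a.Bibtex = bibDoc.Text()
}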
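
A variant of the naive solution that returns errors instead of calling log.Fatal could look like the sketch below. It is not the code from the release: `fetchBibTeX`, `fetchFunc`, `cannedFetch`, and the regular expression are hypothetical names, the pop-up markup is an assumption, and the standard library stands in for goquery so that the sketch runs offline.

```go
package main

import (
	"errors"
	"fmt"
	"html"
	"regexp"
)

// citeLinkRe captures the href of the first link inside the element with
// id "gs_citi". Assumption: the pop-up markup has roughly this shape.
var citeLinkRe = regexp.MustCompile(`id="gs_citi"[^>]*>\s*<a[^>]*href="([^"]+)"`)

// fetchFunc is a hypothetical stand-in for an HTTP GET returning the body.
type fetchFunc func(url string) (string, error)

// cannedFetch returns a fetchFunc that serves pages from a map, so the
// sketch can be exercised without touching the network.
func cannedFetch(pages map[string]string) fetchFunc {
	return func(url string) (string, error) {
		body, ok := pages[url]
		if !ok {
			return "", errors.New("not found: " + url)
		}
		return body, nil
	}
}

// fetchBibTeX makes the same two requests as the naive solution:
// the cite pop-up first, then the BibTeX link found inside it.
func fetchBibTeX(fetch fetchFunc, popURL, baseURL string) (string, error) {
	popHTML, err := fetch(popURL)
	if err != nil {
		return "", fmt.Errorf("fetching cite pop-up: %w", err)
	}
	m := citeLinkRe.FindStringSubmatch(popHTML)
	if m == nil {
		return "", errors.New("BibTeX link not found in cite pop-up")
	}
	// href values are HTML-escaped (e.g. &amp;), so unescape before use.
	bib, err := fetch(baseURL + html.UnescapeString(m[1]))
	if err != nil {
		return "", fmt.Errorf("fetching BibTeX entry: %w", err)
	}
	return bib, nil
}

func main() {
	// Made-up pages; the host example.test is a placeholder.
	fetch := cannedFetch(map[string]string{
		"pop": `<div id="gs_citi"><a href="/scholar.bib?q=x&amp;output=citation">BibTeX</a></div>`,
		"https://example.test/scholar.bib?q=x&output=citation": "@article{x}",
	})
	bib, err := fetchBibTeX(fetch, "pop", "https://example.test")
	fmt.Println(bib, err)
}
```

Returning the error lets the caller skip a single article when Google Scholar refuses a request, rather than terminating the whole crawl. The cost of two requests per article is unchanged.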

2. scholar.py's solution

  1. Send a request to GET_SETTINGS_URL (scholar.py#L939).
  2. Send a request to SET_SETTINGS_URL (scholar.py#L969).
  3. The "Import into BibTeX" link then appears in the search results (scholar.py#L457, scholar.py#L994).
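
The first two steps could be sketched in Go roughly as follows. This is only an outline under assumptions: `newSessionClient`, `get`, and `applySettings` are hypothetical names, both URLs are passed in by the caller because their values are not given here, and any values scholar.py reads from the settings page to build the second URL are left out. The cookie jar is there on the assumption that the saved setting travels with the session cookies.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"os"
)

// newSessionClient returns a client whose cookie jar carries cookies from
// one request to the next.
func newSessionClient() (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar}, nil
}

// get performs a GET, discards the body, and rejects non-200 responses.
func get(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the underlying connection can be reused.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return err
	}
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: unexpected status %s", url, resp.Status)
	}
	return nil
}

// applySettings runs step 1 and step 2 with the same client, so that later
// searches made with that client can show the "Import into BibTeX" link.
func applySettings(client *http.Client, getSettingsURL, setSettingsURL string) error {
	if err := get(client, getSettingsURL); err != nil {
		return fmt.Errorf("step 1 (get settings): %w", err)
	}
	if err := get(client, setSettingsURL); err != nil {
		return fmt.Errorf("step 2 (set settings): %w", err)
	}
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: settings GET_SETTINGS_URL SET_SETTINGS_URL")
		return
	}
	client, err := newSessionClient()
	if err != nil {
		fmt.Println(err)
		return
	}
	if err := applySettings(client, os.Args[1], os.Args[2]); err != nil {
		fmt.Println(err)
	}
}
```

Compared with the naive solution, this costs two extra requests per session rather than two per article, provided the same client is reused for every search.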

3. hildensia/scholar.py's solution

• scholar.py#L201
• Uses "Import into BibTeX" by sending the request with this header (?):
headers = {
    'User-Agent': self.UA,
    'Cookie': 'GSP=ID=%(ID)s:CF=%(CF)d' % {
        'ID': self.GID,
        'CF': self.cite_format
    }
}
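
In Go, the same header could be built along these lines. A minimal sketch: `gspCookie`, `newScholarRequest`, and `citeFormatBibTeX` are hypothetical names, and the value 4 for the BibTeX cite format is an assumption that should be checked against scholar.py before relying on it.

```go
package main

import (
	"fmt"
	"net/http"
)

// citeFormatBibTeX is an assumption: 4 is taken to select BibTeX output.
const citeFormatBibTeX = 4

// gspCookie builds the cookie value in the same shape as the Python header:
// GSP=ID=<id>:CF=<cite format>.
func gspCookie(id string, citeFormat int) string {
	return fmt.Sprintf("GSP=ID=%s:CF=%d", id, citeFormat)
}

// newScholarRequest prepares (but does not send) a GET request carrying
// the User-Agent and Cookie headers.
func newScholarRequest(url, userAgent, id string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	req.Header.Set("Cookie", gspCookie(id, citeFormatBibTeX))
	return req, nil
}

func main() {
	// The ID below is a made-up placeholder, not a real Google ID.
	req, err := newScholarRequest("https://scholar.google.com/scholar?q=example", "Mozilla/5.0", "0123456789abcdef")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Header.Get("Cookie"))
}
```

If this works, no settings round trip is needed, because the preference travels with every request.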

4. gscholar's solution

Request the BibTeX export URL directly:

https://scholar.google.com/scholar.bib?q=info:0qfs6zbVakoJ:scholar.google.com/&output=citation

See: https://github.com/5kg/gscholar/blob/master/lib/gscholar/paper.rb
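
For reference, building that URL from an article's info ID is a one-liner in Go. A sketch only: `directBibTeXURL` and `scholarURL` are hypothetical names, and the ID is assumed to contain only URL-safe characters.

```go
package main

import "fmt"

// scholarURL is the Google Scholar base URL used in the example above.
const scholarURL = "https://scholar.google.com"

// directBibTeXURL builds the scholar.bib URL for an article's info ID.
// Assumption: infoID needs no URL escaping.
func directBibTeXURL(infoID string) string {
	return fmt.Sprintf("%s/scholar.bib?q=info:%s:scholar.google.com/&output=citation", scholarURL, infoID)
}

func main() {
	fmt.Println(directBibTeXURL("0qfs6zbVakoJ"))
}
```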

This solution fails.
