serpapi / public-roadmap

Public Roadmap for SerpApi, LLC (https://serpapi.com)

[Google Scholar Cite API] Scrape BibTeX, EndNote, RefMan, RefWorks, DOI Links #166

Open aliayar opened 2 years ago

aliayar commented 2 years ago

The Scholar Cite API returns the links for BibTeX, EndNote, RefMan, and RefWorks, but it does not send requests to those links to collect their contents. More users are asking for these data points.



"links": [
    {
      "name": "BibTeX",
      "link": "https://scholar.googleusercontent.com/scholar.bib?q=info:lugrJQxNOc4J:scholar.google.com/&output=citation&scisdr=CgU-HvXjGAA:AAGBfm0AAAAAX_YjGYhV-YrFXcxZWhG2KfAKEMPjrsJN&scisig=AAGBfm0AAAAAX_YjGXeOdA5ZdH5OKvIH-PRqYgNI7Paj&scisf=4&ct=citation&cd=-1&hl=en"
    },
    {
      "name": "EndNote",
      "link": "https://scholar.googleusercontent.com/scholar.enw?q=info:lugrJQxNOc4J:scholar.google.com/&output=citation&scisdr=CgU-HvXjGAA:AAGBfm0AAAAAX_YjGYhV-YrFXcxZWhG2KfAKEMPjrsJN&scisig=AAGBfm0AAAAAX_YjGXeOdA5ZdH5OKvIH-PRqYgNI7Paj&scisf=3&ct=citation&cd=-1&hl=en"
    },
    {
      "name": "RefMan",
      "link": "https://scholar.googleusercontent.com/scholar.ris?q=info:lugrJQxNOc4J:scholar.google.com/&output=citation&scisdr=CgU-HvXjGAA:AAGBfm0AAAAAX_YjGYhV-YrFXcxZWhG2KfAKEMPjrsJN&scisig=AAGBfm0AAAAAX_YjGXeOdA5ZdH5OKvIH-PRqYgNI7Paj&scisf=2&ct=citation&cd=-1&hl=en"
    },
    {
      "name": "RefWorks",
      "link": "https://scholar.googleusercontent.com/scholar.rfw?q=info:lugrJQxNOc4J:scholar.google.com/&output=citation&scisdr=CgU-HvXjGAA:AAGBfm0AAAAAX_YjGYhV-YrFXcxZWhG2KfAKEMPjrsJN&scisig=AAGBfm0AAAAAX_YjGXeOdA5ZdH5OKvIH-PRqYgNI7Paj&scisf=1&ct=citation&cd=-1&hl=en"
    }
  ]
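For client code that consumes this part of the response, a small helper can index the entries by format name. This is a minimal sketch, assuming the response has already been decoded into a Python dict with the "links" shape shown above. `links_by_name` is a hypothetical helper, not part of any SerpApi library, and the URLs in the sample are placeholders.

```python
def links_by_name(cite_response: dict) -> dict:
    """Map each export format name (e.g. "BibTeX") to its link."""
    return {
        entry["name"]: entry["link"]
        for entry in cite_response.get("links", [])
        if "name" in entry and "link" in entry
    }


# Sample shaped like the "links" array above; the URLs are placeholders.
sample = {
    "links": [
        {"name": "BibTeX", "link": "https://example.com/scholar.bib"},
        {"name": "EndNote", "link": "https://example.com/scholar.enw"},
    ]
}
print(links_by_name(sample)["BibTeX"])
```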
dimitryzub commented 2 years ago

A workaround for now is to make an additional request on the client side and extract the data.

Workaround caveats

Without proxies, these requests will be blocked at some point. That said, the same requests are also blocked from time to time in my browser, and I'm not sure why.
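Because the blocking seems intermittent, one option is to retry with an increasing delay. The sketch below is only an illustration: `fetch_with_retries` is a hypothetical helper, the attempt count and delays are arbitrary assumptions, and retrying does not replace proxies if the block is persistent.

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Here `fetch` would wrap the actual HTTP call and raise when the response indicates a block (for example by calling `raise_for_status()` on a requests response).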

Examples in Python


Extracting BibTeX data:

import requests

# gBUAThD_S0oJ is the publication ID (result_id) taken from the organic results
cite_request = requests.get('https://scholar.googleusercontent.com/scholar.bib?q=info:gBUAThD_S0oJ:scholar.google.com/&output=citation&scisdr=CgXc0aSeEILWoQxhryY:AAGBfm0AAAAAYphntyZOhOIUdAWNIub-5_kh7noCkkdQ&scisig=AAGBfm0AAAAAYphnt2EC0ZvtGK0-KKcji1Kpe6BwvZo5&scisf=4&ct=citation')

print(cite_request.text)

Outputs:

@article{buie2010evaluation,
  title={Evaluation, diagnosis, and treatment of gastrointestinal disorders in individuals with ASDs: a consensus report},
  author={Buie, Timothy and Campbell, Daniel B and Fuchs, George J and Furuta, Glenn T and Levy, Joseph and VandeWater, Judy and Whitaker, Agnes H and Atkins, Dan and Bauman, Margaret L and Beaudet, Arthur L and others},
  journal={Pediatrics},
  volume={125},
  number={Supplement\_1},
  pages={S1--S18},
  year={2010},
  publisher={American Academy of Pediatrics}
}
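If structured fields are needed rather than the raw text, the entry can be split into a dict. This is a rough sketch, assuming every field sits on a single line in the `name={value}` form shown in the output above. `parse_bibtex_entry` is a hypothetical helper; a dedicated BibTeX parser would be more robust for entries laid out differently.

```python
import re

# "  year={2010}," -> ("year", "2010"); nested braces on one line are kept.
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\},?\s*$")
# "@article{buie2010evaluation," -> ("article", "buie2010evaluation")
HEADER_RE = re.compile(r"^@(\w+)\{([^,]+),\s*$")


def parse_bibtex_entry(text: str) -> dict:
    """Parse one BibTeX entry whose fields each sit on a single line."""
    entry = {}
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            entry["type"], entry["key"] = header.group(1), header.group(2)
            continue
        field = FIELD_RE.match(line)
        if field:
            entry[field.group(1)] = field.group(2)
    return entry


# Shortened copy of the output above.
sample = r"""@article{buie2010evaluation,
  journal={Pediatrics},
  number={Supplement\_1},
  pages={S1--S18},
  year={2010},
  publisher={American Academy of Pediatrics}
}"""
entry = parse_bibtex_entry(sample)
print(entry["key"], entry["year"])
```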

Extracting EndNote and RefMan data:

These are a little different: when you open the URL, the browser prompts you to download a file. To make this work, we can use the urlretrieve function from urllib.request:

import urllib.request

# Add a User-Agent to the request headers; otherwise the request is always blocked.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36')]
urllib.request.install_opener(opener)

# EndNote
urllib.request.urlretrieve('https://scholar.googleusercontent.com/scholar.enw?q=info:gBUAThD_S0oJ:scholar.google.com/&output=citation&scisdr=CgXc0aSeEILWoQxsJBg:AAGBfm0AAAAAYphqPBikvgDGYhw5sBkBY8QsRFJ6QnRx&scisig=AAGBfm0AAAAAYphqPIHSfAe0iiGlP0pAU9Lru-KuKj7N&scisf=3&ct=citation&cd=-1&hl=en', 'endNote_file.enw')

# RefMan
urllib.request.urlretrieve('https://scholar.googleusercontent.com/scholar.ris?q=info:gBUAThD_S0oJ:scholar.google.com/&output=citation&scisdr=CgXc0aSeEILWoQxsJBg:AAGBfm0AAAAAYphqPBikvgDGYhw5sBkBY8QsRFJ6QnRx&scisig=AAGBfm0AAAAAYphqPIHSfAe0iiGlP0pAU9Lru-KuKj7N&scisf=2&ct=citation&cd=-1&hl=en', 'refMan_file.ris')
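As a possible variation, the header can be attached to a single request and the export read into memory instead of being saved to disk. This is a sketch under assumptions: `build_request` and `download_text` are hypothetical helpers, the User-Agent string is a shortened placeholder, and the response is assumed to be UTF-8 text.

```python
import urllib.request

# Shortened placeholder; use a full browser User-Agent string in practice.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Attach the User-Agent to this request only, leaving the global opener untouched."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_text(url: str, timeout: float = 10.0) -> str:
    """Fetch the export and return it as text instead of writing a file."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Assumption: the export is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

Calling `download_text` with one of the EndNote or RefMan links would return the same content that urlretrieve writes to the file, subject to the same blocking caveats.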

EndNote output:

# endNote_file.enw

%0 Journal Article
%T Evaluation, diagnosis, and treatment of gastrointestinal disorders in individuals with ASDs: a consensus report
%A Buie, Timothy
%A Campbell, Daniel B
%A Fuchs, George J
%A Furuta, Glenn T
%A Levy, Joseph
%A VandeWater, Judy
%A Whitaker, Agnes H
%A Atkins, Dan
%A Bauman, Margaret L
%A Beaudet, Arthur L
%J Pediatrics
%V 125
%N Supplement_1
%P S1-S18
%@ 0031-4005
%D 2010
%I American Academy of Pediatrics
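To work with this output programmatically, the tagged lines can be grouped by tag. This is a minimal sketch based only on the layout shown above (a `%` tag, a space, then the value); `parse_endnote` is a hypothetical helper and does not cover the full EndNote format.

```python
from collections import defaultdict


def parse_endnote(text: str) -> dict:
    """Group EndNote tagged lines ("%A Buie, Timothy") by their tag."""
    record = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("%") and len(line) >= 2:
            tag, _, value = line.partition(" ")
            record[tag].append(value.strip())
    return dict(record)


# Shortened copy of the output above.
sample = """%0 Journal Article
%A Buie, Timothy
%A Campbell, Daniel B
%J Pediatrics
%@ 0031-4005
%D 2010"""
record = parse_endnote(sample)
print(record["%A"])
```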

RefMan output:

# refMan_file.ris

TY  - JOUR
T1  - Evaluation, diagnosis, and treatment of gastrointestinal disorders in individuals with ASDs: a consensus report
A1  - Buie, Timothy
A1  - Campbell, Daniel B
A1  - Fuchs, George J
A1  - Furuta, Glenn T
A1  - Levy, Joseph
A1  - VandeWater, Judy
A1  - Whitaker, Agnes H
A1  - Atkins, Dan
A1  - Bauman, Margaret L
A1  - Beaudet, Arthur L
JO  - Pediatrics
VL  - 125
IS  - Supplement_1
SP  - S1
EP  - S18
SN  - 0031-4005
Y1  - 2010
PB  - American Academy of Pediatrics
ER  - 
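The RIS output can be read in a similar way. This is a sketch that assumes the layout shown above (a two-character tag, two spaces, a hyphen, then the value) and stops at the ER tag that ends the record; `parse_ris` is a hypothetical helper that handles a single record only.

```python
import re

# "A1  - Buie, Timothy" -> ("A1", "Buie, Timothy")
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text: str) -> dict:
    """Collect the values of one RIS record, keyed by tag."""
    record = {}
    for line in text.splitlines():
        match = RIS_LINE.match(line)
        if not match:
            continue
        tag, value = match.groups()
        if tag == "ER":  # end-of-record marker
            break
        record.setdefault(tag, []).append(value.strip())
    return record


# Shortened copy of the output above.
sample = """TY  - JOUR
A1  - Buie, Timothy
A1  - Campbell, Daniel B
SP  - S1
EP  - S18
ER  - """
record = parse_ris(sample)
print(record["A1"])
```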

RefWorks

Not sure about RefWorks, as it requires a login to export the data.


Is there anything I missed or did wrong?

kagermanov27 commented 8 months ago

This issue could also cover DOI numbers, which already appear in some result links:

   {
      "position": 0,
      "title": "Effects of atmospheric water on· OH-initiated oxidation of organophosphate flame retardants: a DFT investigation on TCPP",
      "result_id": "kQ0PActc-woJ",
      "link": "https://pubs.acs.org/doi/abs/10.1021/acs.est.7b00347",
      "snippet": "Tris (2-chloroisopropyl) phosphate (TCPP), a widely used organophosphate flame retardant, has been recognized as an important atmospheric pollutant. It is notable that TCPP has potential for long-range atmospheric transport. However, its atmospheric fate is unknown …",
    },
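Until that is supported, a DOI can sometimes be pulled out of the result link on the client side. This is a rough sketch: `doi_from_link` is a hypothetical helper, the pattern is a common heuristic rather than a full DOI grammar, and it only helps when the publisher URL embeds the DOI, which many do not.

```python
import re
from typing import Optional

# Heuristic: "10.", a 4-9 digit registrant code, "/", then the suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def doi_from_link(link: str) -> Optional[str]:
    """Return the DOI embedded in a URL, or None if there is none."""
    match = DOI_RE.search(link)
    return match.group(0) if match else None


print(doi_from_link("https://pubs.acs.org/doi/abs/10.1021/acs.est.7b00347"))
# 10.1021/acs.est.7b00347
```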

This idea was moved from https://github.com/serpapi/SerpApi/issues/1453.