Retrieving PDF files automatically

levguy / talksumm

TalkSumm - Scientific Paper Summarization Based on Conference Talks

GNU General Public License v3.0

Retrieving PDF files automatically #3

Closed: TGoldsack1 closed this issue 3 years ago

TGoldsack1 commented 3 years ago

Hi! Is there a simple way to retrieve the PDF files automatically via the titles and URLs provided in data/talksumm_papers_urls.txt?

I have attempted to do this using a Python script (requests, BeautifulSoup, etc.). However, all ACL paper URLs have Incapsula protection, which prevents them from being accessed in this way. Thanks.

levguy commented 3 years ago

Hi Tomas, thank you for your interest in our work. We were not allowed to share direct links to the PDF files, so we shared URLs of HTML pages that include the PDF links. The following lines of code might help you extract the direct PDF links:

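import re

import requests

# Example URL of one of the articles
url = "https://aclanthology.org/D15-1250/"

# Fetch the HTML page and fail early on an HTTP error status
response = requests.get(url, timeout=30)
response.raise_for_status()
text = response.text

# Regular expression to find a URL that ends with '.pdf'.
# The negative lookahead skips any 'http' that is followed by another
# 'http' on the same line, so the match starts at the last one.
regex = re.compile(r'(http)(?!.*(http))(.*?)(\.pdf)')
result = regex.search(text)
if result is None:
    raise ValueError(f"No PDF link found in {url}")
pdf_url = result.group(0)
print(pdf_url)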

It would be good to add some validity checks, e.g. to verify that the HTML page does not contain two or more different PDF links. Hope this helps.
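The validity check suggested above could look roughly like the following. This is a minimal sketch, not code from the TalkSumm repository: the helper name `extract_unique_pdf_url`, the pattern `PDF_URL_PATTERN`, and the sample HTML snippet are all assumptions made for illustration. It collects every distinct URL ending in `.pdf` and refuses to pick one when the page contains none or more than one.

```python
import re

# Hypothetical pattern (an assumption, not from the original thread):
# an http(s) URL without whitespace, quotes or angle brackets, ending in '.pdf'.
PDF_URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+?\.pdf', re.IGNORECASE)


def extract_unique_pdf_url(html: str) -> str:
    """Return the single distinct PDF URL found in html, or raise ValueError."""
    # The same link may appear several times on a page, so compare distinct URLs.
    candidates = sorted(set(PDF_URL_PATTERN.findall(html)))
    if not candidates:
        raise ValueError("no PDF link found")
    if len(candidates) > 1:
        raise ValueError(f"expected one PDF link, found {len(candidates)}: {candidates}")
    return candidates[0]


# Invented HTML snippet for illustration only; the same link appears twice.
sample_html = (
    '<a href="https://example.org/papers/demo.pdf">PDF</a>'
    '<meta content="https://example.org/papers/demo.pdf">'
)
print(extract_unique_pdf_url(sample_html))
```

Raising an error instead of silently taking the first match keeps ambiguous pages visible, so they can be inspected by hand.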

TGoldsack1 commented 3 years ago

Hi Guy, thank you for your helpful response! That makes sense. I had attempted something similar to your suggested code. The issue was that my requests to the given DOI URLs (for example, "https://doi.org/10.18653/v1/d15-1250") were being redirected to "https://aclweb.org/anthology/D15-1250" rather than "https://aclanthology.org/D15-1250/", and the former did not return the page's HTML content because of Incapsula protection. I've now put in place a validity check for this, as well as the ones you suggested, and I am able to retrieve the PDFs. Thanks again!
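The script described in this comment is not shown in the thread, so the following is only a sketch of one way such a redirect check might work. The helper name `to_anthology_url` and the constants are hypothetical, and the mapping from the old address to the new one is inferred solely from the two example URLs quoted above, not from any documented ACL Anthology rule.

```python
from urllib.parse import urlsplit

# Assumptions inferred from the two example URLs in this thread.
LEGACY_HOST = "aclweb.org"
LEGACY_PREFIX = "/anthology/"


def to_anthology_url(url: str) -> str:
    """Rewrite a legacy aclweb.org anthology URL to its aclanthology.org form.

    URLs that do not match the legacy pattern are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname == LEGACY_HOST and parts.path.startswith(LEGACY_PREFIX):
        # Keep only the paper identifier, e.g. 'D15-1250'.
        paper_id = parts.path[len(LEGACY_PREFIX):].strip("/")
        return f"https://aclanthology.org/{paper_id}/"
    return url


print(to_anthology_url("https://aclweb.org/anthology/D15-1250"))
```

With `requests`, the final address after following redirects is available as `response.url`, so a script could pass that value through a helper like this one before requesting the HTML page.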

levguy commented 3 years ago

Hi Tomas, glad to hear that! Would be great if you could contribute your script to this repo :-)

TGoldsack1 commented 3 years ago

Hi Guy, will do! I'll tidy up my script and make sure everything is working, then create a PR :-)

levguy commented 3 years ago

Thank you Tomas! :-)