Add rudimentary web scraping

tommeagher / heroku_ebooks

An archive of a script to generate Markov chains and to post to an _ebooks account on Twitter using Heroku. No longer actively supported.


Add rudimentary web scraping #40

Closed: ConorIA closed this 6 years ago

ConorIA commented 6 years ago

Added rudimentary web scraping, as implemented in: https://twitter.com/academic_ebooks, which pulls vocab from http://www.pnas.org/reports/most-cited.

tommeagher commented 6 years ago

Awesome. Thank you! I'll take a look and try to merge this soon.

ConorIA commented 6 years ago

@tommeagher, thanks. I think there are ways to improve this, such as scraping multiple pages and extracting info more flexibly with Beautiful Soup. I'll probably take another crack at this in the coming days, so maybe hold off on the merge for the time being.
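For illustration only, here is a rough sketch of what "multiple pages" plus a configurable extraction rule could look like. It is not the code from this pull request: it swaps in the standard library's `html.parser` in place of Beautiful Soup so that it runs without extra dependencies, and the names `TagTextExtractor`, `extract_text`, `scrape_pages`, and the `fetch` parameter are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside a matching tag
        self.chunks = []    # text pieces of the element currently being read
        self.results = []   # one cleaned string per matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace and skip empty elements.
                text = " ".join("".join(self.chunks).split())
                if text:
                    self.results.append(text)
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, tag):
    """Return the text of every `tag` element in one HTML string."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return parser.results


def scrape_pages(urls, tag, fetch=None):
    """Return the text of every `tag` element across all pages in `urls`.

    `fetch` is a hypothetical hook mapping a URL to an HTML string; it
    defaults to a plain urllib download.
    """
    if fetch is None:
        def fetch(url):
            return urlopen(url).read().decode("utf-8", errors="replace")
    texts = []
    for url in urls:
        texts.extend(extract_text(fetch(url), tag))
    return texts
```

Passing the tag name as a parameter is one way to make the extraction "more flexible"; with Beautiful Soup the same idea would usually be expressed as a selector supplied per page.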

tommeagher commented 6 years ago

@ConorIA cool, let me know when you're done and I'll take a look.

ConorIA commented 6 years ago

@tommeagher, I think that it is now as good as it is going to get. I bet there is much more that could be done, but I have exhausted my limited knowledge.

The last commit I made was a little presumptuous. Essentially, it allows one to scrape a website and use Twitter as a source as well. If you would rather have these sources siloed, let me know, and I will revert.
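To make the "combined versus siloed" question concrete, here is a minimal sketch assuming each source is a function that returns a list of strings and is switched on by its own flag. The names `build_corpus`, `sources`, and `enabled` are hypothetical and are not taken from the repository's settings.

```python
def build_corpus(sources, enabled):
    """Merge the text from every enabled source into one list.

    `sources` maps a source name to a zero-argument loader function;
    `enabled` maps a source name to True or False. Both are hypothetical
    structures used only for this sketch.
    """
    corpus = []
    for name, loader in sources.items():
        if enabled.get(name, False):
            corpus.extend(loader())
    return corpus


# Stub loaders standing in for a real scraper and a real Twitter client.
sources = {
    "web": lambda: ["A scraped headline"],
    "twitter": lambda: ["a tweet", "another tweet"],
}

combined = build_corpus(sources, {"web": True, "twitter": True})
siloed = build_corpus(sources, {"web": True, "twitter": False})
```

With both flags on, the sources feed one shared corpus, which matches the behavior described above; turning one flag off gives the siloed behavior without reverting anything.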

tommeagher commented 6 years ago

Superseded by #41