lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Ideas from Web Archives 2015 #161

Closed: mcburton closed this issue 8 years ago

mcburton commented 8 years ago

As requested by Ian, I am opening up an issue for discussing ideas from our ad-hoc "spark warcshop" 😉

mcburton commented 8 years ago

So I would like to see more integration with PySpark and Jupyter Notebooks. Basically, I'd love to see a workflow that integrates warcbase with this tutorial on getting PySpark running with Jupyter (https://www.dataquest.io/blog/installing-pyspark/).

jrwiebe commented 8 years ago

Stay tuned! :)

On Fri, Nov 13, 2015 at 11:53 AM, mcburton notifications@github.com wrote:

So I would like to see more integration with pySpark and Jupyter Notebooks. Basically, I'd love to see a workflow that mirrors this tutorial on getting pySpark running with Jupyter https://www.dataquest.io/blog/installing-pyspark/, but also have the warcbase functions available too.

ianmilligan1 commented 8 years ago

Just to give a sense of what we're doing, I created this wiki page for our workshop. Really appreciate the suggestions! // @lintool

ianmilligan1 commented 8 years ago

We had some issues in the notebook with loadWarc vs loadArc. We should write some sample scripts with the former.

Link extraction should be up and running in the notebook too.

Also discovered that you can share notebooks on GitHub if they are saved as an IPython notebook, i.e. here. It didn't save the graphics, though.

mcburton commented 8 years ago

warcbase as a Jupyter kernel (in Python and/or Scala). http://jupyter.readthedocs.org/en/latest/subprojects.html#kernels

lintool commented 8 years ago

We had some issues in the notebook with loadWarc vs loadArc. We should write some sample scripts with the former.

If it's a bug, please open an issue.