scicloj / tutorials

A repo for hosting Clojure data science tutorials created by the community

check my first version of rotten-tomatoes analysis #5

Open behrica opened 5 years ago

behrica commented 5 years ago

I added a first version, just to see how it goes. It's in my fork here: https://github.com/behrica/tutorials/tree/master/src/drafts

I also added the data files it needs.

So you should be able to "run" it by using "jupyter notebook". Please try it.

All dependencies are added dynamically at the beginning of the notebook. I was not sure whether to commit it with all result cells "empty" or not.

I am not sure how exactly GitHub handles IPython notebook files with respect to images and embedded HTML... And indeed, the vega-lite plots don't appear.

So I thought we should publish an HTML version as well. It looks better, as the plots are shown. This needs to be on a "GitHub site", so I put it on my personal one for the time being:

https://behrica.github.io/tutorials/rotten-tomatoes-sentimen-analysis/sentiment-analysis-rotten-tomatoes.html

Please send me your comments.

behrica commented 5 years ago

I just saw that github provides a little button inside the rendered notebook file, which says "render via external nbviewer" and that renders it in full, including the vega plots. That's great, so we don't need the html version.

daslu commented 5 years ago

Wonderful, @behrica! That is so nice.

It works on my machine after installing the latest version of clojupyter ("0.2.1-SNAPSHOT").

IMHO, we should avoid putting large data files (or any large files) in the repo. They make the whole git experience slower, and deleting the files later does not remove them from the history. It would probably be better to add a script (or Clojure function) that fetches the data from its source, like @cnuernber did here: https://github.com/cnuernber/ames-house-prices/tree/master/scripts

For safety, we can add the data subdirectory to .gitignore. Does that seem reasonable?
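A minimal sketch of such a fetch script, written in Python purely for illustration (a Clojure function would follow the same shape). The function name `ensure_data` and the `.part` temporary suffix are assumptions made for this example and are not taken from the linked repo.

```python
import shutil
import urllib.request
from pathlib import Path


def ensure_data(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path."""
    dest = Path(dest)
    if dest.exists():
        return dest  # already fetched, skip the download
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest
```

With the data subdirectory listed in .gitignore, a notebook could call something like `ensure_data(source_url, "data/reviews.tsv")` in its first cell (both arguments are placeholders here), so the files are fetched on first run and never committed.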

Regarding large rendered notebooks, we need to think of a solution. Maybe having them all rendered in one repo would be too heavy.

behrica commented 5 years ago

OK, maybe this should be added to the guidelines: please don't commit data files.

I will find a way to change my notebook to download the data.

Regarding rendered notebooks (HTML or others): yes, they can become big because of the plots. With Vega, the data is always embedded in the JavaScript for the plot, so that adds to the size as well. I am not sure what to do with them.

For me the ipynb file is already big, because Vega embeds the data in it every time, unless I actively "clear" all cells before the last save...
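As a rough sketch of what "clearing" does to the file: an ipynb file is JSON, and in the version 4 notebook format each code cell carries an `outputs` list and an `execution_count`. The helper names below are made up for this example, and Python is used only for illustration.

```python
import json


def clear_outputs(notebook):
    """Drop outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_file(path):
    """Rewrite the notebook at `path` with all code-cell outputs removed."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

The trade-off is that a cleared notebook shows no plots when rendered on GitHub or nbviewer, so this only helps if the rendered version lives somewhere else.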

That's a Vega-specific thing. It can be avoided by saving the data to a file first and having the Vega spec point to it... Not ideal for interactive work.
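To make the difference concrete: a Vega-Lite spec can either carry its rows inline under `data.values` or reference a file via `data.url`. The sketch below (Python for illustration only, with a hypothetical helper name `externalize_data`) moves inline rows into a JSON file and returns a slimmed spec. It assumes the resulting path is reachable by whatever front end renders the plot, which depends on the setup.

```python
import json
from pathlib import Path


def externalize_data(spec, data_path):
    """Move inline `data.values` of a Vega-Lite spec into a JSON file.

    Returns a new spec whose `data` uses `url` instead of `values`;
    the input spec is left unmodified.
    """
    values = spec.get("data", {}).get("values")
    if values is None:
        return spec  # nothing inline to move
    data_path = Path(data_path)
    data_path.parent.mkdir(parents=True, exist_ok=True)
    data_path.write_text(json.dumps(values), encoding="utf-8")
    # Keep any other data settings (e.g. "format") and swap values for url.
    data = {k: v for k, v in spec["data"].items() if k != "values"}
    data["url"] = data_path.as_posix()
    slim = dict(spec)
    slim["data"] = data
    return slim
```

The notebook then stores only the short spec, at the cost of an extra write step before each plot.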

There are "tons" of possibilities for where people might want to host their rendered HTML files. Maybe we leave it to them, and just allow a "link" in the table.

alanmarazzi commented 5 years ago

You just hit one of the major limitations of notebooks: versioning simply doesn't work as intended with them. I guess we have to live with that if we want to show off stuff directly from GitHub and/or nbviewer (or https://mybinder.org/), but size shouldn't be an issue. There are very large repos with a lot of notebooks and they work pretty well.

daslu commented 5 years ago

Thanks @behrica, good idea, I added the guideline.

@alanmarazzi it is good to hear that you experience no problems with large notebooks (I remembered something different, but maybe it was an extreme case).

So let us, for now, keep working with notebooks rendered in git, and if and when we meet a problem, we can think what to do.

alanmarazzi commented 5 years ago

This is a nice example of what is achievable in terms of tooling/presentation and size: https://github.com/jakevdp/PythonDataScienceHandbook

ezmiller commented 5 years ago

> You just hit one of the major limitations of notebooks: versioning simply doesn't work as intended with them. I guess we have to live with that if we want to show off stuff directly from GitHub and/or nbviewer (or https://mybinder.org/), but size shouldn't be an issue. There are very large repos with a lot of notebooks and they work pretty well.

Innocent question here: Was using gorilla-repl ever something you all considered for the tutorials? I've never used it and therefore only have a sketchy understanding of how it fits in, but I noticed that, unlike .ipynb files, its worksheets are saved as something like normal Clojure code. So versioning might work a bit better...

daslu commented 5 years ago

Thanks @ezmiller, you're right, versioning would be better with text-based formats such as gorilla and org-mode.

I guess it is not a reason not to use Jupyter here, but indeed a limitation to keep in mind.

ezmiller commented 5 years ago

Yes. Definitely not a reason to change anything in the work done here by @behrica :). I was just curious if there'd been a discussion more generally about gorilla-repl. I think I'll take this discussion to Zulip.