Merging Intermediate Python Material from LBL-WISE Bootcamp

abostroem commented 10 years ago

For the LBL-WISE bootcamp I modified and reformatted the intermediate python material.

01-intro-python.ipynb
- Combined the modularization and intro notebooks. This notebook is meant to be the whole learn Python / best practices lesson
- Switch from Pandas to numpy
04-testing_master.ipynb
- This is a testing module more focused on unit testing and nosetests. It incorporates some concepts from 03-qa.ipynb. I'm sure it could use some improvement.
- This is more of a novice than an intermediate testing module, but most intermediate users are novice testers
- This material builds on that taught in the 01-intro-python.ipynb lesson
plot_temperature.py - written as part of the 01-intro-python.ipynb lesson
test_plot_temperature.py - written as part of the 04-testing_master.ipynb lesson

gvwilson commented 10 years ago

Looks good to me - @ethanwhite, can you have a quick look, and if you approve, we'll merge.

ethanwhite commented 10 years ago

Apologies for the delay. Lots of travel, no time to keep up with email. I should be able to take a look at this by the end of the week.

ethanwhite commented 10 years ago

nbviewer links: http://nbviewer.ipython.org/github/abostroem/bc/blob/master/intermediate/python/01-intro-python.ipynb http://nbviewer.ipython.org/github/abostroem/bc/blob/master/intermediate/python/04-testing_master.ipynb

ethanwhite commented 10 years ago

Thanks for this @abostroem! I think there are two bigger questions and some details that are worth discussing. I'll add the bigger questions now and some more detailed feedback later.

Do we move to numpy? It's a more general data structure, which is good, but it also has two downsides. First, it is a worse data structure for this kind of tabular data, and second, it relates less well to data frames in R. The latter issue will make it more difficult to maintain related sets of material in R and Python and will make it more difficult for R users to make the switch to Python. As I've said in a couple of places, I'm definitely open to making the change because it sets the user up to benefit from more general uses of numpy arrays, but I think it's worth having a public conversation about the pluses and minuses first.
Should the two notebooks be combined? I was modeling off of the other materials in keeping these two notebooks separate, and those notebooks keep the material broken out into chunks of about the right size to do the entire notebook before taking a coffee/lunch break. I think this was a good choice on Greg's part and therefore my preference would be to keep the material separated out into the original pieces. Again, I'm certainly open to discussion, but with no justification given I'm unclear on what the benefits of this choice may be.

ethanwhite commented 10 years ago

Minor suggestions on the Testing notebook (which looks awesome overall!):

Single Use Code

"Sometimes we write code" not "codes"

Repetitive Use Code

"Repetitive" not "Repetative"
"Functional programming" actually means something different to some folks
"that you can run" instead of "which"
"expect it TO in"

TDD

"when youR program"

Testing the product

[3] this line only works in IPython, which could be confusing

...oops, time to go pick up my daughter from pre-school. More later.

abostroem commented 10 years ago

@ethanwhite I'll make those detailed changes later this week. To address your 2 larger questions:

Numpy vs Pandas

I chose Numpy over Pandas for a few reasons:

I think it is a simpler concept for users. Most scientists I've encountered are familiar with arrays (and most expect lists to work like numpy arrays), so introducing them has a very low barrier.
Most other packages I've used (mostly scipy and astropy) use numpy arrays. I haven't played around with how they handle Pandas data structures. So teaching numpy rather than pandas gives the users an introduction to tools that can be extended to other packages.
I didn't like the slicing peculiarities of Pandas
I wasn't ready to talk about classes and attributes as my first topic
The read_csv function seemed very specialized (although more reliable). I'm sure other functions exist in Pandas for other formats, but it means the student has to look up each one.
There is a clear mapping of data to variable. Instead of plot(table['col1'], table['col2']), you can write plot(col1, col2)
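The multi-value assignment described in the last point can be sketched as follows. This is only an illustration, not the lesson's code: the column names and the inline data are made up, and an in-memory string stands in for a data file.

```python
import io

import numpy as np

# Hypothetical two-column CSV standing in for the lesson's data file.
csv_text = "year,temperature\n2000,10.5\n2001,11.0\n2002,11.5\n"

# unpack=True transposes the result so that each column
# can be bound to its own variable in a single assignment.
year, temperature = np.loadtxt(
    io.StringIO(csv_text), delimiter=",", skiprows=1, unpack=True
)

print(year)
print(temperature.mean())
```

With the columns bound to separate names, a plotting call can take `year` and `temperature` directly instead of indexing into a table.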

Single Lesson vs Multiple Lessons

I've found a lot of overhead switching between notebooks and I tend to favor lessons that build on each other, so my preference is for a large notebook for each section. To me the biggest downside is that it is harder to maintain. I am happy to discuss and defer to the group at large. When I was preparing I copied and pasted the lessons together (and rearranged them a bit if I remember correctly). I submitted them in the form I taught them.

ethanwhite commented 10 years ago

In the testing notebook:

It looks like [25] and [26] are duplicates.
I'd recommend starting with a test that doesn't require looping (i.e., test a single year first). This keeps things simple and avoids the need to introduce zip, which I find can be fairly confusing for students the first time they see it.
The approach to looping over multiple values will (I believe) cause the test to stop executing after the first failure. So if testing multiple values we typically want to do something along the lines of https://github.com/ethanwhite/progbio/blob/master/lectures/testing.md#testing-multiple-values. That said, I'm definitely not an expert here so feel free to ignore me if I'm off base on this.
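As a rough sketch of the two suggestions above (a single-value test first, then multiple values that keep running after a failure), here is one way to do it with the standard library's unittest and its subTest context manager. This is a stand-in technique, not the nose-based approach used in the notebook or in the linked lecture, and fahr_to_celsius is a hypothetical function used only for illustration.

```python
import unittest


def fahr_to_celsius(temp_f):
    """Convert Fahrenheit to Celsius (hypothetical function under test)."""
    return (temp_f - 32) * 5 / 9


class TestFahrToCelsius(unittest.TestCase):
    def test_single_value(self):
        # Start with one known value: no loop and no zip needed.
        self.assertAlmostEqual(fahr_to_celsius(32), 0.0)

    def test_multiple_values(self):
        # Each subTest is reported separately, so one failing pair
        # does not stop the remaining pairs from being checked.
        cases = [(32, 0.0), (212, 100.0), (-40, -40.0)]
        for temp_f, expected in cases:
            with self.subTest(temp_f=temp_f):
                self.assertAlmostEqual(fahr_to_celsius(temp_f), expected)


# Run the tests explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFahrToCelsius)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A plain loop of assert statements would stop at the first failing pair; here every pair is evaluated and each failure is listed with its temp_f value.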

With those things addressed I think the testing notebook is ready to go, and I would be +1 for merging it via a separate pull request.

ethanwhite commented 10 years ago

I'm fine with moving to Numpy, but I'd prefer to use Numpy structured arrays rather than using multi-value assignment to split the columns up into separate variables (the latter doesn't really work with large numbers of columns).
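A minimal sketch of the structured-array alternative, assuming a CSV file with a header row. The column names and values below are invented for illustration and an in-memory string stands in for the file.

```python
import io

import numpy as np

# Hypothetical CSV with a header row; not the lesson's actual data.
csv_text = (
    "year,temperature,mosquitos\n"
    "2000,10.5,80\n"
    "2001,11.0,95\n"
    "2002,11.5,110\n"
)

# names=True takes the field names from the header row, producing a
# structured array whose columns are accessed by name.
data = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", names=True, encoding="utf-8"
)

print(data.dtype.names)
print(data["temperature"].mean())
```

All columns stay in one object however many there are, and data["temperature"] reads much like the table['col1'] style of access in Pandas.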

I personally prefer bite-sized chunks for notebooks because it makes it easier to use these resources externally (as I'm currently doing in my university courses) if we don't aggregate material too much.

abostroem commented 10 years ago

Ok, I haven't gotten to the Testing materials (and the 03-qa.ipynb) but I split the one intro into multiple sections which kind of parallel what you originally had (except using Numpy). Here are the details of what I've changed:

Ported everything from Pandas to Numpy

Intro

Added more on types and loops to the intro section

Plotting

This is a new section and deals explicitly with visualizing data

Modularization

I narrowed the scope of this section to just writing functions
Moved plotting and line fitting to the previous section
Moved the conversion from F to C to the intro section
Replaced the square function with a function to convert F to C, then replaced the challenge with writing a plotting function
Removed the section on call stacks - this seemed like it could confuse students (it is one of those pieces of information that is useful, but only after you've done it a few times). It felt very abstract. But I'm open to being convinced otherwise
Included discussion of documentation in the process of writing functions
Moved testing discussion to a separate section
Removed explicit discussion of looping over files (although kept the concept of using a variable for filename so you could input any file)
Added part at end to move functions into a file and demonstrate the difference between running from the command line and importing into an interpreter
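For readers unfamiliar with the last item, the difference between running a file from the command line and importing it can be sketched like this. The function and the example values are hypothetical stand-ins, not the contents of the lesson's file.

```python
def fahr_to_celsius(temp_f):
    """Convert a temperature in degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


if __name__ == "__main__":
    # This block runs only when the file is executed as a script
    # (for example with "python <filename>"); it is skipped when the
    # file is imported, so importing only defines the function.
    print(fahr_to_celsius(212))
```

Imported into an interpreter, the module prints nothing and fahr_to_celsius can be called interactively; run from the command line, it prints the converted value.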

I broke my sections into: 01-intro-python 02-plotting 03-modularization

Given that we have been discussing creating parallel numpy and pandas material, and that with my new files this is no longer a direct update of the previous files, how should we proceed with this?

ethanwhite commented 10 years ago

Given the overall response to your email on discuss, and the impending proposal to split up bc into separate repos and allow folks to maintain parallel versions of lessons in separate repos, I'd recommend that we go with two repos, one for numpy and one for pandas. This will also let the numpy repo move towards a set of data and analyses that makes more sense to other groups if you wanted to do so (in the past folks who wanted to use numpy instead of pandas also didn't like the inclusion of regression; again, different scientific cultures/needs).

If we go this route then if/when bc gets split up we'd need to make two repos based on the current intermediate Python material and your changes would go into the numpy one.

I'd recommend splitting the testing material out into a separate PR. This could easily go in now (even without the additional changes I've suggested) and then get improved upon from there.

tbekolay commented 9 years ago

+1 to having two versions of this around. I presented the NumPy version last week, and it was generally well received (I had a slightly better experience than with the novice materials, in any case, as to me the mosquito data is easier to explain than the inflammation data).
