ireapps / pycar

NICAR Python mini boot camp
https://ireapps.github.io/pycar/pycar_intro.html
MIT License

Refactor project 2 #38

Closed tommeagher closed 7 years ago

tommeagher commented 8 years ago

The class this year was almost completely overwhelmed by project 2. It started to go astray somewhere around here.

There are many new concepts being introduced here:

I've always liked that this project seems more explicitly about data cleaning and analysis than any of the others, but we need to think about how we can reorganize and simplify this exercise.

scott2b commented 8 years ago

So, I have a confession. I did not really scrutinize the code for projects 1 and 2. I had made the assumption that everything would progressively lead up to project 3. As it turns out, it seems to me that there is much more complex and arcane material here in project 2 than in 3. The points that Tom makes here are spot on. Arcane dictionary updates via kwargs (when we have not even taught concepts of args and kwargs) strike me as undesirable and unnecessary.

I think that some of the identified issues can still be taught, but they need to be properly introduced. There seems to be a more fundamental issue at hand that the projects don't really build on each other in an incremental way. Not really.

With regard to these specific issues:

  1. dict syntax should be kept simple.
  2. functions should be introduced earlier - the * and ** (args and kwargs) syntax could probably be avoided altogether, although at least standard positional arguments should be taught
  3. I am uncertain about try blocks. They are sort of important, but easy enough to avoid in code at this level. I'd maybe leave them to a refactoring session only if there is time at the end of the day.
  4. type coercion should be taught, and could easily be introduced much earlier in the day when the concept of type is introduced in general. It is particularly important (for this group) to be able to convert strings to numbers and vice versa.
  5. Nested for loops could probably be avoided altogether.
  6. I am uncertain about joins. If people find joins confusing, it might be best to avoid them. It does not seem like a particularly advanced subject, but since a lot of our output is CSV written with the available csv utilities, I'm not sure joins are really necessary. Furthermore, it could be instructive to build a simple function that joins the contents of a list with a delimiter. While more verbose, this might actually be a gentler introduction to the concept of joining, since students could actually see what the code does.
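A minimal sketch of the hand-written join floated in item 6, which also touches the string/number conversion from item 4. The function name `join_with` and the sample row are made up for illustration; they are not taken from the project code.

```python
def join_with(items, delimiter):
    """Glue the items of a list together, putting delimiter between them."""
    result = ""
    for position, item in enumerate(items):
        if position > 0:
            result += delimiter
        # Numbers must be converted to strings before they can be concatenated.
        result += str(item)
    return result


row = ["Jane", "Doe", 1980, 500000]  # made-up sample row, not real data
line = join_with(row, ",")
print(line)

# The built-in equivalent, once every value has been turned into a string:
builtin_line = ",".join(str(value) for value in row)
print(line == builtin_line)

# Going the other way: text read from a CSV becomes a number explicitly.
year = int("1980")
salary = float("500000")
```

Writing the loop out by hand shows students what `str.join` does for them, at the cost of a few extra lines.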

richardsalex commented 8 years ago

After some reflection and feedback from attendees, I'm not confident that a refactor is going to make this workable. Project 3 should probably be moved to Project 2's spot (I was leaning this way before PyCAR and in hindsight wish I had suggested it for the group to consider), and I believe we need to rethink this as a simpler, practical project that introduces some of the necessary concepts for beginners, like dictionaries and making your own function, without attempting to cover args / kwargs, keying data and joins, especially as they relate to the chosen baseball data. There's a one-to-many relationship between the files that we didn't even get a chance to address.

scott2b commented 8 years ago

The reorganization makes sense to me. I am, however, a bit wary of venturing at all into the realm of relatedness, cardinality, etc. I sense this might be a bit much for a one-day workshop. It introduces a fair amount of complexity. Whereas, if our target is people who understand Excel, maybe it would make sense to take a simpler approach to data extraction, extracting into multiple CSVs and handling the relatedness directly in Excel. It's just a thought -- I don't think we should be teaching Excel, but it might be just an example that ties into existing knowledge. Some basic data manipulations are certainly within scope -- and maybe even a multi-pass thing that requires joining disparate sources. I'm just saying, we should err on the side of simple.

zufanka commented 7 years ago

If we would like this project to still be about cleaning and analysis, how about we do some pandas? Pandas was made for the job. On top of that: once they go out there and start using Python in the "real world" there is no way around using libraries anyway.

esagara commented 7 years ago

My concern with incorporating pandas is that it has a pretty complex API with some potentially confusing sticking points such as dataframes and arrays. And every time I've worked with pandas I've ended up dipping into numpy...

zufanka commented 7 years ago

You are right. I believe that after seeing just pure Python, pandas could be quite confusing at first. I have recently taught pandas to journalists with zero prior knowledge of programming (see for example this sheet). As you can see, I kept it very practical, really interviewing the data. It worked quite well in the sense that they:

  1. understood the logic behind it quite fast
  2. understood how and why it can be useful, which is a question you will get after most coding lectures for journalists.

It is true that this was a 3-hour lesson, but on the other hand I was also the only one there to help with every error that anyone ran into.

tommeagher commented 7 years ago

I think for all the reasons outlined above, we should drop the "project 2" we did last year with the baseball stats.

Looking at the schedule now, we might have an hour at the end of day two for another project to add to the mix. As @esagara and @zufanka point out, that might not be enough time to really get people comfortable with pandas, especially when they're still struggling to wrap their heads around the basics.

Is there another, small exercise that we should consider that'd involve some data munging or analysis that would appeal to CAR types? I'm open to suggestions.

robroc commented 7 years ago

@tommeagher I'm very new to this thread and haven't taken a good look at the projects. But instead of pandas, what if you introduced @onyxfish's Agate library for data work? It works well in Jupyter and was made specifically to be more explicit in syntax. I think Matt Waite uses it in his class and he swears by it.

hbillings commented 7 years ago

Back in The Day, the genesis of this project was to introduce different datatypes people would encounter. I haven't yet looked at how everything evolved last year, but I'm wondering if maybe this time would be well spent doing some much smaller exercises that demonstrate things like how to check for a key in a dict, how a list is different from a dict, how to iterate over things, etc. Without a solid understanding of how data is structured in Python, people are going to have a hard time using any library. (That said, Agate is pretty wonderful. Been ages since I used it, but I remember it being fairly straightforward.)
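The kind of small exercise described here might look like the sketch below. The sample values and variable names are made up for illustration and are not taken from the existing projects.

```python
# A list keeps values in order and is indexed by position.
teams = ["Cubs", "Yankees", "Red Sox"]
first_team = teams[0]

# A dict maps keys to values and is indexed by key.
player = {"nameFirst": "Jane", "nameLast": "Doe", "salary": "500000"}

# Check for a key before using it, instead of risking a KeyError.
if "salary" in player:
    salary = int(player["salary"])
else:
    salary = None

has_birth_year = "birthYear" in player

# Iterating over a list yields its items one at a time...
for team in teams:
    print(team)

# ...while iterating over a dict's items() yields key/value pairs.
for key, value in player.items():
    print(key, value)
```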

zufanka commented 7 years ago

I agree that a solid understanding of data types is important to be an efficient programmer. However, I was writing scrapers for some time before I ever learned how to actually use dictionaries. Therefore I believe that it is not essential for applied usage.

What is the main focus of PyCar? Learning Python step-by-step for a solid basis or an applied usage of Python for journalists?

tommeagher commented 7 years ago

The main focus is definitely getting journalists enough knowledge to be comfortable using Python and basic programming in their reporting. We're not training engineers; this is all about journalism.

That said, if we can impart best practices that will make their lives easier down the road, we want to do that. We will deal with dicts briefly in the intro, so we don't necessarily have to go into great depth with them later in the day.

The question is this: is Agate the best way to wrap up the day with a data focus, and can we reasonably do enough of it in ~45 mins to make it meaningful to programming newcomers?

If the answer to that is, "Yes," my next question is, "Who can/will write the code and teach that lesson?"

zufanka commented 7 years ago

The question is this: is Agate the best way to wrap up the day with a data focus

Not sure about the best way. But it might be refreshing to end the day with some simple one liners.

can we reasonably do enough of it in ~45 mins to make it meaningful to programming newcomers?

I believe we can. But it can't go too deep, of course.

"Who can/will write the code and teach that lesson?"

If nobody disagrees I'd be happy to tackle this.

tommeagher commented 7 years ago

@zufanka, if you're willing to give it a shot, go for it. Do you think you could put something together in the next week or two, so we can all give it a look and weigh in before the conference?

zufanka commented 7 years ago

@tommeagher sure! I have been wanting to look into agate for some time anyway. Could you maybe help me with suggesting an interesting / relevant data set for the audience?

robroc commented 7 years ago

If I may suggest another useful tool for working with data: Counter in collections, for finding the most common elements in a list, or a column of data. It could also show how rich the standard library is on its own.
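A minimal sketch of that idea, using a made-up list standing in for a column of data (the values and the `birth_states` name are illustrative only):

```python
from collections import Counter

# Hypothetical column of values, e.g. one entry per row of a CSV.
birth_states = ["CA", "TX", "CA", "FL", "CA", "TX"]

# Counter tallies how many times each value appears.
counts = Counter(birth_states)

# most_common(n) returns the n most frequent values with their counts.
top_two = counts.most_common(2)
print(top_two)
```

No third-party library is needed for this; `collections` ships with Python.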

zufanka commented 7 years ago

Please see project 2 refactored in agate and in pandas

I have spent several hours looking into agate. It took some effort to get into: while pandas is really terse, agate is quite verbose, forming something like sentences in the code. I see how that adds to its readability for humans (which is what agate is supposed to do).

For example:

Who are the top 10% highest-paid players?

agate:

    percentiles = joined.aggregate(agate.Percentiles('salary'))
    top_ten_percent = joined.where(lambda r: r['salary'] >= percentiles[90])
    ordered = top_ten_percent.order_by('salary', reverse=True)
    ordered.select(["nameFirst", "nameLast", "birthYear", "birthState", "salary"]).print_table()

pandas:

    top_10_p = deduplicated["salary"].quantile(q=0.9)
    best_paid = deduplicated[deduplicated["salary"] >= top_10_p]
    best_paid.nlargest(10, "salary")[["nameFirst", "nameLast", "birthYear", "birthState", "salary"]]

My preference nevertheless still goes to pandas, for three reasons:

  1. pandas has a bigger user base = more people with problems asking on the internet about it. While learning agate I could only find a single entry on Stack Overflow. It's much easier to adopt a library into your workflow if a lot of the questions you may have are 'googleable'.
  2. There are multiple salary entries for many of the players. I could not figure out how to choose the latest entry or otherwise handle row (de)duplication in agate.
  3. Some of the operations are way more straightforward to do in pandas. (agate code below adapted from the cookbook)
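To make the deduplication in point 2 concrete, here is a sketch of the "keep the latest entry per player" logic in plain Python, independent of either library. The column names `playerID` and `yearID` and all sample rows are assumptions made for illustration; this is not the code from the notebooks.

```python
# Made-up salary rows; one player appears twice with different years.
salaries = [
    {"playerID": "doej01", "yearID": 2013, "salary": 500000},
    {"playerID": "doej01", "yearID": 2014, "salary": 750000},
    {"playerID": "roea01", "yearID": 2014, "salary": 600000},
]

# Keep one row per player: the one with the most recent year.
latest = {}
for row in salaries:
    player_id = row["playerID"]
    if player_id not in latest or row["yearID"] > latest[player_id]["yearID"]:
        latest[player_id] = row

deduplicated_rows = list(latest.values())
print(deduplicated_rows)
```

Whichever library is chosen, the result should match this shape: exactly one salary row per player.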

Examples of the same thing (see also the sheets)

What are the most common baseball players' salaries?

agate:

    binned_salaries = joined.bins('salary', 10)
    binned_salaries.print_bars('salary')

pandas: deduplicated.hist("salary")


What is the average (mean, median, max, min) salary?

agate:

    joined.aggregate(agate.Mean('salary'))
    joined.aggregate(agate.Median('salary'))
    joined.aggregate(agate.Max('salary'))
    joined.aggregate(agate.Min('salary'))

There is a way to do several at once, but it gave me an error that I could not fix.

pandas: deduplicated["salary"].describe()


Curious what you think!

tommeagher commented 7 years ago

@zufanka this is awesome! Thanks so much for taking this on, and you did it in Jupyter notebook, which is great. Looking at the code, I like the pandas version personally, but that's probably because I've used pandas a lot more than agate. What do others think?

hbillings commented 7 years ago

Haven't gotten a chance to review yet, but two thoughts here:

  1. I don't think we should be beholden to the project structure, data or examples. If the aim of this project is to introduce working with a particular library, maybe there are larger chunks of this that need to change.
  2. What if we break between projects 1 & 2 to describe the Python ecosystem a little bit better? It might help to talk about libraries in the abstract, what they are, how to find them, etc. before we jump into using one. (I remember taking Serdar & Jeremy's intro class back in the day and being super confused about what the hell this thing called BeautifulSoup was.)

EDITED TO ADD: part of my thinking on that last bit is that it would give us a nice high level/hands-on/high level/hands-on rhythm.

zufanka commented 7 years ago

Thank you for your thoughts Heather.

  1. I do think that the project data and objectives are perfectly valid for the students' needs: it does a bit of data cleaning and joining of two data sets, which is really useful in practice. But please have a look and feel free to adjust.

  2. Sounds like a good idea!

tommeagher commented 7 years ago

Closed by 123e83bdf530b2cc067bd24e748894f7108316e2