ireapps / coding-for-journalists

A repo to support IRE's multi-day Python bootcamp for journalists
http://www.ire.org
MIT License
48 stars 16 forks

Give examples using easier libraries such as rows #15

Open turicas opened 8 years ago

turicas commented 8 years ago

Hello,

I'm working on a library that makes working with tabular data pretty easy, no matter the format: CSV, XLS, XLSX, HTML etc. It's called rows. I think it would be great to add a section with examples using this kind of library, since learners can access data with simple commands and don't need to understand the format upfront.

An example: reading the CSV file from coding-for-journalists/2_web_scrape/completed/fun_with_csv_done.py with rows is as easy as:

import rows

# Each row exposes its columns as attributes
for row in rows.import_from_csv('my_test.csv'):
    print(row.FIRSTNAME, row.CITY)

If the same data were only available as XLS, you could use this code:

import rows

# Same loop as the CSV version; only the import function changes
for row in rows.import_from_xls('my_test.xls'):
    print(row.FIRSTNAME, row.CITY)

So the interface is the same, no matter the format. I think that helps people who are learning the basics -- then they can dig deeper and learn more about each specific format.

Note: rows automatically identifies and converts the data. In this case there are just strings, but it will convert to int, float, datetime.date, datetime.datetime, among other types, if it detects that kind of information in the file, and this is true for all available formats. So you don't need to explain data conversion upfront but can show examples of converted data being analyzed, which is very motivational.

richardsalex commented 8 years ago

Thanks for the suggestion; it's something to definitely think about going forward as this evolves.

On Tue, Jul 19, 2016 at 7:52 AM Álvaro Justen notifications@github.com wrote:

Hello,

I'm working on a library that makes working with tabular data pretty easy, no matter the format: CSV, XLS, XLSX, HTML etc. It's called rows https://github.com/turicas/rows. I think it would be great to add a section with examples using this kind of library, since learners can access data with simple commands and don't need to understand the format upfront.

An example: reading the CSV file from coding-for-journalists/2_web_scrape/completed/fun_with_csv_done.py with rows is as easy as:

import rows

# Each row exposes its columns as attributes
for row in rows.import_from_csv('my_test.csv'):
    print(row.FIRSTNAME, row.CITY)

If the same data were only available as XLS, you could use this code:

import rows

# Same loop as the CSV version; only the import function changes
for row in rows.import_from_xls('my_test.xls'):
    print(row.FIRSTNAME, row.CITY)

So the interface is the same, no matter the format. I think that helps people who are learning the basics -- then they can dig deeper and learn more about each specific format.

Note: rows automatically identifies and converts the data. In this case there are just strings, but it will convert to int, float, datetime.date, datetime.datetime, among other types, if it detects that kind of information in the file, and this is true for all available formats. So you don't need to explain data conversion upfront but can show examples of converted data being analyzed, which is very motivational.


dannguyen commented 7 years ago

Adding my input for what I teach journalism students in a quarter-long course: I like to keep things as "plain" as possible. Stick to builtins when possible, e.g. csv, and work with the most common Python data structures, e.g. a list of dict objects returned from csv.DictReader().
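For comparison with the rows snippets earlier in the thread, here is a minimal sketch of the "plain" approach using only the standard library. The sample data is made up, and the FIRSTNAME and CITY column names are borrowed from the earlier example rather than taken from the actual course file.

```python
import csv
import io

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE_CSV = "FIRSTNAME,CITY\nAda,Columbia\nGrace,St. Louis\n"


def read_rows(text_file):
    """Return every row of an open CSV file as a list of dicts."""
    return list(csv.DictReader(text_file))


# io.StringIO makes the string behave like an open file; with a real file,
# open('my_test.csv', newline='') would take its place.
people = read_rows(io.StringIO(SAMPLE_CSV))

# A plain for-loop over a plain list of plain dicts.
for person in people:
    print(person["FIRSTNAME"], person["CITY"])
```

Nothing here is hidden behind a wrapper: the learner sees a list, sees dicts, and looks up values by key.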

The rows library looks cool, like other nifty data-wrangling wrappers such as pandas, agate, pudo/dataset etc, but novice programmers don't need easier ways to access attributes of a row object. They need affirmation of lists vs. dicts, str vs int, etc. More importantly, they need to know that data is text. And this requires fundamentally understanding what CSV purports to be, and how this relates to the actual realities of computing: computers don't have "intelligence", they need formats to be able to turn text into data structures, and there is a huge difference between a giant string, and that string deserialized as list/dicts.
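The gap between "a giant string" and "that string deserialized" can be shown in a few lines of builtins. This is only an illustrative sketch: the sample text and the NAME, ZIP, and COUNT columns are invented for the purpose, not drawn from the course materials.

```python
import csv
import io

# Hypothetical CSV content: before parsing, it is one long string.
raw_text = "NAME,ZIP,COUNT\nAda,02134,7\nGrace,65201,12\n"
print(type(raw_text))  # a single str, commas and newlines included

# The csv module applies the format's rules to turn text into structures.
records = list(csv.DictReader(io.StringIO(raw_text)))
first = records[0]

# Every value is still a str until the programmer converts it.
print(type(first["COUNT"]))

# Explicit str -> int conversion, done on purpose for a column of counts.
total = sum(int(record["COUNT"]) for record in records)
print(total)

# Conversion is a decision, not a given: treating a ZIP code as a number
# drops its leading zero.
print(first["ZIP"], int(first["ZIP"]))
```

Doing the conversion by hand keeps the str vs. int distinction in front of the student instead of behind automatic type detection.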

Getting to that concept, and understanding for-loops and iteration, is all I ask of my students for my class. We don't get into making web apps, understanding OOP, doing data analysis or statistics, visualization, etc. -- it's all understanding text, patterns, and loops.

This is informed not just by how I've seen programming learners struggle, but by seeing top-flight, award-winning investigative journalists not have a clue that the CSV they open up in Excel is just text. This misunderstanding of something fundamental is not trivial: it leads to measurable problems when it comes to using that data for investigations.

So, this is all a long way of saying that text, str, dict, and list are just fine for students, IMHO :)