jeroenjanssens / data-science-at-the-command-line

Data Science at the Command Line

Second edition: rationale, changes, outline, and feedback #101

Open jeroenjanssens opened 4 years ago

jeroenjanssens commented 4 years ago

I'm happy to announce that I'll be writing the second edition of Data Science at the Command Line (O'Reilly, 2014). This issue explains why I think a second edition is needed, lists what changes I plan to make, and presents a tentative outline. Finally, I have a few words about the process and giving feedback.

Why a second edition?

While the command line as a technology and as a way of working is timeless, some of the tools discussed in the first edition have either: (1) been superseded by newer tools (e.g., csvkit has been replaced by xsv), (2) been abandoned by their developers (e.g., drake), or (3) been suboptimal choices (e.g., weka). Since the first edition was published in October 2014 I have learned a lot, either through my own experience or through the useful feedback from its readers. Even though the book is quite niche because it lies at the intersection of two subjects, there remains a steady interest from the data science community. I notice this from the many positive messages I receive almost every day. By updating the first edition I hope to keep the book relevant for at least another five years.

Changes with respect to the first edition

These are the general changes I currently have in mind. Please note that this is subject to change.

Book outline

In the tentative outline below, :new: indicates added and :x: indicates removed chapters and sections with respect to the first edition.


In the past five years I have received a lot of valuable feedback in the form of emails, tweets, book reviews, errata submitted to O'Reilly, GitHub issues, and even pull requests. I love this. It has only made the book better.

O'Reilly has graciously given me permission to make the source of the second edition available on GitHub and an HTML version available under a Creative Commons Attribution-NoDerivatives 4.0 International License from the start. That's fantastic because this way, I'll be able to receive feedback during the entire journey, which will make the book even better.

And feedback is, as always, very much appreciated. This can be anything ranging from a typo to a command-line tool or trick that might be of interest to others. If you have any ideas, suggestions, questions, criticism, or compliments, then I would love to hear from you. You may reply to this particular issue, create a new issue, tweet me at @jeroenhjanssens, or email me; use whichever medium you prefer.

Thank you.

Best wishes,


aborruso commented 4 years ago

@jeroenjanssens it's really a great thing, thank you very much.

A note about your "scrape": the only real problem for me was that it did not work on Python 3, and for this reason I built a CLI based on it (so I do not have to think about the environment).

You are right, pup is faster and easier to install, but you cannot run XPath queries with it. I think that if you must use a CLI tool to query HTML pages, it needs to be something that can run both CSS selector and XPath queries, like your GREAT scrape.

jeroenjanssens commented 4 years ago

Thank you @aborruso!

For the second edition I would like to only use tools which can be installed easily through some package manager. So to address your point, I guess we could do two things:

  1. Create a separate package for scrape.
  2. Extend pup such that it accepts XPATH queries.

What do you think?

knbknb commented 4 years ago

Off the top of my head:

I remember that one of the reviewers of the first edition of this book wrote that he very much liked your introduction to GNU parallel. That supposedly was a highlight of the book.

So maybe split chapter 8 into two chapters: one chapter about parallel processing on localhost, and one chapter about parallelization on cloud platforms.

aborruso commented 4 years ago

So to address your point, I guess we could do two things:

  1. Create a separate package for scrape.
  2. Extend pup such that it accepts XPATH queries.

Dear @jeroenjanssens, both are very good points.

But unfortunately I'm above all an end user and not a Python or Go developer. I built the CLI version of scrape using another utility :) So I cannot promise to help you create the package or extend pup :(

If there is a scrape package, it will become a tool which can be installed easily through some package manager.

Once again thank you

iveksl2 commented 4 years ago

Hmmm, maybe something about model deployment? Not sure how it fits into the command line, but some buzzwords to think about: Deep Learning, Optimization, RL?

kwbonds commented 3 years ago

Thanks for your book. Suggest you consider switching to printf instead of echo in the second edition though. Seems it is more stable. I spent a while trying to figure out why echo 'foo\nbar\nfoo' would not recognize the newline characters. printf 'foo\nbar\nfoo' works correctly.
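
A minimal sketch of the difference described above. The cause is that echo's handling of backslash escapes varies by shell: bash's builtin echo prints \n literally unless it is called with -e, while echo in dash or zsh expands it. printf expands the escapes in its format string in every POSIX shell. The variable names and sample data are made up for illustration.

```shell
# printf expands \n in its format string, so this prints three lines.
printf 'foo\nbar\nfoo\n'

# Count the lines to confirm; tr strips the padding some wc versions add.
line_count=$(printf 'foo\nbar\nfoo\n' | wc -l | tr -d ' ')
printf 'lines: %s\n' "$line_count"

# For arbitrary data, pass it as an argument to %s rather than as the format,
# so that % and backslashes in the data are printed literally.
data='100% \n literal'
printf '%s\n' "$data"
```

Keeping the format string and the data separate is the usual reason to prefer printf in scripts that have to run under more than one shell.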

simonw commented 3 years ago

I'm here to advocate for more SQLite coverage.

SQLite is a fantastic tool for command-line data science, because it gives you a full relational database without needing to run a PostgreSQL or MySQL server anywhere - each database exists as a single file on disk.

My sqlite-utils tool (brew install sqlite-utils) lets you pipe JSON or CSV data directly into a database, automatically creating an appropriate schema. You can then run queries and pipe the results out as further JSON/CSV ready to be piped to other processes.
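
A rough sketch of that round trip, assuming sqlite-utils is installed and on the PATH. The database file, the dogs table, and the sample rows are hypothetical, and the guard simply skips the demo when the tool is missing.

```shell
# Sketch only: assumes sqlite-utils is installed; dogs.db, the dogs table,
# and the sample rows are made up for illustration.
if command -v sqlite-utils >/dev/null 2>&1; then
  workdir=$(mktemp -d)
  db="$workdir/dogs.db"

  # Pipe CSV from stdin ("-") into a table; the table is created on first insert.
  printf 'name,age\nCleo,4\nPancakes,2\n' | sqlite-utils insert "$db" dogs - --csv

  # Run a query and emit CSV that can be piped on to other tools.
  result=$(sqlite-utils query "$db" 'select name from dogs order by name' --csv)
  printf '%s\n' "$result"
fi
```

Dropping --csv from the query step makes sqlite-utils emit JSON instead, which may suit a pipeline that continues with a JSON processor.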

While I'm here I'll plug Datasette too (brew install datasette) which gives you an instant localhost web UI for exploring a SQLite database (datasette mydb.db) - and can also export CSV or JSON results of queries back out again.

(Originally discussed on Twitter)

Awannaphasch2016 commented 3 years ago

Here is also a list of command-line-related tools for further reading.

PythonCoderUnicorn commented 2 years ago

Thanks for sharing the link. I appreciate your hard work. Hope to learn a lot