swcarpentry / DEPRECATED-bc

DEPRECATED: This repository is now frozen - please see individual lesson repositories.

What tool(s) do you use to manage data analysis pipelines? #375

gvwilson opened this issue 10 years ago

gvwilson commented 10 years ago

We all have our favorite tools for getting, crunching, and visualizing data; what I'd like to know is, how do you stitch them together? Do you write a shell script? Do you use a Makefile? Do you drive everything from Python, Perl, or R (and if so, how do you handle tools written in other languages)? Do you use web-based workflow tools, and if so, which ones?

selik commented 10 years ago

In general, I prototype everything in an IPython Notebook, then gradually refactor into Python scripts that can chain together via command line pipes. I like to end up with something like this:

$ python collect.py | python wrangle.py | python analyze.py | python report.py

If useful, I'll break out longer parsing, wrangling, and analysis code into their own libraries.
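Each script in that chain is just a filter: it reads on stdin and writes on stdout so the shell can glue the stages together. A rough sketch of one stage (the CSV layout and the "value" column here are purely illustrative, not from a real pipeline):

# wrangle.py -- sketch of a pipe-friendly stage: CSV in on stdin, CSV out on stdout
import csv
import sys

reader = csv.DictReader(sys.stdin)
writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:
    if row["value"]:        # drop records with a missing value (illustrative rule)
        writer.writerow(row)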

LJWilliams commented 10 years ago

I tend to do my data visualization in R most of the time. I like that it makes pretty graphs without too much work. Sometimes I will use Python, but the final graphs from R are generally more attractive. If I need to stitch things together I tend to use a shell script.

sveme commented 10 years ago

For bioinformatics I started out using shell scripts; however, I'm trying to move most of my workflows to Julia, as it provides a neat way of shelling out commands, something that is unfortunately a big part of bioinformatics. Something like this:

using Report, JSON, Gadfly, DataFrames

# Build the Picard command with the backtick syntax; $variables are interpolated.
# (JVM options such as -Xmx go before -jar.)
picardmultiple = `java -Xmx4g -jar $picard/CollectMultipleMetrics.jar INPUT=$infile OUTPUT=$outfile ASSUME_SORTED=false REFERENCE_SEQUENCE=$refseq`

# Execute it; run() raises an error if the command fails.
run(picardmultiple)

Nasty, I know, but that's bioinformatics I guess. You construct shell commands using the backtick operator and then run them, or (and that's the major advantage) detach them or run them in a subprocess.

gvwilson commented 10 years ago

@selik What do you then use to manage those pipelines? Do you store each one in a single-line shell script with a descriptive name for later re-use? Or do you trust yourself to reconstruct the necessary pipelines on the fly?

gvwilson commented 10 years ago

@LJWilliams @sveme So if one of your input data files changes, or if you tweak a parameter, you regenerate affected results manually?

jkitzes commented 10 years ago

I do basically what I teach here, wrapped in some exploratory analysis. The process goes sort of like this (although the real world is, of course, messy and iterative, not linear like the list below):

  1. Explore everything in an IPython notebook, using the pylab magic to quickly look at the data and make/examine graphs.
  2. Once I nail down the analysis steps, refactor the main "scientific" functions/classes into a Python module. Add a do_analysis.py script (loads the module, loads the data file, crunches numbers, and saves output results in some format) and a make_tables.py and/or make_figs.py script (loads the output results from do_analysis and makes them presentable).
  3. Use a controller runall (shell or Python) to run all scripts, in order, and save into a results directory. In some sense this is the pipeline part, although it's often pretty lightweight.
  4. Write manuscript in LaTeX - every time it compiles, it loads updated figures. (Tables are a problem that I've never effectively solved.)

If my inputs change or parameters change, I just change them and run the runall again. This requires, of course, that the entire pipeline not take too long - if it does, I break up the steps and rerun those that change (manually - a Makefile works in principle here, but I don't generally bother). As a failsafe, I always delete my entire results directory and rerun everything before writing/submitting results somewhere, just to make sure I haven't mucked something up.

For alternate simultaneous parameter sets, I'll create multiple results subdirectories and basically run the entire set of steps above within each subdirectory, loading the appropriate parameter file each time. This process is controlled by runall.
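A bare-bones sketch of such a runall controller, with one results subdirectory per parameter set (the stage scripts mirror step 2 above, but the parameter files and the --params/--output flags are invented):

# runall.py -- sketch: run each stage in order, once per parameter set
import subprocess
from pathlib import Path

for params in ["params_default.json", "params_alt.json"]:   # hypothetical parameter files
    results = Path("results") / Path(params).stem
    results.mkdir(parents=True, exist_ok=True)
    for script in ["do_analysis.py", "make_tables.py", "make_figs.py"]:
        # assumes each script accepts --params and --output flags
        subprocess.check_call(["python", script, "--params", params, "--output", str(results)])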

LJWilliams commented 10 years ago

At first I regenerate manually. Once I'm happy with the pipeline, I try to turn the parts of the script that change into functions and parameters so I can run the code from the command line. This last step doesn't always happen (even though it should).

selik commented 10 years ago

@gvwilson I make extensive use of argparse to document usage. Like @jkitzes, I store intermediate steps as files and could make use of Makefiles in principle. I rely on memory to track what needs to be re-run.
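For example, a header along these lines, so that --help doubles as the documentation (the options shown are placeholders, not from a real script):

# analyze.py -- sketch of an argparse header that documents its own usage
import argparse

parser = argparse.ArgumentParser(description="Fit the model and write summary statistics.")
parser.add_argument("infile", help="CSV produced by the wrangling step")
parser.add_argument("-o", "--outfile", default="results.csv",
                    help="where to write the fitted results")
parser.add_argument("--alpha", type=float, default=0.05,
                    help="significance threshold (illustrative parameter)")
args = parser.parse_args()
# ... load args.infile, run the analysis, write args.outfile ...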

sveme commented 10 years ago

@gvwilson No, bioinformatics data often come in files that I put into one folder with a certain naming scheme. I then run the whole analysis, or parts of it, on the whole folder or on files with certain tags. I have written some functions to parse the (file) output from the commands (such as picard, samtools) into Julia and then create plots or tables. I also started to write a small Julia package, Report.jl, that generates Markdown documents with tables and figures within my workflow scripts. In the end I run pandoc to create a nicer-looking PDF or ODT file.

All in all, quite similar to @jkitzes workflow.

DamienIrving commented 10 years ago

@gvwilson My data analysis involves a mix of python scripts (that parse the command line using argparse) and numerous command line utilities that manipulate netCDF files, most of which are specific to the weather/climate sciences. To stitch them all together I use Make. In fact, I've found your seed teaching material on Make to be very useful.

karinlag commented 10 years ago

@gvwilson I usually end up making python functions for each step in my analysis, and end up with a wrapper script that runs the whole shebang. I can then, depending on the complexity of things, add sanity checking for each step, so that it doesn't just run ahead with bad results. While developing, I usually work with a small test set to help with debugging. If there are important parameters that could/should change, I usually end up adding a config file too.
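Schematically the wrapper ends up looking something like this sketch (the step functions, output files, and config file are all invented names):

# run_pipeline.py -- sketch of a wrapper that sanity-checks each step before moving on
import configparser
import sys
from pathlib import Path

from mypipeline import clean_data, align_reads, summarise   # hypothetical analysis module

config = configparser.ConfigParser()
config.read("pipeline.cfg")                                  # parameters that may change

steps = [
    (clean_data, "cleaned.tsv"),
    (align_reads, "aligned.bam"),
    (summarise, "summary.txt"),
]

for step, expected_output in steps:
    step(config)
    if not Path(expected_output).exists():                   # don't run ahead with bad results
        sys.exit("step {} failed to produce {}".format(step.__name__, expected_output))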

TomRoche commented 10 years ago

@DamienIrving "numerous command line utilities that manipulate netCDF files"

Try NCL: basically a "little language" for netCDF (plus some dubious graphical utilities). Friendly syntax, plus (importantly, IMHO) a REPL (unfortunately not as good as Python's or R's).

For too long I wrangled netCDF using NCO and various IOAPI CLUs, then switched to David Pierce's R packages (notably ncdf4, which I still use whenever getting in and out of R is more pain than gain). But for straight-up netCDF twiddling (put this in, take this out, simple math), NCL rules. Plus it has a Python API.

dpshelio commented 10 years ago

In astrophysics we have a lot of tools that communicate with each other through the SAMP protocol. Through this protocol one can transfer tables directly from a website (like the VizieR catalogue service) to a table visualiser (e.g. TOPCAT) with just one click. After some merging/filtering there with other tables, I can transfer the result to Python (with astropy). The same workflow can be done with images by connecting different imaging software (e.g. Aladin).

Also, there's a plugin for Taverna (a workflow management system) that connects through SAMP to all these tools.

The issue/flaw I see in this is the difficulty of reproducing the whole process. I find it awesome for testing things quickly, because everything is connected in a seamless way. However, if I had to save it all to publish my results or to repeat it later, I would probably write it all in a Python script (I identify with this blog post).
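For the record, scripting the SAMP side from Python is not much code. A rough sketch with astropy (assuming a hub such as TOPCAT is already running; the module path may differ in older astropy versions, and the table path is made up):

from astropy.samp import SAMPIntegratedClient

client = SAMPIntegratedClient()
client.connect()                      # register with the running SAMP hub

# broadcast a VOTable so TOPCAT, Aladin, etc. can load it
client.notify_all({
    "samp.mtype": "table.load.votable",
    "samp.params": {"url": "file:///tmp/catalogue.vot",   # hypothetical local table
                    "name": "my catalogue"},
})

client.disconnect()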

mkcor commented 10 years ago

At work we use cron. My intermediate outputs are CSV files so they can be inputs to programs in either language (R or Python, in my case).
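A sketch of one such stage, with CSV as the language-neutral hand-off and a cron entry to schedule it (the column names and paths are invented):

# summarise.py -- run nightly by a crontab entry such as:  0 2 * * * python /path/to/summarise.py
import pandas as pd

df = pd.read_csv("raw_measurements.csv")            # written earlier by an R or Python job
summary = df.groupby("site", as_index=False)["value"].mean()
summary.to_csv("site_means.csv", index=False)       # the next program, in either language, reads this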

synapticarbors commented 10 years ago

A website that I've enjoyed over the years is http://usesthis.com/. It's composed of short interviews built around four basic questions posed to people in various fields. I've found it to be a great resource for finding out about software and services, and there are already a number of spin-offs: http://usesthis.com/community/

Maybe it would be interesting to make a variation for scientists, host it off of the SWC website, formalize something like what is being posted here, and frame it as a way for scientists in a range of disciplines to learn about the tools that people are using that promote productivity and reproducibility.

drio commented 10 years ago

@synapticarbors I have built my own version. The engine is Python-based instead of Ruby, and the look is also substantially different from the usesthis version.

I am trying to make it oriented towards scientists and data scientists (and developers too). I am waiting to collect enough interviews to build up a buffer so I can release one every week. With the amazing pool of interesting people at Software Carpentry, I should be able to get enough interviews to start releasing fairly quickly.

Let me know what you think.

joonro commented 10 years ago

I do everything in Python. Using one language for everything streamlines the transition between the various components of research (if you use a different language for each task, you have to constantly change your mindset, and data transfer is a pain as well).

I usually have scripts for processing the raw data and generating the dataset (in HDF5 format) for analysis, and then another script for reading the dataset and actually running the analyses on it.

For reproducibility, it is crucial that these scripts take command line switches for the different options, instead of requiring edits to the actual source code for each variation. I use docopt for that.
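A stripped-down header for such a script might look like this (it mirrors the --option1/--option2 switches used in the notebook cells below, but the rest is schematic):

"""Generate the analysis dataset.

Usage:
    data_generation.py [--option1=<n>] [--option2=<n>]

Options:
    --option1=<n>  First tuning parameter [default: 1].
    --option2=<n>  Second tuning parameter [default: 2].
"""
from docopt import docopt

if __name__ == "__main__":
    args = docopt(__doc__)
    option1 = int(args["--option1"])
    option2 = int(args["--option2"])
    # ... load the raw data and build the HDF5 dataset using these options ...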

Finally, I put actual commands in an IPython notebook. In the notebook, I have sections for data and analysis, and each cell has a command to run a script. For example:

Datawork

%run data_generation.py --option1=1 --option2=2

Analysis

%run do_analysis.py --option1=1 --option2=2

The IPython notebook is very convenient since it stores the command line invocations you used to generate data and results, so you can remind yourself of the workflow later.

lexnederbragt commented 10 years ago

Funny you should ask, as we're now starting to look into this ourselves. We are at the beginning of a large project: multiple datasets generated over time, some samples yielding data files multiple times, different naming schemes from the data providers, the same analysis run on each data file, collective analyses according to sample, sample origin, or other combinations of samples, and the need to redo parts of the analysis or add analyses or new data. I just learned how to use make (not easy), and came across ruffus, which I like as it is written in Python, with make-like capabilities. Others point me to Snakemake, or ... or ...
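For instance, a two-step ruffus pipeline looks roughly like this sketch (the file patterns and the mapper/variant_caller commands are invented; ruffus only reruns steps whose inputs have changed):

# pipeline.py -- sketch of a two-step ruffus pipeline
import subprocess
from ruffus import transform, suffix, pipeline_run

@transform("data/*.fastq", suffix(".fastq"), ".bam")
def map_reads(input_file, output_file):
    subprocess.check_call(["mapper", input_file, "-o", output_file])          # hypothetical tool

@transform(map_reads, suffix(".bam"), ".vcf")
def call_variants(input_file, output_file):
    subprocess.check_call(["variant_caller", input_file, "-o", output_file])  # hypothetical tool

pipeline_run([call_variants])    # ruffus works out which steps are out of date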

iglpdc commented 10 years ago

I'm very interested in this. I found that lacking a good workflow to run simulations and analyze results was my main source of frustration as a scientist. Also, as happens with all of this, the root of all evil was the lack of proper instruction in basic software skills. At least for most of the people in my field, there is huge room for improvement in efficiency if they adopt a good (or, at least, "one") automated workflow. I've also found that most tools are domain-specific or make assumptions that are incompatible with your needs. So I set up my own thing, and it more or less worked.

I use make to compile my C++ code, and Python and shell scripts to create input files and analyze the results. After many years of struggle, I managed to automate most of the process, so that from my workstation or laptop I could create the parameter files, submit the jobs, and build an automatic directory structure to keep the results organized (based on this paper). Basically, I removed the possibility of making decisions about how to name the files, where to store them, etc. My setup included a hook that auto-committed the results to an svn repo living on each cluster (that was in the old days when I didn't know about git), and an sqlite database to keep track of the results.

It had many pitfalls, but it was a huge improvement over my previous workflow of "just type everything in the command line each time", which, sadly, is what most people around me did and still do.
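The core of that setup is small; something like this sketch (the parameters are invented) captures the idea of one auto-named results directory per run plus an sqlite record:

# new_run.py -- sketch: auto-named results directory plus a run record in sqlite
import json
import sqlite3
from datetime import datetime
from pathlib import Path

params = {"coupling": 0.5, "lattice_size": 32}            # hypothetical parameters

run_id = datetime.now().strftime("%Y%m%d-%H%M%S")
run_dir = Path("results") / run_id
run_dir.mkdir(parents=True)
(run_dir / "params.json").write_text(json.dumps(params, indent=2))

with sqlite3.connect("runs.db") as db:
    db.execute("CREATE TABLE IF NOT EXISTS runs (id TEXT, params TEXT)")
    db.execute("INSERT INTO runs VALUES (?, ?)", (run_id, json.dumps(params)))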

Some things I would like to improve are:

joonro commented 10 years ago

I attended a talk about Sumatra at SciPy 2013 and it seems it is a tool aiming for exactly this:

Sumatra: automated tracking of scientific computations

Sumatra is a tool for managing and tracking projects based on numerical simulation and/or analysis, with the aim of supporting reproducible research. It can be thought of as an automated electronic lab notebook for computational projects.

mbjones commented 10 years ago

Here are some of the analysis tools and systems in use in the ecological and environmental science community:

For analysis, common tools include:

  1. Excel (ugh, but yes)
  2. JMP
  3. R, Matlab, SAS, IDL, Mathematica, for general stats and models
  4. ArcGIS, rgeos, GDAL, GRASS, QGIS for GIS analysis
  5. Specialty stats packages such as Primer, MetaWin, HydroDesktop, etc.

A small portion of the community has begun to think about reproducibility and scriptability, provenance, and re-execution. That small segment uses various workflow tools, such as:

  1. R, SAS, Matlab, as scripted analysis environments
  2. Kepler, Taverna, VisTrails, and other dedicated workflow systems
  3. Bash scripts
  4. Python, perl, and other scripting languages
  5. Pegasus, Condor, and related batch computing workflow systems
  6. Make (but far less common, except in its traditional use in building code for models)

There are a variety of publications on these issues and usages, especially in the scientific workflow community. The Taylor et al. (2007) book is a nice overview of a number of systems, many of which like Kepler, Taverna, and VisTrails have persisted over time (Workflows for e-Science: Scientific Workflows for Grids; http://www.springer.com/computer/communication+networks/book/978-1-84628-519-6). The dedicated scientific workflow community has thought through and implemented many advanced features (such as model versions, provenance tracking, data derivation, model abstraction). For example, Kepler and VisTrails support provenance tracking, keep track of model versions as users change them, and allow users to archive specific versions of model runs along with full provenance traces. Current work is on a shared, interoperable provenance model for scientific workflows that derives from PROV. There is an extensive literature on these systems, partly arising from annual workshops such as IPAW.

Is this ticket meant to be the start of a comprehensive survey? I'm curious what the intent is. I suspect that GitHub/Software Carpentry users do not represent a random sample of the science community, and it could be argued that they are particularly unrepresentative of the users in many scientific disciplines. So I would be cautious not to use this data as a survey of the relative usage of various approaches in any particular discipline. But as a (biased, partial) list of tools in use in various communities, it would be useful to know what is out there. There have been other, more comprehensive tool surveys (see, e.g., the EBM Tools database and the DataONE software tools catalog). Hope this is helpful.

rbeagrie commented 10 years ago

I'm a little late to the party here, but I'll add my two cents anyway.

Like @selik, @jkitzes, and others, I prototype and do exploratory analyses in an IPython notebook. If the analysis is something I want to reuse, I'll then refactor it into a standalone Python script, and I always make heavy use of argparse to try to make these scripts self-documenting. I used to use Make for tying these pipelines together, but I didn't like the fact that I couldn't make the Makefiles self-documenting in terms of how and why to specify the different parameters. For this reason, I switched to a Python build tool called doit, as it allows me to use argparse to document the parameters for the whole pipeline.

There's an example of one of my doit pipelines for handling sequencing data at:

https://github.com/rbeagrie/cookiecutter-tophat2mapping/blob/master/%7B%7Bcookiecutter.repo_name%7D%7D/make_bigwig.py

My impression is that the doit syntax, while much more verbose than make, is also easier to read. That's just my impression though, and I'd love to hear how easy or hard others find it to figure out what the above file is doing without knowing the doit library in advance.

Anyway, for anyone who is really interested, I've made some lessons for teaching doit to learners in #419 which also has a direct comparison of doit and Make.
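For anyone who hasn't seen doit before, a minimal dodo.py has roughly this shape (the file names and the trim/map commands are invented, not taken from the pipeline linked above); running doit then reruns only the tasks whose dependencies have changed:

# dodo.py -- sketch of two doit tasks with file dependencies and targets
def task_trim_reads():
    return {
        "file_dep": ["sample.fastq"],
        "targets": ["sample.trimmed.fastq"],
        "actions": ["trim_tool sample.fastq > sample.trimmed.fastq"],   # hypothetical command
    }

def task_map_reads():
    return {
        "file_dep": ["sample.trimmed.fastq"],
        "targets": ["sample.bam"],
        "actions": ["mapper sample.trimmed.fastq sample.bam"],          # hypothetical command
    }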

joschkazj commented 9 years ago

Based on the features presented here (simple Python script, MD5 change detection), doit seems like a perfect fit for my Python-based workflows, so I was eager to try it after reading @rbeagrie's lesson. Until now I have been using waf, based on the project template by Hans-Martin von Gaudecker (GitHub repository), which I modified for my needs.

After converting a recent project workflow from waf to doit all seemed well, until I tried the parallel execution option, which fails on Windows (pydoit/doit#50). As I have to work on Windows and rely on parallel execution for things like parameter optimization and classifier comparisons, I am stuck with waf.