taborlab / FlowCal

Python Flow Cytometry Calibration Library
MIT License

Reorganization of Excel-related code #48

Closed by castillohair 8 years ago

castillohair commented 9 years ago

Right now, the fc library itself only contains basic functions for reading and writing Excel files, whereas most of the interpretation and actual work is done in run_excel.py. From a coding perspective, this is a little awkward: the small Excel part of the library exists only to support run_excel.py, but is otherwise disjoint from the rest of the library.

There are two options to improve this:

I prefer the second option because (1) one of the goals of the package is, in principle, to be usable without coding, and (2) it allows unit testing. However, we would have to decide what goes in and what doesn't. I'll think about this and probably handle it myself in the near future, but I welcome suggestions.

JS3xton commented 9 years ago

I agree with option 2; I like bringing it inside the library. I'm also not sure how best to do it, but in general I like that approach better.

On a kind of related note:

Your three example Python scripts concretely implement a concept I was throwing around called a "workflow". I was entertaining the idea of having different "workflows" (basically different Python scripts that analyze data in slightly different ways: different cytometers, 1D versus 3D clustering, median versus mode, etc.) and somehow having the Excel UI be compatible with them, so any workflow could be populated by the Excel UI. This seemed like it might be overkill, though, and I couldn't come up with a simple enough workflow implementation that didn't involve stacking on a bunch of extra constraints/restrictions that people wouldn't care enough about to figure out.

castillohair commented 8 years ago

The "workflow" idea is interesting, and has actually been suggested by other people as well (with slightly different details). However, I think that, for now, the only "workflow" that would be used by more than one person is the one implemented in run_excel.py. If we see that there are other common use cases, we should think more about this. Otherwise, I will put this idea aside for now, especially given that it may not be trivial to implement.

castillohair commented 8 years ago

I've come up with a general idea of what the Excel interface should look like. There will be a new module excel_ui, to replace the current excel_io, which will contain:

An additional improvement would be an `if __name__ == '__main__'` entry point that calls the workflow function, which would make it possible to run the Excel interface with a single line (if I understand correctly what's described at https://docs.python.org/2/using/cmdline.html):

python -m fc.excel_ui

This line could be placed in a batch file (.bat on Windows, .sh on Linux) and run by double-clicking from anywhere.
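A minimal sketch of what such an entry point could look like; this is only the mechanism, not the actual implementation, and the body of run() is omitted (run() itself is mentioned later in this thread):

```python
# excel_ui.py (sketch): ``python -m fc.excel_ui`` runs this module as
# __main__, which simply calls the top-level workflow function.

def run():
    """Read the input workbook, process beads and samples, write results."""
    pass  # workflow implementation goes here

if __name__ == '__main__':
    run()
```

The batch file then only needs to contain the single `python -m fc.excel_ui` line.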

JS3xton commented 8 years ago

Originally I was unsure whether this functionality should exist in its own module or under the io module. I agree with what you have proposed here: a separate module for all Excel-related functionality (renaming it helps reinforce the distinction).

I like the idea of being able to add new workflows (however they end up manifesting themselves).

Related to having a __main__ function, you could also consider making it a standalone script that can be run from the terminal in a UNIX-style fashion (pass all inputs to the high-level function via command-line arguments). I've set up some of my older tools this way (e.g. primer finder, target sequence design tool, neural network promoter predictor wrapper). The argparse module helps a lot with parsing input flags from the command line.
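For illustration, here is a minimal argparse sketch of that kind of entry point; the flag names and the commented-out call into the Excel UI are assumptions, not the actual interface:

```python
# UNIX-style entry point sketch using argparse. Flag names are illustrative.
import argparse

def main():
    parser = argparse.ArgumentParser(
        description='Process flow cytometry data listed in an Excel input file.')
    parser.add_argument('-i', '--input', required=True,
                        help='path to the input Excel file')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='print progress messages to the terminal')
    args = parser.parse_args()
    # Hypothetical call into the Excel UI workflow:
    # fc.excel_ui.run(args.input, verbose=args.verbose)
    print('Would process {0}'.format(args.input))

if __name__ == '__main__':
    main()
```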

castillohair commented 8 years ago

I made a new branch excel-ui, and added an example input Excel file on commit ce3ffa91c26d38218782d11d70fc9579056ff980, file test/test_excel_ui.xlsx. Can I get your input on this file? @JS3xton @BrianLandry @thoreusc.

Some comments:

I'll let you guys figure out the rest, in part because how the spreadsheet works should be self-evident if the design is good.

JS3xton commented 8 years ago

Thoughts:

BrianLandry commented 8 years ago

In general all of the changes are good.

Changed "FL1 Transform" to "FL1 Units" in the Samples sheet. I think this is less confusing to the user (you use "Arbitrary" to get arbitrary units instead of "Exponential"). In addition, this allows for some automation: using "Arbitrary" would perform the exponential transformation only if the file has been acquired with a log amplifier. I think having the user only care about the resulting units is simpler.

I like this; it needs a decent explanation somewhere, but in general it is good. I do agree with John's objection to "Arbitrary" units; at first I actually thought this meant the channel units. It seems you already touched upon the solution: just call it "Linear" if you are going to force them to be linear.
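To make the automation described above concrete, here is a sketch of the kind of dispatch a "Units" column implies; the amplifier check and the exponential transform are passed in as stand-ins because the actual API for those steps isn't pinned down in this thread:

```python
# Sketch only: how an "FL1 Units" value of "Arbitrary"/"Linear" could decide
# whether to undo a log amplifier. ``amplifier_is_log`` and ``exponentiate``
# are hypothetical stand-ins for whatever the library actually provides.
def transform_for_units(data, channel, units, amplifier_is_log, exponentiate):
    if units in ('Arbitrary', 'Linear'):
        # Undo the log amplifier only if one was used during acquisition;
        # data acquired with a linear amplifier is returned unchanged.
        if amplifier_is_log(data, channel):
            return exponentiate(data, channel)
        return data
    raise ValueError('Unknown units: {0}'.format(units))
```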

> • Does it need to be an Excel spreadsheet? Or can it just be a CSV? CSV seems more general to me if we're not using any formulas.
>   • Not sure if this makes spreadsheet import/export more or less difficult. Maybe get rid of the xlrd and openpyxl package dependencies? I would recommend checking out pandas, as it has pretty good CSV import functionality; it just tends to work. The user wouldn't need to know or use pandas either (just as they don't have to know xlrd or openpyxl).
>   • I guess you need Excel to have tabs? Would need 3 separate CSV files to replace 1 Excel file, which may not be desirable. More UNIX-friendly, perhaps, but generally less user-friendly.

Tabs are the main reason an Excel document is used. They both simplify storing the information pertaining to the experiment and are more intuitive for users; some people might not know how to work with CSVs, not to mention that Excel is a pain when saving them.

New "Instruments" sheet, which implements ideas from #47. This allows for files from different flow cytometers to be used in the same spreadsheet.

How will they know the correct parameters to use with their instrument, considering that users of the Excel interface are not programmers?

JS3xton commented 8 years ago

Correct parameter names have been a problem. Short of actually seeing the innards of the FCS file, I'm not aware of any way of knowing what the acquisition software decided to name all of the parameters. We could expose some functionality that lists the channel names for a given input file (either in an Excel spreadsheet or in the terminal; I favor the terminal if we're using it anyway to print status messages), from which the user could populate the Instruments tab.

The alternative is to have FlowCal try to guess what the acquisition software named the parameters, but @castillohair and I think that's tricky given all the parameter name variants we've seen. I favor offloading that to the user rather than running the risk of guessing the wrong channel (FSC-H? FSC-A? etc.).
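As a rough illustration of the "list the channel names" idea, something like this could be run from the terminal; it assumes the file can be loaded into an FCSData object and that the object exposes its channel names (the `channels` attribute name here is an assumption, not a confirmed part of the API):

```python
# Print the channel names stored in an FCS file so the user can copy them
# into the Instruments tab. The ``channels`` attribute is assumed.
import sys
import fc.io

def print_channels(path):
    data = fc.io.FCSData(path)
    print('Channels in {0}:'.format(path))
    for name in data.channels:
        print('  {0}'.format(name))

if __name__ == '__main__':
    print_channels(sys.argv[1])
```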

castillohair commented 8 years ago

Thanks for your comments, guys. I have a few things to add.

So far I'm only making one change, which is moving the time channel name to the end in the "Instruments" tab. I'll commit this change and work on code based on this. If someone feels strongly about changing something else, let me know.

castillohair commented 8 years ago

Forgot one thing: I have mixed feelings about changing the name "Samples" to "Analyses". It is more technically correct, and it actually helps convey the idea that several analyses can be run on the same file. On the other hand, I think it makes the interface less intuitive. Most of the time, people want to perform only one analysis on one sample file; analyzing a file several times is an edge case for now, even though it will be supported.

I'm not entirely convinced that this approach is terribly user-unfriendly, though. I'd like to hear some more people's opinions on this.

castillohair commented 8 years ago

fd9c9edff7841b65d7206e7e0d18365e3e3598b3 implements solutions for this issue, as well as for #47 and #50. The new module excel_ui is organized in such a way that I think it would be very easy to implement #38 and #44 without making the code messy. In addition, batch scripts were added to install and to run the interface by double-clicking. Not exactly a solution to #58, but close.

The modularity of excel_ui allows for cool new things. One is that the Excel interface can be opened with a single terminal command from anywhere (as long as Python is on the PATH). This is what allowed me to write batch scripts that can be run without being in the same folder as fc.

Also, the individual parts of the workflow are coded as separate functions (i.e. a function to make standard curves from the Beads sheet, another to process/gate/transform samples, another to calculate statistics, etc.). So you can write a script that reads samples from the Excel file and processes them the same way as the official Excel interface, but returns the processed, transformed samples as FCSData objects so you can do whatever you like with them in Python. This can be done with very few lines of code. Look at excel_ui.run() to see what I mean.
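To sketch what such a script could look like: apart from run() and read_workbook (mentioned in this thread), the per-sheet function names and the shape of the returned tables below are assumptions, not the actual names in the excel-ui branch:

```python
# Illustrative only: reuse the individual workflow functions instead of run().
# ``process_beads_table`` and ``process_samples_table`` are hypothetical
# names; the real ones live in the excel-ui branch.
import fc.excel_ui as excel_ui

tables = excel_ui.read_workbook('experiment.xlsx')   # one table per sheet (assumed)
std_curves = excel_ui.process_beads_table(tables['Beads'], tables['Instruments'])
samples = excel_ui.process_samples_table(tables['Samples'], tables['Instruments'],
                                         std_curves)
# ``samples`` would be the processed/transformed FCSData objects, ready for
# further analysis in Python rather than being written back to Excel.
```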

Finally, the tables read from the Excel file are converted to and from a format-neutral table object, and most of the functions take these objects as arguments. This makes most of the workflow independent of the fact that we're dealing with Excel files. I think this is a good first step toward implementing workflows. In fact, I imagine a workflow module that would contain most of the functions currently in excel_ui; excel_ui itself would only have read_workbook, write_workbook, and run, and would call workflow for everything else. Another module, gui, would also use workflow. And somebody could write a csv_ui module similar to excel_ui with very little effort. This is waaaay into the future, by the way.
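Purely as an illustration of that layering (every name here is hypothetical except read_workbook, write_workbook, and run), a csv_ui built on the same workflow functions might look roughly like this:

```python
# Hypothetical csv_ui module: same workflow, different on-disk format.
# ``workflow.process`` is a stand-in for the factored-out workflow functions.
import pandas as pd
# import fc.workflow as workflow   # hypothetical future module

def read_workbook(paths):
    """Read one CSV per sheet ('Instruments', 'Beads', 'Samples') into the
    same format-neutral tables that excel_ui.read_workbook would produce."""
    return {name: pd.read_csv(path) for name, path in paths.items()}

def run(paths):
    tables = read_workbook(paths)
    # workflow.process(tables)     # identical processing to excel_ui.run()
```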

Anyway, please look at it (branch excel-ui), try it, and let me know what you think. I want to test it a little more before making a pull request.

castillohair commented 8 years ago

#107 solved this issue.