taborlab / FlowCal

Python Flow Cytometry Calibration Library
MIT License

Reorganization of Excel-related code #48

Closed by castillohair 8 years ago

castillohair commented 9 years ago

Right now, the fc library itself only contains basic functions for reading and writing Excel files, whereas most of the interpretation and actual work is done in run_excel.py. From a coding perspective, this is a little awkward: the small Excel part of the library exists only to support run_excel.py, but is otherwise disjoint from the rest of the library.

There are two options to improve this:

I prefer the second option because (1) one of the goals of the package is, in principle, to be usable without coding, and (2) it allows unit testing. However, we would have to decide what goes in and what doesn't. I'll think about this and probably handle it myself in the near future, but I welcome suggestions.

JS3xton commented 9 years ago

I agree with option 2; I like bringing it inside the library. I'm also not sure how best to do it, but in general I like that approach better.

On a kind of related note:

Your three example Python scripts concretely implement a concept I was throwing around called a "workflow". I was entertaining the idea of having different "workflows" (basically different Python scripts that analyze data in slightly different ways: different cytometers, 1D versus 3D clustering, median versus mode, etc.) and somehow having the Excel UI be compatible with them, so any workflow could be populated by the Excel UI. This seemed like it might be overkill, though, and I couldn't come up with a simple enough workflow implementation that didn't involve stacking on a bunch of extra constraints/restrictions that people wouldn't care enough about to figure out.

castillohair commented 8 years ago

The "workflow" idea is interesting, and has actually been suggested by other people as well (with slightly different details). However, I think that, for now, the only "workflow" that would be used by more than one person is the one implemented in run_excel.py. If we see that there are other common use cases, we should think more about this. Otherwise, I will put this idea aside for now, especially given that it may not be trivial to implement.

castillohair commented 8 years ago

I've come up with a general idea of what the Excel interface should look like. There will be a new module excel_ui, to replace the current excel_io, which will contain:

An additional improvement would be an `if __name__ == '__main__'` entry point that calls the workflow function, which would make it possible to run the Excel interface with a single line (if I understand correctly what's described at https://docs.python.org/2/using/cmdline.html):

python -m fc.excel_ui

This line could be placed in a batch file (.bat on Windows, .sh on Linux) and run by double-clicking from anywhere.
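A minimal sketch of what such an entry point could look like; this is only the mechanism, not the actual implementation, and the body of run() is omitted (run() itself is mentioned later in this thread):

```python
# excel_ui.py (sketch): ``python -m fc.excel_ui`` runs this module as
# __main__, which simply calls the top-level workflow function.

def run():
    """Read the input workbook, process beads and samples, write results."""
    pass  # workflow implementation goes here

if __name__ == '__main__':
    run()
```

The batch file then only needs to contain the single `python -m fc.excel_ui` line.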

JS3xton commented 8 years ago

Originally I was unsure whether this functionality should exist in its own module or under the io module. I agree with what you have proposed here: a separate module for all Excel-related functionality (renaming it helps reinforce the distinction).

I like the idea of being able to add new workflows (however they end up manifesting themselves).

Related to having a __main__ function, you could also consider making it a standalone script that can be run from the terminal in a UNIX-style fashion (pass all inputs to the high-level function via command-line arguments). I've set up some of my older tools this way (e.g. primer finder, target sequence design tool, neural network promoter predictor wrapper). The argparse module helps a lot with parsing input flags from the command line.
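For illustration, here is a minimal argparse sketch of that kind of entry point; the flag names and the commented-out call into the Excel UI are assumptions, not the actual interface:

```python
# UNIX-style entry point sketch using argparse. Flag names are illustrative.
import argparse

def main():
    parser = argparse.ArgumentParser(
        description='Process flow cytometry data listed in an Excel input file.')
    parser.add_argument('-i', '--input', required=True,
                        help='path to the input Excel file')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='print progress messages to the terminal')
    args = parser.parse_args()
    # Hypothetical call into the Excel UI workflow:
    # fc.excel_ui.run(args.input, verbose=args.verbose)
    print('Would process {0}'.format(args.input))

if __name__ == '__main__':
    main()
```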

castillohair commented 8 years ago

I made a new branch excel-ui, and added an example input Excel file on commit ce3ffa91c26d38218782d11d70fc9579056ff980, file test/test_excel_ui.xlsx. Can I get your input on this file? @JS3xton @BrianLandry @thoreusc.

Some comments:

I'll let you guys figure out the rest, in part because how the spreadsheet works should be self-evident if the design is good.

JS3xton commented 8 years ago

Thoughts:

BrianLandry commented 8 years ago

In general all of the changes are good.

Changed "FL1 Transform" to "FL1 Units" in the Samples sheet. I think this is less confusing to the user (you use "Arbitrary" to get arbitrary units instead of "Exponential"). In addition, this allows for some automation: using "Arbitrary" would perform the exponential transformation only if the file has been acquired with a log amplifier. I think having the user only care about the resulting units is simpler.

I like this; it needs a decent explanation somewhere, but in general it is good. I do agree with John's objection to "Arbitrary" units; at first I actually thought this meant the channel units. It seems you already touched upon the solution: just call it "Linear" if you are going to force them to be linear.
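To make the automation described above concrete, here is a sketch of the kind of dispatch a "Units" column implies; the amplifier check and the exponential transform are passed in as stand-ins because the actual API for those steps isn't pinned down in this thread:

```python
# Sketch only: how an "FL1 Units" value of "Arbitrary"/"Linear" could decide
# whether to undo a log amplifier. ``amplifier_is_log`` and ``exponentiate``
# are hypothetical stand-ins for whatever the library actually provides.
def transform_for_units(data, channel, units, amplifier_is_log, exponentiate):
    if units in ('Arbitrary', 'Linear'):
        # Undo the log amplifier only if one was used during acquisition;
        # data acquired with a linear amplifier is returned unchanged.
        if amplifier_is_log(data, channel):
            return exponentiate(data, channel)
        return data
    raise ValueError('Unknown units: {0}'.format(units))
```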

> • Does it need to be an Excel spreadsheet? Or can it just be a CSV? CSV seems more general to me if we're not using any formulas.
>   • Not sure if this makes spreadsheet import/export more or less difficult. Maybe get rid of the xlrd and openpyxl package dependencies? I would recommend checking out pandas, as it has pretty good CSV import functionality; it just tends to work. The user wouldn't need to know or use pandas either (just as they don't have to know xlrd or openpyxl).
>   • I guess you need Excel to have tabs? Would need 3 separate CSV files to replace 1 Excel file, which may not be desirable. More UNIX-friendly, perhaps, but generally less user-friendly.

Tabs are the main reason an Excel document is used. They both simplify storing the information pertaining to the experiment and are more intuitive for users; some people might not know how to work with CSVs, not to mention that Excel is a pain when saving them.

New "Instruments" sheet, which implements ideas from #47. This allows for files from different flow cytometers to be used in the same spreadsheet.

How will they know the correct parameters to use with their instrument, considering that users of the Excel interface are not programmers?

JS3xton commented 8 years ago

Correct parameter names have been a problem. Short of actually seeing the innards of the FCS file, I'm not aware of any way of knowing what the acquisition software decided to name all of the parameters. We could expose some functionality that lists the channel names for a given input file (either in an Excel spreadsheet or in the terminal; I favor the terminal if we're using it anyway to print status messages), from which the user could populate the Instruments tab.

The alternative is to have FlowCal try to guess what the acquisition software named the parameters, but @castillohair and I think that's tricky given all the parameter name variants we've seen. I favor offloading that to the user rather than running the risk of guessing the wrong channel (FSC-H? FSC-A? etc.).
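As a rough illustration of the "list the channel names" idea, something like this could be run from the terminal; it assumes the file can be loaded into an FCSData object and that the object exposes its channel names (the `channels` attribute name here is an assumption, not a confirmed part of the API):

```python
# Print the channel names stored in an FCS file so the user can copy them
# into the Instruments tab. The ``channels`` attribute is assumed.
import sys
import fc.io

def print_channels(path):
    data = fc.io.FCSData(path)
    print('Channels in {0}:'.format(path))
    for name in data.channels:
        print('  {0}'.format(name))

if __name__ == '__main__':
    print_channels(sys.argv[1])
```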

castillohair commented 8 years ago

Thanks for your comments, guys. I have a few things to add.

So far I'm only making one change, which is moving the time channel name to the end in the "Instruments" tab. I'll commit this change and work on code based on this. If someone feels strongly about changing something else, let me know.

castillohair commented 8 years ago

Forgot one thing: I have mixed feelings about changing the name "Samples" to "Analyses". It is more technically correct, and it actually helps convey the idea that several analyses can be run on the same file. On the other hand, I think it makes the interface less intuitive. Most of the time, people want to perform only one analysis on one sample file; analyzing a file several times is an edge case for now, even though it will be supported.

I'm not entirely convinced that this approach is terribly user-unfriendly, though. I'd like to hear some more people's opinions on this.

castillohair commented 8 years ago

fd9c9edff7841b65d7206e7e0d18365e3e3598b3 implements solutions for this issue, as well as for #47 and #50. The new module excel_ui is organized in such a way that I think it would be very easy to implement #38 and #44 without making the code messy. In addition, batch scripts were added to install and to run the interface by double-clicking. Not exactly a solution to #58, but close.

The modularity of excel_ui allows for cool new things. One is that the Excel interface can be opened with a single terminal command from anywhere (as long as Python is on the PATH). This is what allowed me to write batch scripts that can be run without being in the same folder as fc.

Also, the individual parts of the workflow are coded as separate functions (i.e. a function to make standard curves from the Beads sheet, another to process/gate/transform samples, another to calculate statistics, etc.). So you can write a script that reads samples from the Excel file and processes them the same way as the official Excel interface, but returns the processed, transformed samples as FCSData objects so you can do whatever you like with them in Python. This can be done with very few lines of code. Look at excel_ui.run() to see what I mean.
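To sketch what such a script could look like: apart from run() and read_workbook (mentioned in this thread), the per-sheet function names and the shape of the returned tables below are assumptions, not the actual names in the excel-ui branch:

```python
# Illustrative only: reuse the individual workflow functions instead of run().
# ``process_beads_table`` and ``process_samples_table`` are hypothetical
# names; the real ones live in the excel-ui branch.
import fc.excel_ui as excel_ui

tables = excel_ui.read_workbook('experiment.xlsx')   # one table per sheet (assumed)
std_curves = excel_ui.process_beads_table(tables['Beads'], tables['Instruments'])
samples = excel_ui.process_samples_table(tables['Samples'], tables['Instruments'],
                                         std_curves)
# ``samples`` would be the processed/transformed FCSData objects, ready for
# further analysis in Python rather than being written back to Excel.
```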

Finally, the tables read from the Excel file are converted to and from a format-neutral table object, and most of the functions take these objects as arguments. This makes most of the workflow independent of the fact that we're dealing with Excel files. I think this is a good first step toward implementing workflows. In fact, I imagine a workflow module that would contain most of the functions currently in excel_ui; excel_ui itself would only have read_workbook, write_workbook, and run, and would call workflow for everything else. Another module, gui, would also use workflow. And somebody could write a csv_ui module similar to excel_ui with very little effort. This is waaaay into the future, by the way.
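Purely as an illustration of that layering (every name here is hypothetical except read_workbook, write_workbook, and run), a csv_ui built on the same workflow functions might look roughly like this:

```python
# Hypothetical csv_ui module: same workflow, different on-disk format.
# ``workflow.process`` is a stand-in for the factored-out workflow functions.
import pandas as pd
# import fc.workflow as workflow   # hypothetical future module

def read_workbook(paths):
    """Read one CSV per sheet ('Instruments', 'Beads', 'Samples') into the
    same format-neutral tables that excel_ui.read_workbook would produce."""
    return {name: pd.read_csv(path) for name, path in paths.items()}

def run(paths):
    tables = read_workbook(paths)
    # workflow.process(tables)     # identical processing to excel_ui.run()
```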

Anyway, please look at it (branch excel-ui), try it, and let me know what you think. I want to test it a little more before making a pull request.

castillohair commented 8 years ago

#107 solved this issue.