openworm / tracker-commons

Compilation of information and code bases related to open-source trackers for C. elegans

Matlab or Octave should have a full implementation #45

Closed Ichoran closed 7 years ago

Ichoran commented 8 years ago

The Julia implementation is not close enough in syntax to a Matlab or Octave implementation to be of great help to users of Matlab (or Octave). Thus, a Matlab or Octave implementation should be provided.

MichaelCurrie commented 8 years ago

I agree this would be very useful since many labs use Matlab, and so such a version would make it possible for many more labs to be able to generate WCON files from their trackers or use WCON in their analysis.

cheelee commented 8 years ago

Hi guys,

I’ve taken the liberty of implementing a very simple skeletal prototype for WCON access via Octave using the following API as an outline. The repository fork and branch may be found here: https://github.com/cheelee/tracker-commons/tree/octave/src/octave

API: * WCONWorms class object

Of the API calls, the ones that will behave as advertised are - load_from_file, load(string), is_equal(object1,object2). A simple driver code main.m exercises all the API calls (functional or merely a stub).

Please let me know if you’d like to have my fork’s branch pulled into the OpenWorm repository as the “octave” branch for tracker_commons. Thanks!

MichaelCurrie commented 8 years ago

Hi @cheelee, in my opinion this is a wonderful start, and I look forward to checking it out. Creating a branch and then a pull request into master would be the right thing to do as you say.

Yes, a conflict is a data point that overlaps with, but is not identical to, one in the other worm file. So, for instance, two files holding worms with different ids, or the same worm over totally different time points, would merge with no issue, and merging the same object with itself will just yield the same object again.
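A rough sketch of the merge rule described above, in Python. The flat `{(worm_id, t): (x, y)}` layout and the names `merge` and `MergeConflict` are hypothetical stand-ins for illustration, not the actual WCONWorms internals.

```python
# Hypothetical data layout: {(worm_id, t): (x, y)}; not the real WCONWorms structure.

class MergeConflict(ValueError):
    """Raised when two data sets disagree on an overlapping data point."""


def merge(a, b):
    """Merge two data sets, rejecting overlapping points that are not identical."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            raise MergeConflict(f"conflicting values at worm/time {key}")
        merged[key] = value
    return merged


first = {("worm1", 0.0): (1.0, 2.0), ("worm1", 0.1): (1.1, 2.1)}
other_worm = {("worm2", 0.0): (5.0, 6.0)}    # different id: no overlap
later_times = {("worm1", 0.2): (1.2, 2.2)}   # same worm, different time: no overlap
clash = {("worm1", 0.0): (9.0, 9.0)}         # same worm and time, different value

combined = merge(merge(first, other_worm), later_times)
same = merge(first, first)                   # merging with itself changes nothing

try:
    merge(first, clash)
    conflict_detected = False
except MergeConflict:
    conflict_detected = True
```

Under these assumptions only `clash` is rejected; the other merges go through, and merging `first` with itself returns an equal data set.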

.file_equal I'm not sure how that would work compared with .is_equal: if you're actually comparing the text in two files, that's not stored in the WCONWorms object so I'm not sure how you'd accomplish that, even if you wanted to.

Looking forward to checking it out some more!

MichaelCurrie commented 8 years ago

@cheelee you might find this useful:

2e6b8ced232603c6bed09d856d2afa441cae9265

I tried to be more explicit with the formal API that needs to be wrapped. It sounds like you've already got it all anyway, but in case you find it useful, here it is.

Also there was one public-facing change that will likely be approved (from #59): save_to_file now accepts an optional boolean parameter, compress_file, telling it whether to save in compressed form or not.
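For anyone wrapping this, here is a minimal sketch of what an optional `compress_file` flag could look like. It only borrows the names `save_to_file` and `compress_file` from the comment above; writing a zip archive and appending `.zip` to the path are assumptions for illustration, not a description of what the Python implementation actually does.

```python
import json
import os
import tempfile
import zipfile


def save_to_file(doc, path, compress_file=False):
    """Write `doc` as JSON text and return the path actually written.

    Assumption: "compressed" means a zip archive at `path + ".zip"`
    containing a single entry named after the original file.
    """
    text = json.dumps(doc)
    if compress_file:
        zip_path = path + ".zip"
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(os.path.basename(path), text)
        return zip_path
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


doc = {"units": {"t": "s", "x": "mm", "y": "mm"},
       "data": [{"id": "1", "t": 0.0, "x": [1.0], "y": [2.0]}]}

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "worms.wcon")
    plain = save_to_file(doc, target)
    zipped = save_to_file(doc, target, compress_file=True)
    # Read the archive back to check that the round trip preserves the data.
    with zipfile.ZipFile(zipped) as zf:
        restored = json.loads(zf.read("worms.wcon"))
```

Defaulting the flag to `False` keeps existing callers unaffected, which is the usual reason for adding such a parameter as optional.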

Ichoran commented 8 years ago

We haven't actually specified an API, just a set of capabilities. That said, it probably helps if the APIs are not too different from each other, or at least if not all APIs are different from each other. My Julia API isn't much like my Scala API, though. I could make the Julia one more like the Python one.

cheelee commented 8 years ago

Hehe yah, that's my impression that it's gonna take a bit more discussion. I believe @JimHokanson favors the Scala API currently. I have suggested that we should probably come up with a standard interface template, and do an alpha-testing "release" where the science guys could try using them and patch up inadequacies.

The problem with interfaces is that backward compatibility issues tend to bite hard once they go out "into the wild" and real people use them for real stuff. But I'm confident that if we work out the use-cases, we should have a good core that's extensible, along the lines where API 2.0's not gonna end up deprecating much of 1.0.

cheelee commented 8 years ago

Meanwhile it has occurred to me that while I am no C. elegans scientist, I have some interest studying the movement of worms as if they were application code performance traces (they really are amazingly alike, which is what drew my interest in this direction!). So I could really be a guinea pig for writing my own analysis tools for worm movement data using some API template, and then using that experience to make some reasonable input toward a better API others can/would use.

JimHokanson commented 8 years ago

Hi all,

Regarding the different implementations, my preference is to have specific classes representing the contents of the file. The Scala implementation follows this by having data, metadata, and units classes.

Also, I like how the Scala implementation allows a transformation from sparse to a "combined" representation via a simple method call following loading. The Python implementation forces a data frame representation, but perhaps a user wants a numpy matrix. In terms of design, I'm suggesting that you have a relatively minimal loader, followed by simple methods or even input options to transform the data as needed.
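To make the "minimal loader, then transform on request" idea concrete, here is a small Python sketch. The class name `RawWCON`, the method `by_worm`, and the simplified record shape (scalar `t`, list-valued `x` and `y`) are all hypothetical; none of this is taken from the Scala or Python implementations.

```python
import json


class RawWCON:
    """Minimal holder: keeps the parsed units, metadata and data records as-is."""

    def __init__(self, units, data, metadata=None):
        self.units = units
        self.data = data
        self.metadata = metadata or {}

    @classmethod
    def loads(cls, text):
        """Parse WCON-like JSON text without reshaping the data."""
        doc = json.loads(text)
        return cls(doc["units"], doc["data"], doc.get("metadata"))

    def by_worm(self):
        """Optional transform: group records by worm id into per-worm series."""
        worms = {}
        for record in self.data:
            worm = worms.setdefault(record["id"], {"t": [], "x": [], "y": []})
            worm["t"].append(record["t"])
            worm["x"].append(record["x"])
            worm["y"].append(record["y"])
        return worms


sample = json.dumps({
    "units": {"t": "s", "x": "mm", "y": "mm"},
    "data": [
        {"id": "1", "t": 0.0, "x": [1.0, 2.0], "y": [3.0, 4.0]},
        {"id": "1", "t": 0.1, "x": [1.1, 2.1], "y": [3.1, 4.1]},
        {"id": "2", "t": 0.0, "x": [5.0], "y": [6.0]},
    ],
})

raw = RawWCON.loads(sample)   # cheap: nothing is reshaped at load time
grouped = raw.by_worm()       # paid for only by callers who want this view
```

A data frame or matrix view could be added as another method in the same way, so the loader itself never forces one representation on the caller.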

I'm not entirely sure of the value of trying to specify an API at this point. It may be a good idea but I'm just not sure of the priority it should have. Consistency of the file format on the other hand seems very important.

On a different note, I started hacking around with Rex's file in Matlab. The first challenge was to look at JSON parsers. Pure Matlab implementations are basically useless. There is a mex implementation which is ok, although I was running into some speed issues with how JSON arrays are represented by the parser (i.e. each array element is not necessarily of the same data type, so they're not read by a general parser as an array of numbers, but rather a cell array - i.e. a set of pointers to individual objects). If I had to choose today I'd probably go with wrapping the Scala implementation. My initial testing with this approach went well, although it required increasing my Java Heap Space (which is easy but annoying to do in Matlab).

Ichoran commented 8 years ago

I haven't written the Scala class hierarchy with any thought to how it might look from Matlab. If it's awkward to use, I still have to write a Java version which will also wrap the Scala version and when I do that I can make it Matlab-friendly.

Regarding whether things are in data frames or something else, in whichever implementation, I don't think it really matters as long as there's not a big performance (or memory) penalty for putting it in that format. If there is, then asking for an extra method/function call is not unreasonable.

MichaelCurrie commented 8 years ago

I agree my loader in Python is too front-heavy right now. I'm currently refactoring so that each worm gets its own dataframe rather than all the worms in one, see #73. If later someone wants all the worms together that will be computed for them then.

@JimHokanson if you want to wrap Scala with Matlab, sounds good, after all the workflow we are hoping some labs will use is:

Experiment -> Matlab code to run experiment -> Matlab wrapper for Scala -> Scala encodes experiment as WCON -> WCON file is saved -> WCON file is loaded by Python version -> Python version is used by OWAT to generate features / statistics

(The exception to this is the Schafer Lab files, since I wrote a Python tool to convert their HDF5 format to WCON.)

@cheelee in my opinion it will definitely make sense for you to figure out how to "wrap" one of the existing versions (Scala, Python) rather than to start from scratch with your Octave version.

Ichoran commented 8 years ago

I would imagine there would be some of

Experiment -> someone saves some WCON somehow -> Matlab reads WCON -> Matlab generates features / statistics

also. If nobody used Matlab for data analysis, we wouldn't bother writing a WCON reader for it. Writing a basic writer is way easier than writing a reader. (I did it in about an hour for the Multi-Worm Tracker, and I'd completely forgotten how to write plugins and what the variable names were for the relevant data structures.)

cheelee commented 8 years ago

I'm still trying to fully process all of what's being discussed with regards to WCON capabilities for different languages and analysis tools, but the latest round seems to head in the direction of having some reference implementation (e.g. Scala) and then building the rest of the support in the form of wrapper libraries (perhaps automatically supported on the development side by something akin to the swig tool used for generating wrapper libraries.)

I think that's a valid (probably even good) approach, if you guys would like to pursue it. That approach tends to be taken with HPC tools (e.g. C/C++ back-end with loads of wrapper library front-ends), mostly to limit the amount of code maintenance involved whenever the tool/library's capabilities change.

@MichaelCurrie It does seem like SWIG will support Octave, Java, and R for library generation. So for those languages/tools, we ought to have automated support for wrapper library work. Am linking the swig web portal in case someone's not familiar with it - http://www.swig.org/.

cheelee commented 8 years ago

Just a quick FYI, I've begun looking into matlab and octave wrapper libraries for a Scala reference WCON library implementation. It does not look like a trivial exercise - Octave has interface capabilities to C/C++/Fortran via oct-files, whilst Matlab has interfaces to C via mex-files. My current idea is to figure out the C interface to Scala libraries, and then write C wrapper libraries for accessing Scala WCON capabilities followed by Matlab/Octave adapters to the C wrapper libraries.

Some short-term benefit of this approach is that changing the preferred reference implementation (say to another language) should only involve figuring out the C interface to the new implementation.

cheelee commented 8 years ago

It would appear I have just (re-)opened a messy can of worms with the icky way Java works on Mac OS X.

I will continue developing this line of thought on Ubuntu (through Virtualbox,) but trying to make C invoke JVM code via JNI on Mac OS X is proving painful (akin to jumping through hoops of barbed wire.) My original plan was to successfully invoke Java functions via JNI, and then look into extending the approach to Scala functions. If this gets any more painful, I'd advocate for a reference/alternative implementation based on a more stable platform like C/C++ that is more easily interfaced with by other languages.

Ichoran commented 8 years ago

@cheelee - Jim already had promising results with using the Java interface with Matlab. I'm not sure why you're proposing to do it in the way you're doing. If you want something with less awkward C bindings than Scala, why not use Python?

cheelee commented 8 years ago

This was driven in part by the apparent consensus in the group for Scala as the reference implementation for WCON. The basic principles I'm adopting as I approach this are:

  1. Implement Matlab/Octave for WCON as a wrapper library around one reference implementation.
  2. Preferably stick to just one reference implementation to avoid having to maintain multiple implementations.

If there's no actual consensus for some reference implementation, I can just wrap Matlab/Octave around something less painful at first, and then let you guys figure out the path forward. For the future, I'd advocate for having a minimal set of implementation instances for WCON. Modifying 4-5 implementations in response to changes in a standard is no fun.

cheelee commented 8 years ago

@Ichoran And thanks for the heads up on Java and Matlab/Octave. I was not aware they had a direct interface mechanism. I had originally searched for how Matlab/Octave would interface with external libraries. I can look down that path too :).

Ichoran commented 8 years ago

I actually think that having multiple implementations is a good thing for a standard that we hope is used by many labs: it raises the bar on making gratuitous changes to the specification, and helps ensure that the specification is actually specified by text, not by a reference implementation. If you look at the difference between JSON and HDF5, for instance, JSON is fully specified in a short text document and has dozens of independent implementations, while HDF5 is mostly specified by a series of long text documents but really is only fully specified by the C++ reference implementation, which is as far as I can tell the only implementation (HDF5 group's Java implementation uses JNI to call the C++ library!).

That said, there's no great reason to do work a huge number of times, so a small set of implementations is probably enough. Eventually I will (if no-one beats me to it) do one in Rust or C++ that provides C bindings.

Also, I think that C is a spectacularly bad choice for a reference implementation because it is so low-level. You don't want accidents of pointer arithmetic, for instance, to determine expected behavior, and you want your error-checking to be pretty robust and deliver somewhat helpful messages. And if you want to change something in the specification, it's nice to be able to quickly change the reference implementation to try it out, which is harder with a language that does not let you work mostly at a high level. You might want a C implementation to be the primary one for usage if it is fast and not too hard to wrap, but not to specify the behavior.

That said, whether Scala or Python (or something else at equivalent level) is the reference implementation doesn't matter much. (The fairly weak arguments in favor of Scala are summarized in the Why Scala section of the Scala implementation's README.md.)

cheelee commented 8 years ago

@Ichoran That's cool :). I can swing both ways on this design point. I don't think we're at a point where we ought to be attempting to support many dozens of different languages each with their own implementations anyway. To be fair, I'd personally prefer writing wrapper libraries unless there are strong performance-related or functionality-driven imperatives.

To be honest, C isn't a terrible choice for relatively simple library implementations when done carefully. C++ runtimes, for example, have name-mangling issues for libraries across different compiler implementations that have to be navigated. These issues can get somewhat frustrating for users, and the frustrations can carry over to the devs.

Anyway I think what I'm going to do is quickly gun for an Octave-WCON wrapper library around anything that's functional, and is flexible enough to pivot on how we'd decide to move ahead with WCON capabilities and specification. I'm very firmly in the camp of text specifications not changing much nor frequently. The only reason I had even thought to cast about for some stable reference implementation (e.g. Scala) was so I could get Octave-WCON up and running quickly without having to implement WCON features natively the way I had originally started out.

JimHokanson commented 8 years ago

@cheelee Given the requirements of changing the Java memory heap in Matlab I'm leaning towards a mex implementation. There are two different ones out there currently that I am aware of:

  1. https://github.com/christianpanton/matlab-json
  2. http://www.mathworks.com/matlabcentral/fileexchange/55972-json-encode---decode

Neither of these is ideal as neither allows direct conversion of JSON arrays to numerical arrays. Instead you need to read in all of the data as cell arrays and then convert them to numerical arrays, which is a big waste of time and memory. The latter option uses the "jsmn" tokenizer. I will probably spend a couple of hours quickly writing a parser based on the tokens. The parser will nicely handle array parsing in the "data" section.

Going forward though my intention is to spend more time on the writing aspect and data manipulation rather than trying to get the best parser possible.

Ichoran commented 8 years ago

@JimHokanson - How bad is the waste of time and memory? If you can read stuff approximately as fast as with the Python implementation, it should be good enough at least initially.

@cheelee - Well, the key word is "done carefully". Most things are okay when done carefully! For example, C++ mangles names but you can turn that off with extern "C".

cheelee commented 8 years ago

@JimHokanson Hmmm ok, I'll see what I encounter while I make a separate Octave attempt, and then maybe we can compare notes! Thanks!

Ichoran commented 8 years ago

@JimHokanson - Also, just how difficult is it to change the memory settings? It might be kind of annoying and time-consuming and poorly-documented and different on each platform, but fixing all of that with a clear set of instructions in the README might be a lot better overall than creating a new implementation.

Then again, if this is an excuse to get a good C or C++ implementation, I certainly wouldn't argue! I'm going to be writing WCON from C++ (but I was going to write an ad-hoc routine to spit out the internal data structures as WCON and have no read capability).

JimHokanson commented 8 years ago

@Ichoran Regarding the time difference with converting from individual data elements to numerical arrays, my rough testing put it at nearly as bad as, if not worse than, pure Matlab implementations, namely JSONlab. That speed (for JSONlab) was about 20s; however, it also has a known bug and other pitfalls.

The issue with any generic parser is the difference between {1,2,3} (a cell array where each element could hold anything and you have pointers and headers for each element) vs [1,2,3] (a numerical array where each element is of the same type and you only have 1 header structure). That being said, I've modified the jsmn C tokenizer and now have a decent parser that gets your data file into a final Matlab format in about 3.5 seconds on my desktop and 3 seconds on my 2013 Macbook Air. My desktop is generally a lot faster so I'm wondering if the compiler is making a large difference (MSVS 12 vs Xcode 7 with Clang). I have a few tweaks and a few checks to make but I think the final time will probably remain the same.
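For readers who don't use Matlab, the same distinction can be shown in Python: a `list` behaves like the cell array (each element is a separately boxed object), while `array.array("d")` behaves like the numerical array (one header, contiguous doubles). The helper `to_numeric` below is hypothetical and does the conversion after a generic parse, which is the wasteful two-step route described earlier in the thread; it illustrates the distinction only and is not the jsmn-based parser.

```python
import json
from array import array


def to_numeric(values):
    """Return a packed array of doubles if every element is a plain number.

    Anything else (strings, nulls, nested lists, booleans) stays a list,
    since a typed array cannot hold mixed element types.
    """
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if values and is_number:
        return array("d", values)
    return values


parsed = json.loads('{"x": [1, 2, 3], "mixed": [1, "a", null]}')

x = to_numeric(parsed["x"])          # homogeneous numbers: packed doubles
mixed = to_numeric(parsed["mixed"])  # mixed types: left as a generic list
```

A parser that knows the "data" section holds numbers can fill the packed array directly and skip the intermediate list altogether.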

Regarding the Java settings, it is really easy to change (it is a setting in the menu). However, it's really annoying to have to permanently set aside a chunk of memory (when the program is running) unless you have more memory than you know what to do with. Then, if you happen to exceed the set aside memory requirement when making Java calls, you have to go in again and adjust, even if it was only a one time large memory request. All of this escaped my mind when originally thinking of using a Java implementation.

So in summary at this point I think I've made enough progress that I'll be moving ahead with my C/Matlab parser and focusing on the class (i.e. WCON specific) functionality.

For those that are curious the json implementation can be found at: https://github.com/JimHokanson/jims_json_testing/tree/master/jsmn_mex

Ichoran commented 8 years ago

Okay, great! Good luck!

Ichoran commented 8 years ago

@JimHokanson - For what it's worth, the next iteration of the Scala implementation will read my data file in a little over half a second on my laptop (taking advantage of a much faster JSON parser). I doubt the difference in speed will matter for anyone using Matlab, though. (It will matter for me, as I intend to produce a few thousand of those data files per day.) And I'm not sure how much of your 3-3.5 seconds were spent manipulating data structures in Matlab instead of parsing.

JimHokanson commented 8 years ago

@Ichoran Just for my reference, does that half second include the time spent reading the file into memory? I'm trying to gauge how fast my C code is working. :)

Ichoran commented 8 years ago

The file's coming off an M.2 SSD, so the file read time is under a tenth of a second. So approximately, "no". If you read into memory first, or read a bunch of times and take the 10th one or something after the OS has had maximal chance to cache it, it'll probably be a fairer test. Again, I don't think it matters too much at this point. Existence is better than speed, as long as the performance isn't so abysmal that it is not effectively different than not-existing.
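A sketch of that kind of timing in Python, for comparison purposes: the bytes are held in memory before the clock starts, and the same buffer is parsed repeatedly so a later run can be reported. The function name `time_parse` and the generated stand-in payload are assumptions; with a real file the buffer would come from reading it in binary mode before timing.

```python
import json
import time


def time_parse(raw_bytes, repeats=10):
    """Parse the same in-memory bytes `repeats` times; return seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        json.loads(raw_bytes)
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in payload so the sketch runs without a real data file.
raw = json.dumps({
    "data": [
        {"id": "1", "t": i * 0.1, "x": [float(i)], "y": [float(i)]}
        for i in range(1000)
    ]
}).encode("utf-8")

timings = time_parse(raw)
print(f"10th run: {timings[-1]:.6f} s, best: {min(timings):.6f} s")
```

Because disk access happens before the timed region, the numbers reflect parsing only, which makes results from different machines and languages easier to compare.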

cheelee commented 8 years ago

Just a quick update on Octave-WCON implemented as a wrapper around the Python implementation. The very basic embedded-Python C/C++ API bindings are complete for both the WCONWorms and MeasurementUnit classes. They may now be invoked by any client C/C++ code.

The next significant steps are:

  1. Connecting Octave-C++ bindings generated using the swig tool, to the C/C++ wrappers for the Python implementation. This should be pretty straightforward, almost trivial.
  2. Building an Octave-specific sub-API for handling data elements: units, metadata, data, data_as_odict. These are the somewhat non-trivial ones (e.g. I'll need to figure out pandas DataFrames) but worm_ids should be reasonably straightforward.
  3. Building Python 2 vs Python 3 robustness into the wrapper library.
  4. Refactoring the wrapper library to be sensitive to thread safety issues.
  5. Refactoring the wrapper library as a code cleanup operation (it's kind of a quick-and-dirty mess right now.)
  6. Considering maintenance issues (i.e. how the wrapper library will have to react to changes in the Python implementation from a developer documentation standpoint)

Once I get 1-5 done, I'll make another pull request, and then work on automated regression testing, and performance regression testing for the wrapper library.

MichaelCurrie commented 7 years ago

@cheelee and I discussed this over the weekend and we'd like to update this issue to better reflect the reality of where the project is going.

So we can close this issue.