openworm / OpenWorm

Repository for the main Dockerfile with the OpenWorm software stack and project-wide issues
http://openworm.org
MIT License

Create Python Script that explores WormBehavior data repository #82

Closed slarson closed 10 years ago

slarson commented 11 years ago

The C. elegans behavioral database in the UK contains 10,000 independent movies (examples here) of both wild type and mutant worms moving around that have been segmented and digitized. From this digitization, hundreds of features about the movement of these worms, from a few minutes to several hours, have been extracted. The main citation for this currently is Brown et al., 2013

This data set is potentially crucial to validate the OpenWorm model.

This task is to use the h5py library to create a script that explores the .mat files of the WormBehavior data repository. Basic functionality should demonstrate how to:

  1. Access a .mat file from the project
  2. Extract the feature structure from a specific .mat file
  3. Extract examples of time series from a specific .mat file and plot them using matplotlib.

The script should be well documented and serve as a jumping off point for more serious analysis of these data sets.

Here are notes for how to get a hold of the data files from Ev Yemini, the main creator of the data set who has been the point of contact to the project:

Everything is located under: ftp://anonymous@ftp.mrc-lmb.cam.ac.uk/pub/tjucikas/wormdatabase/results-12-06-08/Laura%20Grundy

I would strongly suggest using an FTP client to download large subdirectory structures that suit your needs.

The next part looks confusing but it's really simple to get used to and fairly straightforward. The subdirectories are organized as follows (don't worry about parsing annotations from the directory structure, all annotations are present within the feature files as well):

  1. When present, the first subdirectory is the gene name (e.g., "unc-8"); otherwise, for wild isolates and N2 the subdirectory is "gene_NA".
  2. When present, the next subdirectory is the allele (e.g., "n491n1192"); otherwise, for wild isolates and N2 the subdirectory is "allele_NA".
  3. The subdirectory thereafter is the strain name (e.g., "AQ2947" is the Schafer lab copy of the CGC's N2). The strain name is always present.
  4. Beyond this point the subdirectories describe whether the worm is on food ("on_food" or "off_food" -- only a small subset of N2s and MECs were done off food). The sex ("XX" or "XO" -- the only males are N2). Whether a habituation period was observed ("30m_wait" or "no_wait" -- 25 N2 experiments were done with no habituation and recorded for 2 hours straight; otherwise, we always observed a 30 minute habituation period).
  5. At the end the subdirectories become far less meaningful to you. They indicate the ventral side ("L" = anti-clockwise or "R" = clockwise -- this can be confusing due to the orientation of the video vs. the experimenter's annotation). The tracker we used (1 through 8). The date (YYYY-MM-DD___HH_MM_SS). And, finally, the experiment's filename. The actual feature files contain further annotations (e.g., the room we used, the frame rate, ...).

Here are 2 examples:

  1. unc-8(n491n1192) ftp://anonymous@ftp.mrc-lmb.cam.ac.uk/pub/tjucikas/wormdatabase/results-12-06-08/Laura%20Grundy/unc-8/n491n1192/MT2611/on_food/XX/30m_wait/L/tracker_2/2010-03-19_09_14_57/unc-8%20(rev)%20on%20food%20R_2010_03_1909_14_5722_features.mat
  2. CB4856 - the famous Hawaiian wild isolate ftp://anonymous@ftp.mrc-lmb.cam.ac.uk/pub/tjucikas/wormdatabase/results-12-06-08/Laura%20Grundy/gene_NA/allele_NA/CB4856/on_food/XX/30m_wait/L/tracker_1/2010-11-25_11_33_52/399%20CB4856%20on%20food%20R_2010_11_2511_33_5211_features.mat
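The directory layout described above can be turned into annotations with a few lines of Python, which is handy for choosing files before downloading them (the feature files themselves remain the authoritative source of annotations). This is only a minimal sketch: the `Experiment` tuple, its field names, and `parse_experiment_url` are made up for illustration, and it assumes every file sits exactly ten levels below the `Laura Grundy` directory, as in the two examples.

```python
from collections import namedtuple
from urllib.parse import unquote, urlparse

# Hypothetical record type; the field names are not taken from the database.
Experiment = namedtuple(
    "Experiment",
    "gene allele strain food sex habituation ventral_side tracker timestamp filename",
)

ROOT_MARKER = "Laura Grundy"  # directory under which the experiments live


def parse_experiment_url(url):
    """Split a feature-file URL into the annotations encoded in its path."""
    parts = [unquote(p) for p in urlparse(url).path.split("/") if p]
    if ROOT_MARKER not in parts:
        raise ValueError("URL is not below the %r directory" % ROOT_MARKER)
    fields = parts[parts.index(ROOT_MARKER) + 1:]
    if len(fields) != len(Experiment._fields):
        raise ValueError("expected %d path levels, got %d"
                         % (len(Experiment._fields), len(fields)))
    exp = Experiment(*fields)
    return exp._replace(
        # Wild isolates and N2 use placeholder directories.
        gene=None if exp.gene == "gene_NA" else exp.gene,
        allele=None if exp.allele == "allele_NA" else exp.allele,
        # "tracker_2" -> 2
        tracker=int(exp.tracker.rsplit("_", 1)[1]),
    )
```

Calling `parse_experiment_url` on the CB4856 example yields a record with `gene` and `allele` set to `None` and `strain` set to `"CB4856"`.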

We've created a simple IPython Notebook to walk through an example of using Python to grab and open these files.

Even more information is available on this thread

davidad commented 11 years ago

Very cool! .mat files are in HDF5 format, a severely underappreciated standard and possibly the best thing about MATLAB. You might consider representing the FTP server as a git-annex (not a built-in git feature, at least not just yet, but covers a lot of use cases including, I think, this one).

slarson commented 11 years ago

Interesting suggestion, David-- thanks for that. I am not familiar with Git-annex so this is a good opportunity to learn.

More info on git-annex is here:

http://git-annex.branchable.com/

Thanks, Stephen


slarson commented 11 years ago

@balazs1987 -- did you say you were maybe going to take a look at this database?

aruscher commented 11 years ago

Python 2 vs. Python 3? Or have I missed a discussion?

vellamike commented 11 years ago

Python 3 if possible. We have a lot of code that's only executable with the python 2 interpreter, but anything written for python 3 will work with the python 2 interpreter anyway.

At some point in the future we will translate all this to Python 3.

aruscher commented 11 years ago

Where can I find all the Python code?

vellamike commented 11 years ago

For this specific issue I don't think any code has been written yet - @slarson and @balazs1987 can you help @aruscher out with this - you both know a lot more about this issue than I do.

szb37 commented 11 years ago

hello,

I had a look at the code, but only with the specific purpose of working out how we can use it to construct a Markov model of behavioural motifs, that is, to put the temporal components into the eigenworms. Based on the email communication I think Ev and the Cambridge team are working on something similar at the moment, but we can look at it on our own as well.

Because I have very little computer/programming skills I do not think I can take this task on by myself. However, if Andreas is interested we can have a crack at it together. Andreas, would you be interested in pushing this forward together? I know what we should do with this data, but would have trouble implementing it on my own; if you can help me with that I think we can make real progress on this.

Best, Balazs


vellamike commented 11 years ago

Hi Balazs,

I would suggest you write a document detailing the direction you wish to pursue on this and the mathematical techniques you will use; otherwise it will be very difficult for anyone with programming skills to work on this.

Mike


aruscher commented 11 years ago

@balazs1987 yes we can (try it) :-p

slarson commented 11 years ago

@aruscher -- I've linked the PyMat library that would be the foundation for this in the description. Is that enough to start with?

aruscher commented 11 years ago

I have no version of MATLAB -.- PyMat needs MATLAB, SciPy needs MATLAB ... to open a .mat file. I will try h5py or something else. Can anybody provide a rip of the whole FTP repo converted to txt or another readable format? AFAIK there are ways to convert it to txt with MATLAB.

szb37 commented 11 years ago

Hey Andreas,

How much do you know about the worm and its behaviour? I intend to write a detailed document - something like what Mike mentioned -, but depending on your level of expertise I will skip different parts.

Best, Balazs


vellamike commented 11 years ago

Balazs,

I think it would help in general if you put in as much detail as possible. A wise man told me no scientist ever minds being told something they already know.

Mike


slarson commented 11 years ago

@AndrewPa Check out this issue for a place to help out with Python

JimHokanson commented 11 years ago

@slarson I'm looking to help out with the project (in general) and this seems like an easy place to start. I'm just learning Python but I have a lot of experience with Matlab. Please advise.

SuFizz commented 11 years ago

In case the worry is that more people can't contribute because of the .mat files, we can use Octave to load the files and convert them into .txt files. I am sure most of you have already thought about using Octave to get around needing Matlab, but anyway.

slarson commented 11 years ago

Hi all -- thanks @JimHokanson and the others for volunteering. Several folks are interested in this. To kick this off I have updated the issue with some additional key information to get going:

We've created a simple IPython Notebook to walk through an example of using Python to grab and open these files.

Even more information is available on this thread

Next steps should be trying to reproduce some of the functionality from Ev's Matlab scripts in Python as described in the thread. This will open up the functionality of the database to a much larger group of hackers who don't easily have access to Matlab.

slarson commented 11 years ago

BTW @JimHokanson if you are new to the Python world from Matlab, the following references should be helpful:

http://wiki.scipy.org/NumPy_for_Matlab_Users
http://python-for-matlab-users.googlecode.com/files/tutorial_slides.pdf

JimHokanson commented 11 years ago

@slarson Thanks. Where should I put the code? Personal repository that gets forked later? I'd also like to document what I've read about so far (summarizing this issue and the group thread and looking at the code). Suggestions?

slarson commented 11 years ago

@JimHokanson for code I would recommend starting with creating multi-file gists and when there is enough code transitioning to a repo under the openworm/OpenWorm repo.

For documenting what you've read so far, sharing a link to a public Google doc would be excellent.

If you'd like to set up a hangout some time to discuss this further, let me know.

Thanks for jumping into this!

AndrewPa commented 11 years ago

Right now I'm working on expanding @slarson 's code such that it processes every filename on the ftp server as it is downloaded. It seems Stephen was looking to replace the complicated directory structure with descriptive filenames that could be more easily managed in a single directory. I may also change the formula of the filenames to eliminate the whitespace, and use the filenames later as hooks for bulk analysis.

@JimHokanson I can't make heads or tails of the h5py output. Since you have experience with MatLab, could you download a couple of .mat files from Laura Grundy's ftp (posted above) and describe what sort of data each one contains? It might give us an idea as to how to take apart the files in a useful way.

JimHokanson commented 11 years ago

@AndrewPa I too have a modified version of the code which I think will be an improvement over @slarson 's version. I have a few things I need to clean up but I'll try and commit and push what I have later tonight. Ev (our data contact person) is giving a presentation this week on a database system that might significantly change the details of this approach. I'm curious as to your Python/programming/science background?

I have some notes already on the structure of the content of the files, but again I wanted to wait until after the presentation. There is a free viewer program you can use to look at the files if you like: http://www.hdfgroup.org/hdf-java-html/hdfview/

@slarson has mentioned setting up a group meeting after the presentation ...

I have a lot of ideas on where to take this (this being in my mind model validation via motion analysis and comparison). I'll try and write up a formal document before the meeting. I think it is best to have a more formal plan and then to subdivide tasks appropriately.

Does that sound alright? I don't mean to discourage progress. I'm excited to have someone else working on this.

@slarson feel free to comment on your desire to get something soon versus starting base development of something more substantial. My personal preference is to spend a lot of time thinking and checking what's out there before starting to code.

AndrewPa commented 11 years ago

@JimHokanson I'm a recent B.Sc. graduate (molecular biology/genetics). I've been doing bioinformatics work for about a year now, mostly using the Unix shell. I've done a fair share of scripts before. I've taken a couple of courses in Python (3.x and 2.x) and am working on JavaScript as well. I'll be using Python 2.7.x for now since most module fetches seem to be specific to this version. I see very little reason to migrate at this point, but who knows?

I'd prefer to follow @slarson 's suggestion of creating gists of new small scripts before doing VC just yet. What exactly do yours do? Also I'd prefer if we could just discuss things as they come up instead of waiting for meetings.

slarson commented 11 years ago

Hi guys -- thanks so much for the discussion so far. Looks like we have a range of views on how to get started, which is great. My general approach is to avoid putting up roadblocks to anyone who wants to get involved. I also appreciate the perspective of having a lot of things laid out well in advance, but I find this in general happens iteratively.

I am now sending a poll for us to meet with Ev and Andre next week so we can do a face to face and hash through some issues. Even if their paper isn't quite published yet, it sounds like they can accelerate us greatly as soon as the worm meeting is over, so I would take them up on that offer now, and then we can always debrief on the paper when it comes out.

@JimHokanson it sounds like your strength in MatLab will be extremely valuable for digesting Ev's code when it is available. However, in advance of that, @AndrewPa is motivated to play with the data sets using what we have right now and I'm interested to see what he discovers.

@JimHokanson Great find on the HDF5 viewer! If either of you (or anyone else) would like to post your experiences using it on one of our datasets, perhaps with screenshots, I think that would be most helpful.

The idea behind this GitHub issue was to convince ourselves we could close the loop of 1) get the files, 2) read the files, and 3) plot a time series in the file. The code for this would essentially be the "hello world" we need before getting fancier, like determining which time series we most want, what pieces of Ev's Matlab code we want to script in Python first for OpenWorm, and then also start sizing up the different data sets.

I also invite you both to add new issues and assign yourselves if you are enjoying going in a different direction than the three steps above with this data set. Worst case, it is redundant or we don't get to it. Best case, it motivates another person to jump on and help! For example, I just generated #126 as another avenue we need to explore about the database.

Looking forward to the code review of what you guys have been working on.

Thanks!

AndrewPa commented 11 years ago

In the interest of jumping into things, here are the (very modest) components of the script I was talking about before:

https://gist.github.com/AndrewPa/5855669

@JimHokanson Did you have something similar? Please post your own gist :)

JimHokanson commented 11 years ago

@AndrewPa Sounds good. I am running my listing now. I like your function name! I've uploaded my gist at: https://gist.github.com/JimHokanson/5856338

I was running the code but it just threw some cryptic error ...

I need to add a couple of lines of saving code and then no one should need to run the code again unless they want to resync. I also might try changing the way I change directories as the full paths may be unnecessary and slow things down. Going forward, a set of steps seems to be:

1) Creating a separate list of file names. My function currently holds the complete path.
2) Creating a function which allows you to request the data for a particular name. If the data doesn't exist yet on local disk it would fetch from the remote disk. This could take as its input the full path then save locally as just the name.
3) Determining what methods are used for HDF5 data processing. The structures are deeply nested and it seems like it would be useful to grab them by name, i.e. my_data = getField(f,'level1.level2.level3.level4')
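Steps 2 and 3 above could look roughly like the following. This is a sketch, not code from either gist: `local_copy`, `get_field`, and the `download` callback are hypothetical names, and the actual transfer is left to whatever FTP code is already in use.

```python
import os
import posixpath
from urllib.parse import unquote


def local_copy(remote_path, cache_dir, download):
    """Return a local path for remote_path, fetching it only if not cached.

    download is any callable taking (remote_path, local_path) that writes
    the file, e.g. a thin wrapper around ftplib's retrbinary.
    """
    # Take the full remote path as input, save locally under just the name.
    name = unquote(posixpath.basename(remote_path))
    local_path = os.path.join(cache_dir, name)
    if not os.path.exists(local_path):
        os.makedirs(cache_dir, exist_ok=True)
        download(remote_path, local_path)
    return local_path


def get_field(node, dotted_path):
    """Follow 'level1.level2.level3' through nested groups to the leaf."""
    for key in dotted_path.split("."):
        node = node[key]
    return node
```

`get_field` only relies on `[]` indexing, so it works on an open `h5py.File` as well as on plain nested dicts. h5py also accepts slash-separated paths directly, e.g. `f["level1/level2/level3"]`, which may make the helper unnecessary.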

Going forward, other than updating the code to handle saving, I will not be working on this aspect of issue 82. This is the first Python code I've written (except the first few lessons at Codecademy). At this point in time I need to get a better handle on my development environment and debugging.

@slarson Thanks for saying what I was basically thinking. I like getting people involved and in letting people move forward. Ev's hesitancy to share some info has made it a bit awkward (with regards to his concern about not sharing his thesis and the details contained therein). That being said I'll try and write up a document on the file contents. From talks with Ev it wasn't clear to me how stable (consistent) the contents are. Finishing this issue as stated sounds like a good goal. The "hello world" analogy is nice.

Other questions:

AndrewPa commented 11 years ago

@JimHokanson @slarson Hold on a second... why are we making our own scripts to retrieve files from an FTP server when we can use an FTP client like FileZilla? I think the main concern is parsing the HDF5 data. Have we been barking up the wrong tree...? It was good practice though...

P.S. I used HDF5 viewer and found that the data we're probably interested in is under the "worm" directory. It contains a ton of positional data. @JimHokanson can you use MatLab to see if any of the number tables in the "worm" directory from HDF5 viewer correspond to some sort of position-time information?

JimHokanson commented 11 years ago

@AndrewPa I think that is a fair question. For me it was largely an issue of just playing around with Python! Some thoughts:

1) FileZilla does not provide an easy way, from what I can tell, to list the directory contents recursively and save those results to disk for later processing. If you only want to look at one file at a time this is not an issue.
2) In general, I prefer to have programmatic access to things, as it makes it easier to do what I want instead of relying on someone else. Let's say I wanted to find new files by comparing my list with the new directory structure. I could use date-modified information to make this process much faster. I could also query programmatically which videos are available on food for a given strain and sex, etc. A directory structure is a poor data interface format that ideally is hidden from the user as much as possible.
3) Ev, the person providing the data, has alluded to a new database system which might make any FTP approach irrelevant.
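For reference, a recursive listing along the lines of point 1 can be sketched with the standard library's `ftplib`. This is an illustration rather than the code from either gist: `walk_ftp` is a hypothetical name, it has not been run against the MRC server, and it assumes that anything it cannot `cwd` into is a file (so a directory with denied access would be misreported as a file).

```python
import ftplib


def walk_ftp(ftp, root):
    """Yield the full path of every file below root on an open connection."""
    pending = [root.rstrip("/") or "/"]
    while pending:
        directory = pending.pop()
        ftp.cwd(directory)
        try:
            names = ftp.nlst()
        except ftplib.error_perm:
            # Some servers answer with a 550 error for an empty directory.
            names = []
        for name in names:
            base = name.rsplit("/", 1)[-1]
            if base in (".", ".."):
                continue
            path = directory.rstrip("/") + "/" + base
            try:
                ftp.cwd(path)       # succeeds only for directories
            except ftplib.error_perm:
                yield path          # could not enter it: treat as a file
            else:
                pending.append(path)


# Example (needs network access; host and path as given earlier in the thread):
#     with ftplib.FTP("ftp.mrc-lmb.cam.ac.uk") as ftp:
#         ftp.login()  # anonymous
#         root = "/pub/tjucikas/wormdatabase/results-12-06-08/Laura Grundy"
#         with open("file_list.txt", "w") as out:
#             for path in walk_ftp(ftp, root):
#                 out.write(path + "\n")
```

This costs one round trip per entry, so it is slow on a large tree; saving the list to disk once, as suggested above, avoids repeating the walk.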

I agree it was good practice. I've learned quite a few things about Python by doing it!

Regarding the data, I'll try and take a look at it later tonight, maybe around 8PM EST. In the meantime the following might be useful:

1) Finding or writing a script to print the data in a nice way; perhaps pprint works?
2) Determining the best way to address the data. In Matlab this is really simple, e.g. my_value = worm.posture.skeleton.x
3) Incidentally, worm.posture.skeleton.x and worm.posture.skeleton.y provide the positional data for the worm. I'm not sure where the timing is ...
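For point 1, a small recursive printer is one option. The sketch below uses a hypothetical `describe` helper written against anything that exposes `.items()`, which includes h5py groups and plain dicts; for leaves it prints `shape` and `dtype` when the object has them.

```python
def describe(node, name="", indent=0, out=print):
    """Print a nested group structure as an indented tree."""
    pad = "  " * indent
    if hasattr(node, "items"):
        # Group-like node: print its name and recurse into the children.
        out(f"{pad}{name}/")
        for key, child in node.items():
            describe(child, key, indent + 1, out)
    else:
        # Dataset-like leaf: report shape and dtype if available.
        shape = getattr(node, "shape", None)
        dtype = getattr(node, "dtype", None)
        out(f"{pad}{name}  shape={shape} dtype={dtype}")


# Example with an HDF5-based .mat file (requires the h5py package):
#     import h5py
#     with h5py.File("some_features.mat", "r") as f:
#         describe(f)
```

h5py groups also provide a built-in `visititems` method that walks the tree and calls a function on each member, which may be enough on its own.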

Best of luck

AndrewPa commented 11 years ago

Alright. I'll see what I can do with the files' output. I'm going to do some more research on the format later today.

slarson commented 11 years ago

I'll reply with a code review later (awesome stuff you guys) but right now I just want to make sure you guys get a chance to respond to the poll for the meeting on this: http://doodle.com/grs3pq5nm8bewpns

AndrewPa commented 11 years ago

I ran into a web designer the other day who noticed I was studying up on JavaScript. After we chatted for a while, he explained to me that there's a relatively new data interchange format called JSON that can be used to transfer data structures between formats. It uses JS-like syntax and data structures, which I'm somewhat familiar with. It occurred to me just now that somebody out there in cyberspace must have come up with some way to parse .mat files into this "universal" format. After a quick search, lo and behold...

http://www.mathworks.com/matlabcentral/fileexchange/27169-json4mat/content/html/json4mat_pub.html

I'm going to look deeper into this later on. It could be a very useful intermediate between the .mat files and the output into Python -- or any other language new members may be more comfortable with; JSON constructs can be injected into myriad different systems and languages it seems.

JimHokanson commented 11 years ago

EDIT: Removed Google Drive documentation, moved content to openworm_docs repo END EDIT

As mentioned before, plotting the skeleton would be a good first "Hello World"
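A skeleton "Hello World" might look like the sketch below. It rests on several assumptions that should be checked against a real file: that the file is HDF5-based and opens with h5py, that `worm/posture/skeleton/x` and `.../y` are plain numeric datasets, and that the first axis as seen from h5py indexes frames. `plot_skeleton_frame` and its parameters are names chosen here for illustration.

```python
def plot_skeleton_frame(mat_path, frame=0, png_path="skeleton.png"):
    """Plot one frame of the worm skeleton and save it as a PNG."""
    # Imported here so the function can be defined even where the optional
    # third-party packages (h5py, matplotlib) are not installed.
    import h5py
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    with h5py.File(mat_path, "r") as f:
        # Assumption: axis 0 is the frame index.
        x = f["worm/posture/skeleton/x"][frame]
        y = f["worm/posture/skeleton/y"][frame]

    fig, ax = plt.subplots()
    ax.plot(x, y, marker=".")
    ax.set_aspect("equal")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title(f"Skeleton, frame {frame}")
    fig.savefig(png_path)
    plt.close(fig)
    return png_path
```

If the plotted points look like a single coordinate traced over time rather than a body outline, the axes are probably the other way round and the slice should be `[:, frame]` instead.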

The ftplib code I uploaded previously requires a bit of retooling to support disconnects and restarting. I'm getting errors after about 30 minutes of running the code. At this point I'm holding off on any list generation until talking with Ev.

@AndrewPa It isn't clear to me what advantage switching to json would have at this moment. I typically think of JSON as being suited for creating human readable files that one could edit manually if they desired.

Some other notes on file formats:

1) These mat files are stored using HDF5 which is a publicly specified standard for sharing scientific data. The only thing specific to Matlab are Matlab classes which are probably stored as opaque blobs in the file.
2) JSON is not a binary format, which means that conversion to JSON can cause loss of numeric precision. There is a format BSON (binary JSON) but I am not very familiar with it.
3) In case you're curious, some other popular web formats are XML and YAML. Other scientific data formats are listed at: http://en.wikipedia.org/wiki/List_of_file_formats#Scientific_data_.28data_exchange.29 Of particular interest would be the multi-domain formats such as NetCDF and HDF5
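
On the precision point, how much survives a trip through text depends on how many digits the writer emits. A small sketch, using Python's standard json module as an assumed converter (no specific tool has been chosen in this thread) and an arbitrary test value:

```python
import json

value = 1.0 / 3.0

# Python's json module writes floats using repr, which round-trips exactly.
exact = json.loads(json.dumps(value))

# A writer that keeps only six significant digits loses information.
truncated = json.loads(format(value, ".6g"))

print(exact == value)      # True
print(truncated == value)  # False
```

So the risk is real for any converter that rounds its output, and a text encoding will also be considerably larger than the binary original.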

@AndrewPa Let me know if you'd like me to complete some aspect of this otherwise I'm leaving the rest to you.

AndrewPa commented 11 years ago

Alright, I'll handle the rest.

AndrewPa commented 11 years ago

Sorry about the hiatus. I'm focusing on my website right now.

I'll use FileZilla to download the directories because it's much faster and more stable than anything I can write anytime soon. Then I'll modify "filehoover" to rename the files and put them into a central directory, if needed. Not sure if it's going to be necessary.

I decided it would be best to use PyTables to extract the .mat data. I've managed to print out the data structure but not the specific data quite yet. I don't think I'll have time to work on it for a while. But the priority is to take apart a single file first, not to re-organize the data structure.

PeterMcCluskey commented 11 years ago

How would behavior data be useful without corresponding sensory input data?

I see a bit of environmental data in the repository, but I don't understand how we could construct a useful model of how the environment is affecting the worm's behavior.

Wouldn't it make more sense to forget about this data and plan to use data from Nemaload for validation?

PeterMcCluskey commented 11 years ago

I've written a Python script (ViewWormBehaviorData.py) to generate graphs of the timeseries in a file that has been downloaded:

https://gist.github.com/PeterMcCluskey/6418155

I haven't tried to understand what the numbers mean. I don't plan to do anything more with this in the near future.

vellamike commented 11 years ago

@PeterMcCluskey could you give a link to an example file to work with?

PeterMcCluskey commented 11 years ago

@vellamike - see the links at the bottom of the first comment in this thread.

vellamike commented 11 years ago

@PeterMcCluskey just tried it and it works well - really nice work! Some of the pngs (e.g. /locomotion/turns/omegas/frames/time) are blank plots with no data on them - what is the reason for that?

JimHokanson commented 11 years ago

@PeterMcCluskey Thanks for the gist.

@MichaelCurrie has agreed to help with getting the Python code moving

@vellamike The plots indicate that none of those events occurred. In the particular case you mention, the worm never did any omega bends during the recording.

More info to follow in the next few days.

JimHokanson commented 11 years ago

Gist for plotting worm movement. https://gist.github.com/JimHokanson/6425605

More details on what the fields mean, where we're at, and ideas on where we might want to go will follow in the next few days.

JimHokanson commented 10 years ago

All. I've written a brief progress report at: https://github.com/JimHokanson/openworm_docs/blob/master/Movement/Data/MRC_HDF5/progress/ProgressReport_2013_09_04.md

@slarson Let me know if you'd like to see anything else before closing this issue. At some point we'll need to discuss the next set of steps.

gidili commented 10 years ago

Woah nice write-up - would be nice to have more of these documenting what's going on in the various areas of the project and maybe consolidate all of them in the wiki under the root openworm repository

JimHokanson commented 10 years ago

Closing this issue. Going to be starting a few new ones branching off from this ...