Setting up neuro.pangeo.io

jhamman commented 5 years ago

I just sat down with @arokem to discuss setting up a pangeo for the neuroscience community. He is interested in test driving the recent terraform/hubploy work that @dsludwig has been doing.

I know @dsludwig is working on some documentation for the following repositories:

@arokem has some experience with deploying jupyterhub on GCP, so he may be a good person to test drive the new system as an "experienced user".

arokem commented 5 years ago

👋 Hi everyone! Thanks for introducing me to everyone, @jhamman! I am looking forward to working with you all on this. Let me know when you have documentation up on these repos. In the meanwhile, I'll be trying out the existing documentation (now with #378 merged!) in the next couple of days.

rabernat commented 5 years ago

I feel like @choldgraf may be interested in following this issue.

choldgraf commented 5 years ago

yessssssss

jhamman commented 5 years ago

@arokem - how's this going? It sounds like you've been making progress using the setup docs on our website? Is there anything more we can do to help you out?

arokem commented 5 years ago

Thanks for checking in!

I took the opportunity to move along the deployment/setup docs in #402 (hopefully in the right direction!), and I have been experimenting a bit more with some relevant datasets (and with zarr!), but it's been a bit slow-going. I'll need to do some work with my collaborators here to get some more datasets and analysis up and running, before starting up the actual thing. Might be a few more weeks. So, no help needed on that end.

I am still waiting on documentation for the repos you mentioned, for the even more automated version of this.

choldgraf commented 5 years ago

@arokem do you have a project roadmap of some sort for this? If the jupyterhub/binder community can be helpful please let us know!

dsludwig commented 5 years ago

@arokem I've put up some documentation on the setup of https://github.com/dsludwig/example.pangeo.io-deploy

Let me know if there's anything that needs clarification.

arokem commented 5 years ago

Hi @dsludwig : thanks! I will take a look (slowly -- I am traveling a lot in October...).

@choldgraf : Thanks for offering your help! I am sure that we will need it :-)

My road map for progress on this is based on two major use-cases that I would like to explore within a neuro.pangeo.io:

Analysis of diffusion-MRI data from human brain. This is a use-case in which data is stored in a domain-specific, not necessarily cloud-optimized, format. But we've explored use of Dask in this context in the past (http://www.vldb.org/pvldb/vol10/p1226-mehta.pdf). There are multiple interesting publicly available data-sets in this domain that we could immediately link up to. For example: https://healthybrainnetwork.org/
Analysis of time-series from electrophysiological datasets. Data in this domain is often stored in HDF5 and there is potential to be a bit more creative with cloud-optimized storage. We might even find a way to use zarr and xarray. There are some publicly available datasets (e.g., https://registry.opendata.aws/allen-brain-observatory/), and I would want to also explore some data that will be made available through a large-ish collaboration that is based here at UW. Would be good to start with a small proof-of-concept there. @choldgraf : do you have any examples of human electrophysiology processing with Dask/XArray?

The motivation for these two use-cases is two-fold: the first is that I am funded to work on ways to use datasets from both of these experimental modalities in cloud-computing environments. So, I can get to work on these two right away.

The other is that they are sufficiently broad to solve problems for a large community of users, without trying to solve all of the possible problems. I think that there is more than enough work to do in just these two use-cases to keep us busy for a little while.

I think that the first set of experiments that I would like to start doing have to do with the MRI use-case because that is one that I am more personally familiar with. Once I have a proof-of-concept there, I can use that to advocate with my collaborators for experiments with the time-series examples.

Does that answer your question? Is there anything that you would like to do on this, or ideas for other use-cases that we should think about tackling?

choldgraf commented 5 years ago

@arokem some quick thoughts:

I haven't found any good examples of cloud-optimized electrophysiology processing, though @dengemann and @agramfort might have some thoughts (both work with the MNE-python project and I believe have done some work on large-scale analysis of ephys data).

I've often felt like XArray would be a really nice candidate for timeseries in neuroscience...that'd be an interesting project to explore (though I imagine it'd take a few tries before the right data structure would be settled).

agramfort commented 5 years ago

Hi everyone. For MEG/EEG (so non-invasive ephys data), we tend to have custom formats that MNE can read. We have use cases with biggish datasets such as Cam-CAN http://www.cam-can.org/index.php?content=dataset or https://www.mcgill.ca/bic/resources/omega

So far our team still works with big shared servers and NFS disks but we'd love to be more agile on this.

arokem commented 5 years ago

Thanks @choldgraf and @agramfort for the input.

Re: XArray, there is some interesting work here: https://pennmem.github.io/ptsa_new/html/index.html that tried to use XArray as a basis for analysis of EEG data.

I guess we can build off of that with ephys data converted to zarr or HDF5?

rabernat commented 5 years ago

I will just weigh in and say this: if you are converting from a legacy data format to a new format, and you are not already locked into a particular format, I would recommend using zarr instead of HDF5. @mrocklin's blog post summarizes some of the reasons why HDF5 is not the best choice for the cloud. Zarr generally works very well in our experience. The downside is that it is not (yet) widely adopted and therefore somewhat unknown / untrusted. But as more people adopt it, this begins to change...

jhamman commented 5 years ago

@arokem / @choldgraf - if either of you have any questions about how to use xarray or build on top of xarray, we'll be quite happy to engage with you on those points too. It is nice to see PTSA has adopted xarray as well. These domain-specific packages are exactly what we (xarray devs) are hoping to see more of.

arokem commented 5 years ago

@dsludwig : I started taking a look at the example deployment repo.

I am configuring CircleCI and have a couple of comments/questions:

It would be good to give a bit more information about how to set up the contents of the READWRITE_KEY variable. This entails creating a service key on GCE, but it's not clear what roles this service key should have.
It's also not entirely clear how to create the image that will be specified in IMAGE_NAME. In particular, is there a Dockerfile somewhere that specifies the basic docker image for the pangeo deployment?

mrocklin commented 5 years ago

I recommend that, if @dsludwig has time, he just create a neuro.pangeo.io and give @arokem auth over the resulting repository. That should simplify things on @arokem's side so that he can move on to thinking about software environments, building examples, etc.

arokem commented 5 years ago

If making example.pangeo.io-deploy usable by a broader set of users is a goal at this point, I am happy to iterate on this.

mrocklin commented 5 years ago

I appreciate your willingness here. This is a goal, yes, but it's probably less of a goal than getting people engaged building out examples and attracting other folks to the process. There is also plenty for us to learn once we get to that stage, and I think that it's probably a better use of everyone's time.

Once we've done this a couple of times then yes, it might make sense to start looking at how to improve the setup process. I'm inclined not to bottleneck the usage experience on this though while we start out.

mrocklin commented 5 years ago

That being said, if you're interested in the setup process then don't let me stop you. I'm sure that @dsludwig will enjoy the collaboration.

dsludwig commented 5 years ago

@arokem

It would be good to give a bit more information about how to set up the contents of the READWRITE_KEY variable. This entails creating a service key on GCE, but it's not clear what roles this service key should have.

Good point. For clusters that are using the Pangeo allocation, I have created a service account for this: pangeo-automated-deploy. If you're on a separate GKE project, the roles that are required are:

Kubernetes Engine Admin
Storage Admin
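
As a rough sketch of what that means on a separate project, the snippet below only builds and prints the gcloud commands that would create a service account, grant it the two roles above, and download a key. The account name, project ID and key file name are placeholders, and how the key contents should be placed into READWRITE_KEY is not shown here, since that depends on the deploy repository's CI setup.

```python
import shlex

# Role IDs corresponding to the two roles listed above.
REQUIRED_ROLES = {
    "Kubernetes Engine Admin": "roles/container.admin",
    "Storage Admin": "roles/storage.admin",
}


def deploy_key_commands(project, account="neuro-deploy", key_file="key.json"):
    """Return gcloud invocations (as argument lists) for a deploy service key.

    `project`, `account` and `key_file` are placeholders chosen by the caller.
    Nothing is executed; the commands are only assembled.
    """
    email = f"{account}@{project}.iam.gserviceaccount.com"
    commands = [
        ["gcloud", "iam", "service-accounts", "create", account,
         "--project", project],
    ]
    for role in REQUIRED_ROLES.values():
        commands.append([
            "gcloud", "projects", "add-iam-policy-binding", project,
            "--member", f"serviceAccount:{email}",
            "--role", role,
        ])
    commands.append([
        "gcloud", "iam", "service-accounts", "keys", "create", key_file,
        "--iam-account", email,
    ])
    return commands


if __name__ == "__main__":
    # "my-neuro-project" is a made-up project ID.
    for cmd in deploy_key_commands("my-neuro-project"):
        print(shlex.join(cmd))
```

Printing rather than running keeps this safe to try; the commands can be reviewed and then pasted into a shell that is authenticated against the right project.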

It's also not entirely clear how to create the image that will be specified in IMAGE_NAME. In particular, is there a Dockerfile somewhere that specifies the basic docker image for the pangeo deployment?

The image is created from the https://github.com/pangeo-data/example.pangeo.io-deploy/tree/staging/deployments/example.pangeo.io/image subdirectory. It uses repo2docker, the mechanism behind mybinder.org, to create the Docker image. For example, if you need additional conda packages, you modify the file binder/environment.yml. If you want to add examples, you can put them in that directory.

When CI runs, it will generate an image using the name configured in IMAGE_NAME and a label derived from the git hash, and configure the JupyterHub instance to use that image.
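
To make the naming concrete, here is a small illustration of combining IMAGE_NAME with a label taken from the git hash. The exact label scheme the CI uses is not spelled out above, so the 7-character short hash and the example image name are assumptions, and both function names are made up.

```python
import subprocess


def current_commit():
    """Ask git for the full hash of the checked-out commit."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def image_reference(image_name, git_hash, length=7):
    """Combine an image name with a tag cut from the start of a commit hash.

    The short-hash tag is an assumed convention, not necessarily what CI does.
    """
    if len(git_hash) < length:
        raise ValueError("commit hash is shorter than the requested tag length")
    return f"{image_name}:{git_hash[:length]}"


if __name__ == "__main__":
    # Hypothetical image name; in CI this would come from IMAGE_NAME.
    print(image_reference("gcr.io/my-neuro-project/neuro-notebook", current_commit()))
```

The practical upshot is that every commit to the deploy repo yields a distinct image reference, so the hub configuration always points at the environment built from that commit.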

jhamman commented 5 years ago

@arokem and I intersected briefly yesterday. My take is that he's eager to move forward and would be able to provide @dsludwig some useful feedback on the setup process documented in example.pangeo.io. My understanding is that he was thinking of deploying his system on a GCP project other than Pangeo (@arokem - speak up if this is not correct and we can sort out getting you on the Pangeo project). This means he'll need to replicate some of the service account setups and I'm guessing it would be useful to interact with @dsludwig during that part, as well as the CI setup parts.

@arokem - can you update us on where your deployment is currently and what your next steps/timeline are?

arokem commented 5 years ago

Yes. The goal is to deploy our instance on a GCP project separate from the current Pangeo deployment. I think that it would be broadly good to document the process from end to end, and I am happy to serve as the guinea pig for this (but if it's not a priority that's fine too).

Among other reasons, that's because the timeline I have in mind is to have a deployment up and running sometime in the spring. That means that I am not in a huge rush to get things up and running quite yet, and could spend some time debugging. Until then, I am going to try to understand things a bit better and work on some examples and use-cases.

And, as you might understand from my slowness here, my bandwidth to spend time on this is rather limited, so it will have to be a bit slow on my end.

That said -- @dsludwig : I think that those two answers give me what I need and I will try to forge ahead with this information.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

choldgraf commented 5 years ago

@arokem is the bot right? :-)

stale[bot] commented 5 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

arokem commented 5 years ago

I am finally coming back to this. It seems that I can't reopen this issue, but will continue the conversation here.

I am working with the example.pangeo.io-deploy repository, and posted an issue there: https://github.com/pangeo-data/example.pangeo.io-deploy/issues/5.

Is that the best path forward here? I will need to be able to change the image configuration a bit, because we have a slightly different set of dependencies for our use-cases.

jhamman commented 5 years ago

@arokem - it sounds like you have your deployment up and running. Anything else to address here? Should we close this out?

Side note: can we ask you to write a short blog post on https://medium.com/pangeo describing your use case and how pangeo is serving as a platform for neuroscience?

arokem commented 5 years ago

Yes. We can close this one (looks like I can't? No "close" button on my end).

Let's wait with the blog-post for a bit longer, until we have a more interesting story to tell, if that's OK with you. The collaboration has only started using it, and it will take a bit of time for people to really ramp up and do something remarkable with it. Revisit in late spring or early summer?

jhamman commented 5 years ago

Sounds good @arokem. Just putting it on your radar. Cheers!

pangeo-data / pangeo

Setting up neuro.pangeo.io #399