pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

xarray tutorial at SciPy 2018? #1882

Closed rabernat closed 6 years ago

rabernat commented 6 years ago

It would be great to hold an xarray tutorial at SciPy 2018. Xarray has matured a lot recently, and it would be great to raise awareness of what it can do among the broader scipy community.

From the conference website:

Tutorials should be focused on covering a well-defined topic in a hands-on manner. We want to see attendees coding! We encourage submissions to be designed to allow at least 50% of the time for hands-on exercises even if this means the subject matter needs to be limited. Tutorials will be 4 hours in duration. In your tutorial application, you can indicate what prerequisite skills and knowledge will be needed for your tutorial, and the approximate expected level of knowledge of your students (i.e., beginner, intermediate, advanced).

I'm curious if anyone was already planning on submitting a tutorial. If not, let's put together a team. @jhamman has indicated interest in participating in, but not leading, the tutorial. Anyone else interested?

xref pangeo-data/pangeo#97

benbovy commented 6 years ago

I plan to attend SciPy this year and I'd be happy to join in.

rabernat commented 6 years ago

What is needed is someone to lead the tutorial team (including submitting the application). You will clearly have help from @jhamman and @benbovy in presenting the tutorial. You will also get a stipend of $1000!

I can't volunteer because I'm not sure I'll even be able to attend the conference.

jhamman commented 6 years ago

If I can get some help putting together the materials, I can lead the tutorial. @kmpaul and I are already preparing an Xarray tutorial for April (link), so really we just need to adapt what we come up with for that to be useful for SciPy.

mjbrodzik commented 6 years ago

I would be happy to assist. I have asked for travel funds to attend SciPy this year, but have not yet gotten approval. Even if I can't go I could assist with review of materials if you are interested in feedback. If I can go, I could certainly work as a tutorial helper.

ktyle commented 6 years ago

Although not active on the Xarray GitHub, I am an early adopter and active user of the software and am looking for a good excuse to go to SciPy for the first time... I would be glad to assist!

jhamman commented 6 years ago

I submitted an abstract for an xarray tutorial today. More information to come as we get closer to the conference but for now the title: Xarray for Scalable Scientific Data Analysis. Stay tuned!

gajomi commented 6 years ago

"Xarray for Scalable Scientific Data Analysis"

Nice title!

I know xarray has its origins and most of its current users in the earth science domains, and so I would expect much of the core of an xarray tutorial to involve various geo* flavored data, but since SciPy has attendees from so many different backgrounds it could be useful to try to survey the scope of work being done with xarray right now. I imagine there must be other users in astronomy, physics, biology and perhaps even quantitative civics/demography that could have interesting snippets to share.

For my part, I am using xarray to work with microscopy data in a biological context, and would be happy to share a snippet or two.

rabernat commented 6 years ago

I don't think xarray has caught on in astronomy and I'm curious why. From an outsider's perspective, it seems ideal for astronomy data.

Maybe because they already have things like astropy and yt?

jhamman commented 6 years ago

@jakevdp - do you know of any astronomy applications of xarray?

jakevdp commented 6 years ago

I've not seen any... I think the main reason is that the field embraced Python years before xarray was created, so there were already workable solutions in place.

fujiisoup commented 6 years ago

My colleague in astronomy said that his usual data format has been a small set of images taken with long exposure times, and that he didn't need to deal with big data until recently. I am not sure this is generally true for the field of astronomy. However, one of the recent trends in astrophysics is definitely the combination of statistics with huge numbers of measurements, such as thousands of images constantly taken by telescopes. I suspect xarray could play a larger role in this field too (I am also an outsider though...).

fujiisoup commented 6 years ago

For my part, I am working in the nuclear fusion field, where we have many kinds of high-dimensional measurement data. The size of each measurement is not so huge, but we have a great many kinds of data taken on different coordinates. xarray also fits such situations. (I am also happy to share my snippets, but my data is not big and I am not sure this fits the tutorial concept.)

xarray certainly helps me a lot, but I don't hear of anyone around me using xarray. It might be for historical reasons (many are still using commercial software such as IDE). I think there is a certain market in my field as well.

GiorgioBalestrieri commented 6 years ago

@jhamman will your tutorial at UCAR be available online at some point? Are you still planning to present at SciPy 2018? I'm not part of the dev team but I really think it would be great to have a proper video with a tutorial and the related repo. It would really help getting people aware of/excited about xarray!

jhamman commented 6 years ago

@GiorgioBalestrieri - Yes. Both tutorials will be made available. Stay tuned!

jhamman commented 6 years ago

We heard over the weekend that the xarray tutorial was not selected for Scipy 2018. From reading the reviewer comments, it sounds like we (mostly I) did not provide a sufficient outline of topics to fully describe what the tutorial would cover. This seems mostly like a misunderstanding on my part as to the expected level of detail in the abstract.

In the hopes that we'll be able to get a slot for one of these in the next few years, I'll post both the abstract and review comments here.

Abstract:

Xarray provides data structures for N-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables, such as those that occur in the disciplines of earth science, astronomy, and finance. Xarray combines the power of labeled data structures from Pandas, with the N-dimensional arrays from Numpy and parallel out-of-core computation from Dask, to provide an intuitive and powerful platform for scientific analysis of large multi-dimensional datasets.

This tutorial introduces data scientists who may already be familiar with Numpy or Pandas to the Xarray data model and tool kit. Following an introduction to Xarray, we will introduce tools for scaling real-world scientific data analysis workflows using Xarray and Dask. Students will leave this tutorial with 1) a comprehensive understanding of the Xarray data model, 2) the ability to apply the Xarray tool kit to analysis workflows that fit in memory, and 3) the ability to scale those same workflows to datasets that are much too large to fit into memory (GBs to TBs). Participants are expected to have some familiarity with Jupyter, Numpy, and Pandas.

Links: http://xarray.pydata.org, http://dask.pydata.org

Short Description of the Tutorial:

Xarray provides data structures for N-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the power of labeled data structures from Pandas, with the N-dimensional arrays from Numpy and parallel out-of-core computation from Dask, to provide an intuitive and powerful platform for scientific analysis of large multi-dimensional datasets. This tutorial introduces data scientists who may already be familiar with Numpy or Pandas to the Xarray package. We will guide participants through the process of scaling Xarray computations from small to big data science workflows.
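As a rough illustration of the labeled-array model described above (this sketch was not part of the submitted proposal; the variable names, coordinates, and synthetic values are assumptions chosen for the example), selecting by coordinate label and reducing over a named dimension might look like this:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data (illustrative only): 4 time steps on a 2 x 3 lat/lon grid.
values = np.arange(24, dtype=float).reshape(4, 2, 3)
temperature = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2018-01-01", periods=4),
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0],
    },
    name="temperature",
)

# Select by coordinate label rather than by integer position.
point = temperature.sel(lat=20.0, lon=110.0)

# Reduce over a named dimension instead of an axis number.
time_mean = temperature.mean(dim="time")
```

Here `point` is a one-dimensional series along `time`, and `time_mean` keeps only the `lat` and `lon` dimensions. The same expressions are intended to carry over to Dask-backed arrays for larger-than-memory data, which is the scaling step the abstract refers to.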

Review Comments:

Dear Joseph, We didn't select your tutorial, "Xarray for Scalable Scientific Data Analysis", for SciPy 2018, but we would like to wholeheartedly thank you for your submission. The proposals were exceptionally good this year. We received 55 applications and had only 24 spots. We made our selection based on the reviewers' feedback and the likely popularity of the tutorial. We also made a few tough calls to ensure a good diversity of topics and presenters.

Below is the raw feedback from the reviewers who looked at your application. We hope it'll prove useful and we look forward to receiving another proposal from you next year.

Best regards, Alex, Ben & Mike. SciPy 2018 Tutorials Committee

----------------------- REVIEW 1 ---------------------
PAPER: 70
TITLE: Xarray for Scalable Scientific Data Analysis
AUTHORS: Joseph Hamman

I do not have a conflict of interest.: yes
Overall evaluation: 2 (accept)
What level of interest do you think this tutorial will generate?: 3 (Widespread appeal)

----------- Overall evaluation -----------
This seems like a very promising presentation, quite apt for attendees of the SciPy conference, about an underappreciated tool for scientific Python. It also seems like the proposer is well positioned to be the instructor for such a tutorial. However, I am concerned about the lack of specific information about what topics will be presented in what order and with what coding exercises and duration. I worry the proposer has not yet developed details about what would be presented. Right now this sounds more like an hour-long tutorial than a 3- or 4-hour-long tutorial. It would be good if the presenter could develop a more detailed outline, also so that potential attendees have a better idea of what to expect.

----------------------- REVIEW 2 ---------------------
PAPER: 70
TITLE: Xarray for Scalable Scientific Data Analysis
AUTHORS: Joseph Hamman

I do not have a conflict of interest.: yes
Overall evaluation: 1 (weak accept)
What level of interest do you think this tutorial will generate?: 3 (Widespread appeal)

----------- Overall evaluation -----------
I don't have experience with xarray, but I believe that it is revolutionary software that will change how scientists use Python to analyze data. I spent way too much time doing nasty tricks with numpy to analyze hyperspectral data in my olden days as a scientist, and I'm convinced that xarray will save people that time today. I would really want past me to see this tutorial.

That said, this proposal needs much more detail. What's your time schedule? What example data will you use? What specific functionality of xarray will you be showing (and when), and how do these examples build on each other? Based on the author's background, I trust that this tutorial will turn out fine if it is accepted, but this proposal would be stronger with more detail. I'm recommending that this tutorial be accepted on the solid foundation of the xarray project, the value that I think that project presents, and some trust based on Joseph's background.

The setup instructions are good - I am excited that cloud-based JupyterHub will remove the need for local installation. I do think that for your local installation guide, you'd be better off specifying version ranges for each dependency, or linking to some environment specification file, in case APIs change between now and the conference. Not a big deal - people can update or downgrade as necessary - it just saves some confusion.

----------------------- REVIEW 3 ---------------------
PAPER: 70
TITLE: Xarray for Scalable Scientific Data Analysis
AUTHORS: Joseph Hamman

I do not have a conflict of interest.: yes
Overall evaluation: 2 (accept)
What level of interest do you think this tutorial will generate?: 3 (Widespread appeal)

----------- Overall evaluation -----------
This seems like a very strong tutorial, hopefully with broad appeal to both traditional and data scientists.

----------------------- REVIEW 4 ---------------------
PAPER: 70
TITLE: Xarray for Scalable Scientific Data Analysis
AUTHORS: Joseph Hamman

I do not have a conflict of interest.: yes
Overall evaluation: 1 (weak accept)
What level of interest do you think this tutorial will generate?: 3 (Widespread appeal)

----------- Overall evaluation -----------
I know there is interest in xarray and that it is playing a role in solving hard problems. I would have liked to see a more detailed outline, with plans for exercises and timing information.

----------------------- REVIEW 5 ---------------------
PAPER: 70
TITLE: Xarray for Scalable Scientific Data Analysis
AUTHORS: Joseph Hamman

I do not have a conflict of interest.: yes
Overall evaluation: 1 (weak accept)
What level of interest do you think this tutorial will generate?: 3 (Widespread appeal)

----------- Overall evaluation -----------
XArray is a fantastic project, and I have been very interested in seeing it grow and gain wider acceptance. The tool can be tricky to use right, and the documentation can be sparse in some places. So, a tutorial would be extremely valuable here. However, the proposal is very lacking in details, which makes it difficult to judge whether or not there is a 4-hour tutorial (I am sure there is), and also makes it hard for a potential participant to judge whether this tutorial is for them or not.

Note: while I don't have a conflict of interest with the author, I am acknowledged in a paper about XArray as an early contributor.

rabernat commented 6 years ago

Bummer Joe! It sounds like you had a great proposal, but maybe the instructions for the tutorial abstract weren’t clear enough in terms of the detail required.

We should not get discouraged and instead set our sights on other opportunities (including scipy 2019) for presenting xarray tutorials. Recent comments on the mailing list suggest that a comprehensive xarray tutorial is something our community really needs.

fmaussion commented 6 years ago

I'm actually impressed by the quality and number of reviews, good for Scipy!

I feel particularly concerned about:

"The tool can be tricky to use right, and the documentation can be sparse in some places."

I would hope to meliorate this whenever possible.