jsignell / intake-blog

Intake blog materials

First pass #1

Closed (martindurant closed this issue 5 years ago)

martindurant commented 5 years ago

The article is nice and clear and demonstrates the functionality well. My main criticism is that it feels like a doc page or tutorial, and someone who isn't specifically looking for this won't be enticed in. The point is, this is a very common problem that people are probably building custom code for all the time, and here is a way to be systematic and describe that information in a concise spec.

I would start with something like:

The text is clearly aimed at cat authors. "How to use it" from an end-user's point of view is a simple "load your catalog entry and let intake do magic for you". Perhaps you can explicitly say that the intake system enables you to easily abstract away messy data storage practices like this, so the users of the data don't need to know about it.

If the intent is to have the notebook run on binder, the article should include a link for this. You would also need an environment.yaml.

More plugins can be made to respect path_as_pattern notation

Actually, you put in work to make this easier for everyone else. True, the loader may make life harder, but in most cases you could apply the new columns after the fact.

There are likely existing issues

Maybe, but I'd say not so likely! This sounds like you think you did a sloppy job, and I do not agree.

it could certainly be extended to parse single files

You mean, as a generic type of text parser? Good idea. It could do that to more than one file, though.

There are some inconsistencies around what is code (with backticks), what are strings (with quotes), and what is emphasised. The one link didn't work because of a space after the square bracket.

martindurant commented 5 years ago

Second pass - these are all suggestions only.

Users load the catalog entry and get back the data with all the fields they need - no iterating, splitting, and stripping.

You get to a full dataset from all the files, together with the information from the file-names, in just two lines of code.
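Something like this sketch could back up that sentence (the catalog path and entry name here are invented, for illustration only):

```python
import intake

# Hypothetical catalog and entry names, for illustration only
cat = intake.open_catalog('catalog.yml')
df = cat.southern_rockies.read()  # one data-frame, file-name fields included
```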

with real landsat data.

With real satellite imagery data from the Landsat project (link)

Remove "(glob notation is only supported in cases where there is an unambiguous directory structure)" and replace with '(a path containing "*" wildcards)'.

to load precipitation data for a number of emissions scenarios and models

For instance, let's suppose that we have a number of CSV files, containing data from a number of [precipitation | emission | whatever this is] models, one per file. The following glob pattern would match all of the files:

In order to capture the data encoded into the names of the files, we can replace the "*" wildcards with field-names, as follows, making what we'll refer to as a path pattern:
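To make the replacement concrete, a sketch like this could follow (the file-name scheme is invented for illustration):

```python
# Invented file-name scheme, for illustration:
#   data/SRLCC_b1_Precip_ccsm.csv
#   data/SRLCC_a1b_Precip_gfdl.csv

glob_pattern = 'data/SRLCC_*_Precip_*.csv'                  # matches all the files
path_pattern = 'data/SRLCC_{emissions}_Precip_{model}.csv'  # "*" replaced by field-names
```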

Remove "and the concept of populating data fields from the path using this pattern is path_as_pattern."

argument path_as_pattern can be used to pass the pattern ...

... we want applied to the filenames
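In action, that wording might look like this (file layout and field names invented, for illustration):

```python
import intake

# path_as_pattern passes the pattern we want applied to the filenames
source = intake.open_csv(
    'data/SRLCC_*_Precip_*.csv',
    path_as_pattern='data/SRLCC_{emissions}_Precip_{model}.csv',
)
df = source.read()  # gains 'emissions' and 'model' columns derived from the paths
```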

can just be a piece of the path, as long as it is unambiguous where the piece starts and stops ('{emissions}Precip{model}', for instance, would not yield the intended outcome).

Didn't follow, not sure why the example doesn't match

but what is happening is more like the reverse of string formatting. You can think of the relationship of the pattern to the format string like the relationship between logs and exponents.

but inverse: the set of arguments required such that pattern.format(**arguments) == path

This method is implemented entirely independently of the path and pattern context by setting up a helper function called reverse_format

The logic is implemented in the function intake.source.utils.reverse_format, which we can use to demonstrate how this works.
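For example, a demonstration along these lines could go right after that sentence (the pattern and path are invented):

```python
from intake.source.utils import reverse_format

pattern = 'data/SRLCC_{emissions}_Precip_{model}.csv'
path = 'data/SRLCC_b1_Precip_ccsm.csv'

arguments = reverse_format(pattern, path)
print(arguments)  # {'emissions': 'b1', 'model': 'ccsm'}

# The defining property: formatting the pattern with the recovered
# arguments reproduces the original path.
assert pattern.format(**arguments) == path
```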

More plugins can be made to respect path_as_pattern notation

using the helper classes provided in Intake. The implementation may depend on the specifics of the third-party library.

it could certainly be extended to parse single files to save users time spent on string stripping and splitting.

however, the function may also be useful for parsing other similarly structured text in general

The categories are knowable in advance and the csv implementation in dask should reflect that.

Where information is encoded in the file-names, such as the examples here, there is an opportunity to filter the set of input files, based on some predicate on the derived fields, before even reading any of the data. For example, the field(s) from the file-names could be part of the Dask data-frame index.
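A sketch of the kind of pre-filtering I mean (names invented, and this is not how the CSV driver currently behaves):

```python
import glob
import dask.dataframe as dd
from intake.source.utils import reverse_format

pattern = 'data/SRLCC_{emissions}_Precip_{model}.csv'

# Apply a predicate to the fields derived from each file-name and keep
# only the matching files -- no data has been read at this point.
selected = [
    path for path in glob.glob('data/SRLCC_*_Precip_*.csv')
    if reverse_format(pattern, path)['emissions'] == 'b1'
]
df = dd.read_csv(selected)
```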

martindurant commented 5 years ago

(I may have more comments on the notebooks)

martindurant commented 5 years ago

Comments on landsat.ipynb

I would add an introduction, since this notebook will appear in a general list of examples, and we cannot assume that people only arrive from the blog: we are demonstrating a new functionality within Intake [link], which can parse and make use of information stored in the filenames of a given data-set. This notebook demonstrates the functionality from the point of view of the end-user/data-scientist: you get the information you want, based on a spec in a catalog file that someone else has done the work to create. No more writing messy loops and parsing code to extract the information yourself. Link to the blog post. The same goes for the CSV notebook; I won't repeat it.

working with landsat data

Say something about what this is: earth imaging in multiple optical bands: red, green (I'm guessing!)

Link for NDVI?

Maybe expand google_landsat_8.description to say where the data comes from. Does it need a copyright notice?

I got the following exception,

ImportError: /srv/conda/lib/python3.6/site-packages/rasterio/../../../libgdal.so.20: undefined symbol: sqlite3_column_table_name

Recommend putting the intake channel last, and conda-forge before defaults; recommend a specific version for rasterio. It requires a specific sqlite for some reason? Note, this worked before.
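Something along these lines for the environment.yaml (the package list is indicative, and any pin is a placeholder, not a known-good version):

```yaml
channels:
  - conda-forge
  - defaults
  - intake      # intake channel last
dependencies:
  - intake
  - rasterio    # pin a specific, known-good version here
  - hvplot
```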

Comments on csv.ipynb

We will use the read method to load the data

into memory, as a Pandas data-frame, in one shot.

This is useful because it makes the data highly visualize-able.

it is a highly efficient representation of the data and takes up minimal memory. It is also more performant for select and groupby operations. Is visualize-able a word? You mean, you get convenient labels in the output?

Link to hvplot and give a few words about why it's great (very simple interface, range of beautiful, interactive plots...). Note that the plots are one-liners.

Make clear that the catalogue author is preparing plot types and arguments as a way to provide quick-look plots to the user.

In the list-of-paths section, you can mention that using intake.open_csv might be the way that a catalog author starts their work, and show the result of print(southern_rockies_list.yaml())
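For example (paths invented):

```python
import intake

# A catalog author might start from an explicit list of paths...
southern_rockies_list = intake.open_csv([
    'data/SRLCC_b1_Precip_ccsm.csv',
    'data/SRLCC_a1b_Precip_gfdl.csv',
])

# ...and use the generated YAML as the seed for a catalog entry.
print(southern_rockies_list.yaml())
```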

Could do with a concluding sentence or so, saying that now you have the catalog, you have a single source of truth for the data-set, no need for copy/paste, and the end-user can get on with their work (link to the other notebook, ./landsat.ipynb).

martindurant commented 5 years ago

Final thoughts: