icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
203 stars · 101 forks

Remove intake catalog from Read module #438

Closed · rwegener2 closed this 10 months ago

rwegener2 commented 11 months ago

Goal

Remove the intake library from the Read module. The primary reason for doing this is that intake can't read s3 URLs, so we'd like to rely on xarray alone for data reading. An additional benefit is the possibility of simplifying the read workflow.
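For context, a minimal sketch of the kind of s3 read we're aiming for (the bucket path and group below are hypothetical, and credential handling is omitted):

```python
import s3fs
import xarray as xr

# Hypothetical s3 url for an ICESat-2 granule; the real bucket/key will differ
s3url = "s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/ATL06_example.h5"

# intake can't open this, but xarray can read from the file-like object
# that s3fs provides (assuming valid Earthdata cloud credentials)
fs = s3fs.S3FileSystem(anon=False)
with fs.open(s3url, "rb") as f:
    ds = xr.open_dataset(f, group="gt1l/land_ice_segments", engine="h5netcdf")
```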

How it was done

As of the WIP version, the intake module was simply taken out. I'm left feeling like I'm missing something, because it doesn't seem like I need to add any functionality back in; I just took a bunch of stuff out.

There were previously two functions in is2cat.py: _pattern_to_glob was moved into read.py as a standalone function (sketched below), and build_catalog was deleted entirely.
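For illustration, the pattern-to-glob idea looks roughly like this (a simplified sketch, not the actual icepyx implementation):

```python
import re

def pattern_to_glob_sketch(pattern):
    """Turn a format-string filename pattern into a glob string."""
    # Fixed-width fields like {rgt:4} become that many single-character wildcards
    out = re.sub(r"\{[^{}:]+:(\d+)\}", lambda m: "?" * int(m.group(1)), pattern)
    # Any remaining fields, e.g. {datetime:%Y%m%d%H%M%S}, become '*'
    out = re.sub(r"\{[^{}]+\}", "*", out)
    return out

# e.g. 'ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{version:3}.h5' -> 'ATL??_*_???.h5'
```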

How it can be tested

I'm running the code from the Quick start guide of the Read docs with some local data (ATL06).
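For reference, the quick-start flow I'm running is roughly the following (the local path is hypothetical; the pattern follows the standard ICESat-2 file naming):

```python
import icepyx as ipx

# Hypothetical local path to a directory of ATL06 granules
path_root = "/full/path/to/your/ATL06/data/"
pattern = "ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"

reader = ipx.Read(data_source=path_root, product="ATL06", filename_pattern=pattern)
reader.vars.append(var_list=["h_li", "latitude", "longitude"])
ds = reader.load()
```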

Todo

github-actions[bot] commented 11 months ago

Binder :point_left: Launch a binder notebook on this branch for commit fcc419d030f3c38655e287d3cbb2db0d501c9ed2

I will automatically update this comment whenever this PR is modified

rwegener2 commented 11 months ago

Hey there icepyx crew 👋🏻 After looking into this a bit I have a few questions about the Read module.

  1. It seems like there are multiple ways in which the product or version of the data product is checked [UPDATE: addressed in next comment]:

    • product is a required input
    • In Read._check_source_for_pattern we check the filenames for a consistent pattern using _pattern_to_glob
    • There is an assert ( file_product == self._prod ) check in _build_single_file_dataset

    Do we need all of these? Or, the bigger-picture question is: what do we most trust to tell us the product of the file? I would make the case that the best thing to trust is the metadata inside the file, in which case we can remove the user input, the assertion, and a lot of the filename pattern checking. The only thing we would need to keep is some code that does the proper checks if a list of files is given as the input (see next question).

  2. On the topic of multiple input files, I didn't realize that icepyx accepts multiple files. Do we want users to be able to give a list with different products in it, or is the goal to enforce that all the files in the list are the same product?

  3. Finally, I'm looking at the failing tests. What is the purpose of the test_check_run_fast_scandir test in test_read.py? It looks to me like it's checking that all the modules exist in the library, but I'm not sure why that would be needed. Have there been import problems in the past?

cc @JessicaS11

rwegener2 commented 11 months ago

I went ahead and experimented with taking out the product argument, instead reading the product from the file. This addresses point 1 from the previous comment.
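The idea, roughly (the exact metadata attribute path below is an assumption on my part and may differ from the final code):

```python
import h5py

def product_from_file(filepath):
    # Assumption: ICESat-2 granules record the product short name under
    # /METADATA/DatasetIdentification; the attribute name may differ
    with h5py.File(filepath, "r") as f:
        short_name = f["METADATA/DatasetIdentification"].attrs["shortName"]
    # h5py may return bytes, so normalize to str
    return short_name.decode() if isinstance(short_name, bytes) else str(short_name)
```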

The commit I just pushed works so long as the data_source argument is a single file. To continue, I'm considering point 2 from the previous comment again: what are the constraints if the user provides a directory? Are mixed file products allowed?

On a related note, if we move forward with pulling the product from the metadata we don't need to do such rigorous checks on the filename pattern. I like that the existence of filename_pattern allows a user to provide multiple files; I wonder if we could use the data_source argument to maintain this functionality. data_source could be either a string or a list, and the string could be run through Python's glob module to collect all matching files (rough sketch below). Thoughts?
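A rough sketch of the dispatch I'm imagining (names here are illustrative, not a final API):

```python
import glob
from pathlib import Path

def files_from_source(data_source):
    # data_source may be a list of paths, a directory, or a glob-able string
    if isinstance(data_source, (list, tuple)):
        return [str(p) for p in data_source]
    source = Path(data_source)
    if source.is_dir():
        return sorted(str(p) for p in source.glob("*.h5"))
    # Otherwise treat the string itself as a glob pattern
    return sorted(glob.glob(str(data_source)))
```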

JessicaS11 commented 11 months ago

> 1. It seems like there are multiple ways in which the product or version of the data product is checked:
>
> * `product` is a required input
> * In `Read._check_source_for_pattern` we check the filenames for a consistent pattern using `_pattern_to_glob`
> * There is an `assert ( file_product == self._prod )` check in `_build_single_file_dataset`.
>
> Do we need all of these? Or, the bigger-picture question is: what do we most trust to tell us the product of the file? I would make the case that the best thing to trust is the metadata inside the file, in which case we can remove the user input, the assertion, and a lot of the filename pattern checking. The only thing we would need to keep is some code that does the proper checks if a list of files is given as the input (see next question).

Initially I thought maybe, but now that I'm done writing this paragraph I think probably not (happy to hear suggestions to the contrary). product is a required input because if the user has directories with multiple products' worth of ICESat-2 granules, we need to know which ones they are trying to open (and there is a limitation that they can only open one product type per object, since some of the products have such different group/variable structures). Then, Read._check_source_for_pattern is used to actually identify which files match the product the user specified and collect a list of those (so the user doesn't have to create this list). Finally, the check in _build_single_file_dataset compares the expected product (self._prod) with the file metadata (file_product) to confirm that the data in the file actually is the product indicated by the filename (in case the user renamed the file incorrectly).

> 2. On the topic of multiple input files, I didn't realize that icepyx accepts multiple files. Do we want the users to be able to give a list with different products in it, or is the goal to enforce that all the files in the list should be the same product?

Yes - combining multiple granules is what makes the read module so powerful (and complicated). It will take all your same-product files and spit you out a single object with them all combined (so we unimpose the arbitrary spatial and temporal boundaries of granule/file storage)! If the user wants to read in multiple products, we'd need to handle each product separately internally because there is such variation in the group/variable structure and naming and (most likely, but I could be wrong) return to them an object per product, which they'll have to combine intelligently. Ultimately it would be cool to have this capability built in; I wonder what the demand for this type of workflow is versus other development goals.
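As a loose illustration of the combining step (file paths are hypothetical, and the actual merge logic in the Read module is more involved):

```python
import xarray as xr

file_list = ["ATL06_granule_1.h5", "ATL06_granule_2.h5"]  # hypothetical

# Open the same beam group from each same-product granule
datasets = [
    xr.open_dataset(path, group="gt1l/land_ice_segments", engine="h5netcdf")
    for path in file_list
]

# Concatenate along the (assumed) shared time dimension so the per-file
# boundaries disappear and the user gets back a single object
combined = xr.concat(datasets, dim="delta_time")
```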

> 3. Finally, I'm looking at the failing tests. What is the purpose of the `test_check_run_fast_scandir` test in `test_read.py`? It looks to me like it's checking that all the modules exist in the library, but I'm not sure why that would be needed. Have there been import problems in the past?

The _check_run_fast_scandir function is what combs the user's file system (based on the directory supplied and any subdirectories) to generate the list of files to read in. For the purposes of testing, I set it loose on icepyx's internal files, because they are a reliable set of directories and files that always exist in the testing environment. Someday, when we have a set of test files available, it would make sense to use those for the test instead... it seemed like one of those cases where this provided a viable way to exercise the function itself, even if the results are meaningless. Would it be better to just not test the function in a situation like this?
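For anyone following along, the scandir approach is roughly the classic recursive recipe (a sketch, not the exact icepyx function):

```python
import os

def fast_scandir_sketch(directory, ext=".py"):
    """Recursively collect subfolders and files with a given extension."""
    subfolders, files = [], []
    for entry in os.scandir(directory):
        if entry.is_dir():
            subfolders.append(entry.path)
        elif entry.is_file() and os.path.splitext(entry.name)[1] == ext:
            files.append(entry.path)
    # Recurse into each immediate subfolder and accumulate results
    for sub in list(subfolders):
        more_subfolders, more_files = fast_scandir_sketch(sub, ext)
        subfolders.extend(more_subfolders)
        files.extend(more_files)
    return subfolders, files
```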

rwegener2 commented 11 months ago

Thanks for the detailed comment @JessicaS11. It's helpful (and really cool that icepyx allows users to pull multiple granules together!). I agree that allowing mixed product types sounds very difficult, and in some cases it might let a user do something that doesn't even make sense.

So, if I'm understanding, a summary of the two options is:

| | Existing flow | Drop `product` argument; accept a list or glob pattern via `data_source` |
| --- | --- | --- |
| Arguments into `Read()` | (4) `data_source`, `product`, `filename_pattern`, `out_obj_type` | (2) `data_source`, `out_obj_type` |
| Pros | 1) a user can easily keep multiple products in the same folder, 2) doesn't disrupt current code | 1) doesn't require the user to maintain the default filename, 2) fewer input arguments (both for the user and for our internal bookkeeping) |

In either case the user could store their data however they want; we are just changing the scenario under which a filename_pattern/list has to be created by the user. (For example, in the existing setup they can change the filename, but they have to give a filename_pattern. In the drop-product option a user could keep multiple products in the same folder, but they would have to provide either a list or a glob-able string to data_source.)
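To make the two options concrete (paths are hypothetical, and the proposed signature is a sketch, not a settled API):

```python
import icepyx as ipx

# Existing flow: product and filename_pattern are explicit inputs
reader = ipx.Read(
    data_source="/my/data/",
    product="ATL06",
    filename_pattern="ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5",
)

# Proposed flow: one glob-able data_source; product comes from file metadata
reader = ipx.Read(data_source="/my/data/ATL06_*.h5")
```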

Am I missing anything? I could go either way. I sort of like fewer inputs to keep track of for both users and developers, but I also want to properly hesitate before proposing a bigger change like that.

@JessicaS11 If at any point you think it'd be easier to jump on a quick call and brainstorm pros/cons I'm happy to do that.

rwegener2 commented 11 months ago

To summarize a discussion @JessicaS11 and I just had on a call, proposed changes are:

  1. Combine the filename_pattern and data_source arguments. The single argument may be either a string (a filepath or a glob pattern), a list of files, or a Query object. When the cloud reading feature is built, glob strings will not be allowed on s3 filepaths (see the sketch after this list).
  2. We are no longer going to require the product argument from the user. It will still be optional, both for the purposes of deprecation and for the event that a user provides a list of files that do not all contain the same product.
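A quick sketch of the s3 constraint from point 1 (function and argument names are illustrative):

```python
def validate_data_source(data_source):
    # Per the proposal, glob strings will not be allowed on s3 filepaths
    if isinstance(data_source, str) and data_source.startswith("s3://"):
        if any(wildcard in data_source for wildcard in "*?[]"):
            raise ValueError("glob patterns are not supported for s3 paths")
    return data_source
```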

@JessicaS11 made the very excellent observation that these two changes could be addressed in a separate PR, so I'm going to wrap this one up and address these comments in the next one.

review-notebook-app[bot] commented 11 months ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



rwegener2 commented 11 months ago

Alright, this is ready for review! I ran the IS2_data_read-in notebook as a simple test. I had to skip or tweak some of the cells to get things to run without error (e.g. using my own filepath, or not running the second-to-last line, which removes all the variables), but overall it seems to be working. I'll be excited to hear feedback!

Not sure who the best reviewer is so I requested you both @weiji14 and @JessicaS11. Let me know if I should be doing it another way.

rwegener2 commented 11 months ago

> Could you also remove intake (and intake-xarray?) from requirements.txt

Ah, good find! Thanks for this comment, and the other ones, @weiji14.