Datasets module - Githubissues

aazuspan commented 1 year ago

This PR would close #3 by adding a datasets module for loading example data (currently just load_moscow_stjoes).

This dataset implementation follows the API of sklearn datasets very closely. All of the parameters to the the load function work identically, and all attributes present in an sklearn dataset are matched in our dataset. The differences are:

It uses a custom Dataset class rather than a modified dictionary. For people using dot-notation to access dataset attributes, this change should be transparent with the exception of better type hints and auto-completion and a more legible repr.
- EDIT: If we wanted to maximize compatibility, we could allow dictionary access (e.g. dataset["features"]) by implementing a __getitem__ method, but I'm not sure if that's worth the trouble for what's probably a rare use case.
For dataframe outputs, the plot IDs are set as indexes to allow storing and returning plot indexes from kneighbors in #37
This adds an ids attribute that stores the plot IDs for the dataset. @grovduck, let me know whether you think this is worth including. Another option would be to just have users use dataframes if they want the IDs.

This PR also refactors most of the tests to use load_moscow_stjoes to access test data. For the porting tests, it adds a load_moscow_stjoes_results function that returns the dataset with results from yaImpute appended to it. While I was making that change, I also parameterized the port tests to reduce duplication.

grovduck commented 1 year ago

If we wanted to maximize compatibility, we could allow dictionary access (e.g. dataset["features"]) by implementing a __getitem__ method, but I'm not sure if that's worth the trouble for what's probably a rare use case.

Personally, I don't think it's worth the trouble. I think dot syntax is much more intuitive and I never use dictionary access with the load_* functions from sklearn.

This adds an ids attribute that stores the plot IDs for the dataset. @grovduck, let me know whether you think this is worth including. Another option would be to just have users use dataframes if they want the IDs.

I want to make sure I understand the implications here. I don't think there is any reference to the ids attribute anywhere in the codebase besides the testing and that attribute would be specific to the datasets._base.Dataset class, right? My biggest concern (and it may not be that consequential) is that users would see this ids attribute associated with the test data and wonder why their data would not have the same (if not imported in the same way). I think it provides some flexibility for users to work with arrays to instantiate Dataset objects with the flexibility for converting to dataframes if wanted. So I guess I'd vote to keep the ids attribute in here in you don't think it will be too surprising to users.

aazuspan commented 1 year ago

I want to make sure I understand the implications here. I don't think there is any reference to the ids attribute anywhere in the codebase besides the testing and that attribute would be specific to the datasets._base.Dataset class, right?

Right! Essentially, it would just be a way to let users access the plot IDs without using dataframes.

My biggest concern (and it may not be that consequential) is that users would see this ids attribute associated with the test data and wonder why their data would not have the same (if not imported in the same way). I think it provides some flexibility for users to work with arrays to instantiate Dataset objects with the flexibility for converting to dataframes if wanted. So I guess I'd vote to keep the ids attribute in here in you don't think it will be too surprising to users.

I agree with all of that! One last question: do you like ids for the attribute name or should I change it to index or something else? index would be more consistent with pandas and might make the connection with return_dataframe_index in kneighbors more clear. Or we could go more specific with something like plot_id?

grovduck commented 1 year ago

I agree with all of that! One last question: do you like ids for the attribute name or should I change it to index or something else? index would be more consistent with pandas and might make the connection with return_dataframe_index in kneighbors more clear. Or we could go more specific with something like plot_id?

index is totally fine by me - I like the connection to pandas and return_dataframe_index. plot_id is probably way too narrow of a name for other applications (if there is such a thing)!

aazuspan commented 1 year ago

Another one down! Thanks for the helpful reviews :)

grovduck commented 1 year ago

Nice job! I'm definitely feeling a bit guilty on how much work you're doing, but I'm very grateful.

aazuspan commented 1 year ago

Don't be! This is fun enough that I barely consider it work.

And I think we'll be leaning on you for a lot of the upcoming features like dimensionality reduction and RF-NN :wink:

grovduck commented 1 year ago

Yep, hope to be developing again next week. Devoting a couple of days to it without interruption!

lemma-osu / sknnr

Datasets module #39