Closed aazuspan closed 1 year ago
If we wanted to maximize compatibility, we could allow dictionary access (e.g. dataset["features"]) by implementing a
__getitem__
method, but I'm not sure if that's worth the trouble for what's probably a rare use case.
Personally, I don't think it's worth the trouble. I think dot syntax is much more intuitive and I never use dictionary access with the load_*
functions from sklearn.
This adds an ids attribute that stores the plot IDs for the dataset. @grovduck, let me know whether you think this is worth including. Another option would be to just have users use dataframes if they want the IDs.
I want to make sure I understand the implications here. I don't think there is any reference to the ids
attribute anywhere in the codebase besides the testing and that attribute would be specific to the datasets._base.Dataset
class, right? My biggest concern (and it may not be that consequential) is that users would see this ids
attribute associated with the test data and wonder why their data would not have the same (if not imported in the same way). I think it provides some flexibility for users to work with arrays to instantiate Dataset
objects with the flexibility for converting to dataframes if wanted. So I guess I'd vote to keep the ids
attribute in here in you don't think it will be too surprising to users.
I want to make sure I understand the implications here. I don't think there is any reference to the ids attribute anywhere in the codebase besides the testing and that attribute would be specific to the datasets._base.Dataset class, right?
Right! Essentially, it would just be a way to let users access the plot IDs without using dataframes.
My biggest concern (and it may not be that consequential) is that users would see this ids attribute associated with the test data and wonder why their data would not have the same (if not imported in the same way). I think it provides some flexibility for users to work with arrays to instantiate Dataset objects with the flexibility for converting to dataframes if wanted. So I guess I'd vote to keep the ids attribute in here in you don't think it will be too surprising to users.
I agree with all of that! One last question: do you like ids
for the attribute name or should I change it to index
or something else? index
would be more consistent with pandas
and might make the connection with return_dataframe_index
in kneighbors
more clear. Or we could go more specific with something like plot_id
?
I agree with all of that! One last question: do you like
ids
for the attribute name or should I change it toindex
or something else?index
would be more consistent withpandas
and might make the connection withreturn_dataframe_index
inkneighbors
more clear. Or we could go more specific with something likeplot_id
?
index
is totally fine by me - I like the connection to pandas
and return_dataframe_index
. plot_id
is probably way too narrow of a name for other applications (if there is such a thing)!
Another one down! Thanks for the helpful reviews :)
Nice job! I'm definitely feeling a bit guilty on how much work you're doing, but I'm very grateful.
Don't be! This is fun enough that I barely consider it work.
And I think we'll be leaning on you for a lot of the upcoming features like dimensionality reduction and RF-NN :wink:
Yep, hope to be developing again next week. Devoting a couple of days to it without interruption!
This PR would close #3 by adding a
datasets
module for loading example data (currently justload_moscow_stjoes
).This dataset implementation follows the API of sklearn datasets very closely. All of the parameters to the the load function work identically, and all attributes present in an sklearn dataset are matched in our dataset. The differences are:
Dataset
class rather than a modified dictionary. For people using dot-notation to access dataset attributes, this change should be transparent with the exception of better type hints and auto-completion and a more legible repr.dataset["features"]
) by implementing a__getitem__
method, but I'm not sure if that's worth the trouble for what's probably a rare use case.kneighbors
in #37ids
attribute that stores the plot IDs for the dataset. @grovduck, let me know whether you think this is worth including. Another option would be to just have users use dataframes if they want the IDs.This PR also refactors most of the tests to use
load_moscow_stjoes
to access test data. For the porting tests, it adds aload_moscow_stjoes_results
function that returns the dataset with results from yaImpute appended to it. While I was making that change, I also parameterized the port tests to reduce duplication.