Open RicardoJSandoval opened 2 years ago
Hi Ricardo, this is a hugely valuable effort. Thanks so much!
I just had a look at the zip file that you link. This generally all looks good. There are a few higher level points to discuss and then some suggestions and minor comments.
Higher-level points:
There's a balance to strike between the generality of the code and the needs of the typical machine learning researcher. When changing the API, we should keep this in mind. A typical ML grad student should be able to create their own data sources quickly without too much of a learning curve. Otherwise instead of using folktables, researchers will go and write their own ad-hoc code.
Concrete suggestions:
FilesResource
abstraction, collapsing download_utils.py and load_utils.py into one. DataSource(Protocol)
will have to know what the async
keyword in async def _download_data
entails? If so, this might provide a bit of an extra hurdle to creating new data sources. Minor comments:
files_resources.py
has documentation that mention SIPP. I assume these should be general. Likewise acs_data_source.py
has a few methods were SIPP shows up in the doc string.Paging @francesding @millerjohnp @ludwigschmidt since this touches on some major design choices. Please let us know if you have any comments or concerns.
Hi Ricardo,
Thanks for integrating SIPP, it will be a great addition to folktables!
A few points below:
As Moritz said, it would be good to make this a real pull request (as opposed to linking to a zip file). That makes it easier to see the code changes in the Github UI, comment on individual lines, etc.
Sharing code between the ACS download routines and SIPP is great. What I don't fully understand right now is how a future contributor should use these building blocks. For instance, what is the purpose of _load_data
and _download_data
in the DataSource
interface? If these functions can be called from code external to DataSource
, why do the names start with an underscore? If the functions are only for use internal to DataSource
implementations, why are they part of the interface?
Asynchronous downloads are great to increase download speed if all the resulting complexity is handled in the DataSource API. I think users of DataSource should not have to know about asynchronous Python since it can be a bit complicated.
The utils
folder should contain only code that is independent of a specific data source. The documentation string in files_resources.py
says "Stores information needed to (down)load the SIPP dataset." Should this be in one of the SIPP files?
Re-using code is good, but there is also a cost to creating abstractions. For instance, I'm not sure if we need the load_json
function in load_utils.py
. The amount of code we save is only one line, and the two lines in the function for loading a JSON are relatively easy to understand. Introducing a new function makes the code harder to read for others who don't know exactly what the function is doing but would understand the two common Python lines for loading a JSON.
Hi, thanks for the proposal and initial changes! Beyond what others have mentioned, just want to understand if this is targeting a new dataset or collection of datasets? Or is there a particular part of the current interface you find confusing or annoying?
I think it'd be easier to reason about the costs/benefits of interface changes if there's actually a concrete target. Additionally, I like the lightweight interface, and I think it only really makes sense to add a lot of additional complexity and spend dev time implementing and maintaining something it if we're actually getting a new resource or improving quality of life for users.
Hi all,
Thanks for all the feedback and comments. I'll address all of them here.
The main idea behind the changes I've made is to keep the user facing API intact while changing the way in which the implementation is done. This is in order to make it easier to integrate other datasets in the future. Hence, the example you all list in the README would still be valid:
data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["CA"], download=True)
The way to use the SIPPDataSource
is very similar, the only difference being the parameters you have to pass to the constructor and to get_data
.
I started to work on these changes when I started implementing the functionality to (down)load the SIPP dataset. I found a lot of the existing code to be useful but couldn't access it for my purposes as it was tailored for the ACS dataset. Hence, I went ahead and refactored the existing code to make it more modular and reusable.
FilesResource
You all are right; SIPP shouldn't be mentioned at all in files_resource.py
as FilesResource
is agnostic of any dataset/datasource. Rather, it's just supposed to contain information that till be useful to (down)load any dataset. I'll go ahead and remove any mention of the SIPP dataset from this file.
I originally thought of having the DownloadResource
and LoadResource
abstractions as they help separate the behavior of the core functionalities the DataSource
provides. However, I don’t mind getting rid of them and just assigning all the fields to FilesResource
.
Yes, the idea is that all asynchronous downloads are handled internally and that the implementation details are hidden from the user. You can use the ACSDataSource
without having to worry about what async
and await
entails. This comment made me rethink the design and I think it would probably be best to keep the DataSource
objects free from any async
code and push it all (if we end up deciding to do so) to utils.download_utils.py
.
There is, however, a caveat with asyncio
’s to_thread
coroutine: it was introduced in Python's version 3.9, which means that it might limit some backwards compatibility.
DataSource
’s _load_data
and _download_data
My idea of including _load_data
and _download_data
in DataSource
was to make it explicit that these are the two core functionalities that DataSource
objects provide (and it also made it easier for me to divide the code into these functions). However, we can certainly remove them from this protocol class as they don’t really add much.
load_json
This is a good point, Ludwig. I can go ahead and remove this function and directly use the JSON loading code in the rest of the code.
Following what Moritz and Ludwig proposed, I’ll go ahead and break this up into a couple of pull requests:
utils
directory as I think this is the easiest to reason about, since it doesn’t depend on any other code in folktables
and it’s the building block for future pull requests (it also won’t affect any of the code you all currently have).utils
directory, I can go ahead and push the code for DataSource
and the refactored ACSDataSource
.SIPPDataSource
.Please let me know what you all think about this plan and if you all have any other questions!
SGTM - Thanks so much!
Sounds great to me! Re: asynchronous downloads, it might be nice to maintain some flexibility with python versions. concurrent.futures
might be one alternative to asyncio
Regarding Python versions: apart from the async
part, the switch from ABC
to Protocol
also requires at least Python 3.8 because Protocol
was introduced then. I'm not sure what our policy on Python versions should be. According to our unit tests, we currently support 3.7 onwards. Python 3.8 was released in October 2019, so it may well be that it's widely used by now and doesn't cause trouble for users of folktables. Any thoughts?
Also just to clarify my comment regarding _load_data
and _download_data
. If we have a use case where code external to DataSource
needs to call these functions, we can certainly introduce them to the interface. Based on my understanding of the code right now, the two functions are only called by DataSource
objects internally. If that's the case, we don't need to add the functions to the externally visible interface. As John mentioned, it's nice to have simple interfaces and hide the complexity inside implementations. But I may also just be misunderstanding the re-organized code right now :-)
If we have a use case for extending the DataSource
interface to include the two new functions, it would be good to rename the functions and remove the leading underscore. Python convention is to use leading underscores for internal function names, e.g., see https://stackoverflow.com/questions/1301346/what-is-the-meaning-of-single-and-double-underscore-before-an-object-name .
Thank you for making the changes!
@ludwigschmidt You are right, my idea for the _load_data
and _download_data
methods is for them to only be called internally by DataSource
objects. I'll go ahead and remove them from the externally visible interface.
I'll also go ahead and revert back to using ABC
instead of protocol in order to still be able to support Python 3.7, and will switch from using asyncio
to concurrent.futures
.
I'll go ahead and send the first pull request shortly.
Thanks Ricardo for this work! The revised changes sound good to me.
Thanks, Frances
On Mon, Aug 1, 2022 at 2:17 PM Ricardo Sandoval @.***> wrote:
@ludwigschmidt https://github.com/ludwigschmidt You are right, my idea for the _load_data and _download_data methods is for them to only be called internally by DataSource objects. I'll go ahead and remove them from the externally visible interface.
I'll also go ahead and revert back to using ABC instead of protocol in order to still be able to support Python 3.7, and will switch from using asyncio to concurrent.futures.
I'll go ahead and send the first pull request shortly.
— Reply to this email directly, view it on GitHub https://github.com/zykls/folktables/issues/26#issuecomment-1201731556, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACLSMAFFVDYDHONZZKHJEE3VXA5HVANCNFSM543MMQ3Q . You are receiving this because you were assigned.Message ID: @.***>
Hi! I'm proposing a refactoring of folktables's
DataSource
component (refer to this ZIP to see the changes already implemented).Proposed changes
In this section I describe the changes I'm proposing that will affect the structure of the
DataSource
component.DataSource
from anABC
to aProtocol
: as I see it, the role ofDataSource
is to define an interface for classes that help with the (down)loading of different datasets. By havingDataSource
be aProtocol
we'll be taking advantage of Python's duck typing and will make it more straight forward that this is the interface that should be followed (plus, we have the added benefit that we do not have to directly inherit fromDataSource
whenever defining this component for a different dataset).DownloadResource
,LoadResource
, andFilesResource
: these aredataclasses
that contain information pertaining to file paths and URLs for the different datasets the user wants to (down)load. These objects (note thatFilesResource
just contains instances ofDownloadResource
andLoadResource
) makes it easier to keep track of the resources needed to (down)load files.load_acs.py
and theACSDataSource
class inacs.py
into one file: in this refactoring I'm following the re-definedDataSource
interface, I'm using the utility functions I created or refactored, and I'm taking advantage of the*Resource
objects defined. One additional thing I did was to turn the hard coded URL into a "constant" variable.Proposed file structure