Open PGijsbers opened 3 years ago
I fully agree to this suggestion. Indeed, dataframes were added later and have matured substantially since then and I guess we can consider handling them standard knowledge for machine learners.
I took a look at this to see how the proposed changes would behave and it appeared to lead to a few issues with some functions not fully integrated with dataframes yet. Many of these errors were Attribute errors centering around flow_id, task_id, and run_id, and appear to be the result of some functions expecting objects other than dataframes. It also lead to repeated unit test failures involving Key_Errors, namely with flow_id, task_id, and uploader. There were also some assertion errors that wanted to assert some object was a dict, but I imagine these assertions would be irrelevant with dataframes as the default. I have a complete list of unit tests that failed after implementing these changes that I can provide as well, but I think there is still more integration with dataframes needed from the codebase before being able to totally implement this change.
In that case it might be best to do this incrementally. First we'd get a PR that changes the default, but makes internal calls use the explicit output_format="dict"
so you can change the external behavior without making many internal changes. We could then further refactor the internal handling of the data in separate PRs where it makes sense.
Hi!
I am not sure how I ended up here, but I looked at what you guys are doing and It's pretty remarkable. I find this challenge to be my starter in this project and would love to contribute more. =)
Hi @varshneydevansh, thank you! We already started working on this, as can be seen in #1258. It's already assigned to me since it's pretty involved. However, if you would like to do this instead, here is basically what needs to be done:
The next step is to actually make the change and make output_format='dataframe'
the default as well as removing output_format='dict'
support wherever the message indicated it. You should be able to find all instances of this in # TODO: [0.15]
. Then, make sure to update any unit tests necessary and update the documentation.
If you'd like something smaller, I could recommend you to work on #1142 or #1213 (the latter requires some Github Actions knowledge, but no knowledge of our code-base).
We missed this in the 0.15 release, adding it to 0.16
I propose that the several functions that take the
ouput_format
parameter and produce lists change their defaults to return dataframes instead of dictionaries (e.g.list_datasets
). I think originally we had loose integration with pandas and it was an optional dependency, but since it's a required dependency now I don't see a reason to providedicts
as default overdataframes
. I'd argue most people prefer dataframes in interactive sessions, and it makes sense to minimize the amount of writing people have to do in those (also plenty of people might prefer dataframes in scripts too).