zbrookle opened this issue 4 years ago
hmm this is a very good point
separating is meant for hard-to-test packages that might need a lot of CI resources and have different source communities
e.g. spark, omnisci
but dask is very lightweight and shares a lot with pandas
I think we've already got a working implementation of the dask backend. I agree that ideally we want to reuse as much code as possible. But I'd merge what we've got now in this repo, and then do whatever refactoring is needed until we reuse all the code that makes sense.
@datapythonista I'm not saying to scrap all the code that was written, but I think in terms of refactoring, it's going to be a lot harder to refactor something when the code is in a separate repo. It actually looks like the only code that's even in this repo's pull request is the CI code
Thanks for the comment @zbrookle. I'm certainly open to moving the Dask code into the core of Ibis if it makes things easier. But I'm not sure it really would; it might be the opposite. I think changes will be much faster outside Ibis: before we publish the backend, we don't need to be strict with reviews or hold the backend to very high standards. I also think that having two separate projects should make it clearer how the dask backend depends on the pandas one.
Having it in a separate repo also opens the door to starting a new dask backend from scratch, if that turns out to be a good option.
I'd say let's move forward with the backend we've got in this repo, and if at some point having two separate repos makes things complicated, let's move the code into the Ibis core. Does that sound good?
i am not sure a lack of review is a good thing
rethinking this a bit
dask and pandas backends likely share a lot of code
so it doesn't make sense for these to be separate at all
From past experience (Arrow and MSSQL backends), having a slow review process meant that those backends were created elsewhere. Now we have a Dask backend that afaik is working, but doesn't reuse as much code as it could. This is a good first step IMO. I understand that having a smaller backend that reuses code would be better, but we don't have it now, and afaik nobody is working on it. So, the first decision to make is whether we take the working backend we have, or wait for a smaller one that reuses more code.
Regarding reusing more code, the main advantage I see is being able to open PRs that touch both the pandas and the dask backend together. But I don't see that being important, or even desirable. Opening a PR against the pandas backend first, cleaning up the code so other backends can reuse it, seems even better.
For the rest, reusability means calling the pandas backend's functions or subclassing its classes, and I don't see that importing the pandas backend makes any difference whether the Dask backend is in the same library or in a separate one.
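To make that concrete, here is a minimal hypothetical sketch; the class names are illustrative stand-ins, not real ibis classes. The subclassing pattern is identical whether `DaskBackend` sits in the same package as `PandasBackend` or imports it from a separate one:

```python
# Hypothetical sketch (names are illustrative, not real ibis classes):
# reuse by subclassing looks the same whether DaskBackend lives in the
# same package as PandasBackend or imports it from another library.
import dask.dataframe as dd
import pandas as pd


class PandasBackend:
    """Stand-in for the pandas backend's execution entry point."""

    def execute(self, df):
        # Pretend "execution" is some dataframe transformation.
        return df.assign(doubled=df["value"] * 2)


class DaskBackend(PandasBackend):
    """Overrides only what dask does differently; everything else is reused."""

    def execute(self, df):
        result = super().execute(df)
        # Dask results are lazy; materialize them for the caller.
        if isinstance(result, (dd.DataFrame, dd.Series)):
            result = result.compute()
        return result


pdf = pd.DataFrame({"value": [1, 2, 3]})
print(PandasBackend().execute(pdf))
print(DaskBackend().execute(dd.from_pandas(pdf, npartitions=2)))
```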
Feel free to reopen the Dask backend PR in the Ibis core if you think it's better, but I'd personally be more dynamic: take the Dask backend we've got, and then keep iterating. Also, having it in a separate repo would make it easier if eventually someone wants to write another Dask backend that reuses lots of the pandas backend code.
@jreback @datapythonista I think a third option that you may want to consider is having one repo for all dataframe-based backends. Then you can maintain the shared code in one place, and also only worry about maintaining CI for those backends, which will have similar tests. I don't know if Modin and rapids/cudf are in the scope of this project, but it would make sense to me if they were, and they may also have shared code with pandas/dask. I'd be happy to work on this if it's something that would be useful.
I'm personally happy to reopen my PR (https://github.com/ibis-project/ibis/pull/2400) in either place, whatever is ultimately easier for the ibis org (though I definitely want to go through and clean up/fix the parts of the code I know are broken before asking for a code review; it also might be easier if I incrementally open PRs for each piece of functionality in `execution` instead of doing one massive PR with everything).
As for overlap between the pandas and dask backends, in case this is helpful (a short sketch of the dask patterns mentioned here follows the list):

- Tests assert `result.compute() == expected.compute()` (where `result` was executed via ibis and `expected` is computed in dask or via `dd.from_pandas`). Similarly, my `conftest.py` returned dask objects (e.g. via `set_index`).
- The pandas backend relies on `grouper` and `obj` objects, which I don't believe we have access to in dask. There were also places I had to write my own `dd.Aggregation` functions.
- Dask wants `meta` in some places, and there's currently a bunch of warnings about unspecified `meta` that should be cleaned up.
- Some code needs `dd.array` instead of numpy.
- Elsewhere in `execution`, I think there was a high amount of code overlap, though I haven't gone through and fixed some non-working things yet. I'm happy to go back and try to put together a more detailed comparison. That all being said, I'm relatively new to ibis, so there are definitely things I may be misunderstanding.
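For readers unfamiliar with the dask idioms named in the list above, here is a minimal standalone sketch (toy data, not ibis code) of three of them: `compute()`-based test assertions, explicit `meta`, and a custom `dd.Aggregation`:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Test-style assertion: both sides are lazy, so compare materialized results.
result = ddf["value"] + 1
expected = dd.from_pandas(pdf["value"] + 1, npartitions=2)
assert (result.compute() == expected.compute()).all()

# Passing `meta` tells dask the output schema up front, which avoids the
# "no meta provided" warnings mentioned above.
doubled = ddf["value"].map_partitions(lambda s: s * 2, meta=("value", "int64"))

# Dask exposes no `grouper`/`obj` internals, so custom reductions are
# written as chunk/agg steps with dd.Aggregation instead.
total = dd.Aggregation(
    name="total",
    chunk=lambda s: s.sum(),  # reduce each partition
    agg=lambda s: s.sum(),    # combine the per-partition results
)
print(ddf.groupby("key")["value"].agg(total).compute())
```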
> @jreback @datapythonista I think a third option that you may want to consider is having one repo for all dataframe-based backends. [...]
yes this is certainly an option.
The issue I have here is that I think it's very hard for a new contributor to set up all of this infrastructure without a LOT of cookie-cutter help (which is itself hard to build, and so IMHO not worth it).
I would prefer to build the dask backend in the current ibis repo.
Then if we do want to split things out, we certainly can, but let's take something that's already done (e.g. omnisci, spark) and split that out first, and see how it goes. It's not worth it to build a new backend in a new repo with a new contributor; that's too many new things at once.
Does that make sense, @gerrymanoim @datapythonista?
@datapythonista I'm not trying to undermine your effort at all; I really, really appreciate everything. I just think that in this case we should take it one step at a time.
I'm ok having the dask backend in Ibis. But are you ok to take the implementation we have now? Or the potentially much smaller implementation? If you want the smaller implementation, my point is that I'd rather get the existing one here than possibly end up with none.
In any case, I'm working on making it easy for backends to be moved around. So, hopefully it will soon be trivial where they live, because we should be able to move them very quickly when we have a reason to do it.
So, part of the reason I originally suggested adding dask (in ibis-project/ibis#2245) was the amount of code the two share in terms of the dataframe API. I'm a bit surprised to see this particular backend in a different repo when there's so much opportunity to use code identical to what's already been written for pandas.
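As a toy illustration of that shared dataframe API (a hypothetical helper, not ibis code): the same execution logic can often run unchanged against pandas and dask objects:

```python
# Sketch of why the backends can share code: pandas and dask dataframes
# expose largely the same API, so one execution function can serve both.
import pandas as pd
import dask.dataframe as dd


def execute_mean_by_group(df, key, value):
    # Identical call chain works on a pandas or a dask dataframe.
    return df.groupby(key)[value].mean()


pdf = pd.DataFrame({"k": ["a", "b", "a"], "v": [1.0, 2.0, 3.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

print(execute_mean_by_group(pdf, "k", "v"))            # eager pandas result
print(execute_mean_by_group(ddf, "k", "v").compute())  # lazy dask result
```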