A lot of shared code with pandas backend?

zbrookle commented 4 years ago

So part of the reason I suggested adding dask originally (in ibis-project/ibis#2245) was because of the amount of code that they both share, in terms of the dataframe api. I'm a bit surprised to see this particular backend in a different repo when there's so much opportunity for using identical code to what's already been written for pandas

jreback commented 4 years ago

hmm this is a very good point

separating is meant for hard to test packages that might need a lot of ci resources and have differentl source communities

eg spark, omnisci for example

but dask is very lightweight and shares a lot with pandas

datapythonista commented 4 years ago

I think we've already got a working implementation of the dask backend. I agree that ideally we want to reuse as much code as possible. But I'd merge what we've got now in this repo, and then perform any refactoring until we reuse all the code that it makes sense.

zbrookle commented 4 years ago

@datapythonista I'm not saying to scrap all the code that was written, but I think in terms of refactoring, it's going to be a lot harder to refactor something when the code is in a separate repo. It actually looks like the only code that's even in this repo's pull request is the CI code

datapythonista commented 4 years ago

@datapythonista I'm not saying to scrap all the code that was written, but I think in terms of refactoring, it's going to be a lot harder to refactor something when the code is in a separate repo. It actually looks like the only code that's even in this repo's pull request is the CI code

Thanks for the comment @zbrookle. I'm surely open to move the Dask code into the core of Ibis if it makes things easier. But I'm not sure it's really going to make things easier, or the opposite. I think changes will be much faster outside Ibis. Before we publish the backend, we don't need to get crazy with reviews, and keeping very high standards for the backend. And I also think that having two separate projects should make things clearer, on how the pandas and dask backend depend on the pandas one.

Having it in a separate repo it also opens the door to start a new dask backend from scratch, if that was a good option.

I'd say let's move forward creating the backend we've got in this repo, and if at some point having two separate repos makes things complicated, let's move the code to the Ibis core. Does that sounds good?

jreback commented 4 years ago

i am not sure lack of review is a good thing

rethinking this a bit

dask and pandas backends likely share a lot of code

so these don't make sense to be separate at all

datapythonista commented 4 years ago

From past experience (Arrow and MSSQL backends), having a slow review process meant that they created the backends elsewhere. Now we have a Dask backends that afaik is working, but doesn't reuse as much code as it could. This is a good first step IMO. I understand that having a smaller backend that reuses code would be better, but we don't have it now, and afaik nobody is working on it. So, the first decision to make is:

Having the already implemented backend in the Ibis core
Having it in the ibis organization
Not have it anywhere, and let the author have it in their own GitHub

Regarding reusing more code, the main advantage I see is being able to open PRs that touch both the pandas and the dask backend together. But I don't see that being important, or even desirable. Having first a PR to the pandas backend, where the code is cleaned, so it can be better reused by other backends seems probably even better.

For the rest, reusability will be calling its functions, or subclassing its classes, I don't see it makes a difference importing the pandas backend whether the Dask backend is in the same library or in a separate one.

Feel free to reopen the Dask backend PR in the Ibis core if you think it's better, but I'd personally be more dynamic. Get the Dask backend we've got, and then keep iterating. Also, having it in a separate repo would make it easier if eventually someone wants to write another Dask backend that reuses lots of the pandas backend code.

zbrookle commented 4 years ago

@jreback @datapythonista I think that a third option that you may want to consider is having a repo for all dataframe based backends. Then you can maintain the shared code in one place, and also only worry about maintaining CI for those backends which will have similar tests. I don't know if Modin and rapids/cudf are in the scope of this project but it would make sense to me if they were and they may have also have shared code with pandas/dask. I'd be happy to work on this if it's something that would be useful

gerrymanoim commented 4 years ago

I'm personally happy to reopen my PR (https://github.com/ibis-project/ibis/pull/2400) in either place, whatever is ultimately easier for the ibis org (though I definitely want to go through and clean up/fix parts of the code I know are broken before asking for a code review. It also might be easier if I incrementally open PRs for each functionality in execution instead of doing a massive PR with everything).

As for overlap between the pandas and dask backend (in case this is helpful):

A large percentage of the tests were fairly similar, with the dask tests comparing dask dataframes (i.e. assert result.compute() == expected.compute() where result was executed via ibis and expected is done in dask or via dd.from_pandas). Similarly my conftest.py returned dask objects.
I think there will be major differences for any code that requires sorting (AFAIK dask only allows for sorting on a single column only, via set_index).
There will be also major difference with the grouping code. Some of the pandas group by code relies on access to the grouper and obj objects, which I don't believe we have access to in dask. There were also places I had to write my own dd.Aggregation functions.
Similarly, I needed to provide meta in some places, and there's currently a bunch of warnings about unspecified meta that should be cleaned up.
There are some places I had to use dd.array instead of numpy.
Outside execution, I think there was a high amount of code overlap, though I haven't gotten through and fixed some non working things yet.
IIRC there were a few areas I needed to add more type dispatches.

I'm happy to go back an try to put a more detailed comparison together. That all being said, I'm relatively new to ibis, so there's definitely things I may be misunderstanding.

jreback commented 4 years ago

@jreback @datapythonista I think that a third option that you may want to consider is having a repo for all dataframe based backends. Then you can maintain the shared code in one place, and also only worry about maintaining CI for those backends which will have similar tests. I don't know if Modin and rapids/cudf are in the scope of this project but it would make sense to me if they were and they may have also have shared code with pandas/dask. I'd be happy to work on this if it's something that would be useful

yes this is certainly an option.

jreback commented 4 years ago

The issue i have here is that I think its very hard for a new contributor to setup all of this infrastructure, w/o a LOT of cookie-cutter help (which itself hard to build and so IMHO not worth it).

I would prefer to build dask in current ibis.

Then if we do want to split things out, we certainly can, but let's take something that's already done (e.g. omnisci, spark etc) and do it and see how it goes. Not worth it to build a new backend in a new repo with a newer contributor, too many news.

make sense @gerrymanoim @datapythonista

@datapythonista not trying to undermine your effort at all, really really appreciate everything. I just think in this case, let's take one step at a time.

datapythonista commented 4 years ago

I'm ok having the dask backend in Ibis. But are you ok to have the implementation we have now? Ot the potential much smaller implementation? If you want the smaller implementation, my point is that I'd rather get the existing one here, than possibly end up with none.

In any case, I'm working on making it easy for backends to ve moved around. So, hopefully it's soon trivial where they are, because we should be able to move them very quickly when we have reasons to do it.

ibis-project / ibis-dask

A lot of shared code with pandas backend? #3