itamarst opened this issue 4 years ago
Indeed they have: https://data-apis.org/blog/announcing_the_consortium/
@itamarst I am in that consortium and the goal is slightly different from what you describe. At the current iteration, it is more about library consumers of dataframes rather than end users. There is a vast difference in the capabilities of Dask and Modin, for example, so in the context of the consortium, we wanted to create a way for libraries to accept all of those types without having to individually special-case each one.
I think what you describe would be beneficial, because I often see many people say "such and such library doesn't cover the full pandas API, but it gets pretty close". It would be good to have something more precise that can be evaluated, instead of having to take anyone's word for it.
We have something that evaluates Modin's API against pandas using `dir`, and then checks each parameter and parameter default against pandas. That would be easy to adapt to more libraries (though it's quite hacky at the moment). Beyond that it would be good to execute a test suite to evaluate that the results are identical (and in the right order).
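The `dir`-based comparison described above can be sketched with the standard library's `inspect` module. The toy classes below stand in for the real pandas and Modin DataFrame classes; the same logic would apply unchanged to `pandas.DataFrame` vs `modin.pandas.DataFrame`:

```python
import inspect

# Toy stand-ins for the reference (pandas) and candidate (Modin) APIs.
class ReferenceFrame:
    def head(self, n=5): ...
    def tail(self, n=5): ...
    def merge(self, right, how="inner"): ...

class CandidateFrame:
    def head(self, n=5): ...
    def merge(self, right, how="left"): ...  # wrong default for `how`
    # `tail` is missing entirely

def compare_api(reference, candidate):
    """Report missing methods and mismatched parameters/defaults."""
    report = {"missing": [], "signature_mismatch": []}
    for name in dir(reference):
        if name.startswith("_") or not callable(getattr(reference, name)):
            continue
        if not hasattr(candidate, name):
            report["missing"].append(name)
            continue
        ref_sig = inspect.signature(getattr(reference, name))
        cand_sig = inspect.signature(getattr(candidate, name))
        # Parameter objects compare by name, kind, default, and annotation.
        if ref_sig.parameters != cand_sig.parameters:
            report["signature_mismatch"].append(name)
    return report

report = compare_api(ReferenceFrame, CandidateFrame)
# report flags `tail` as missing and `merge` as a signature mismatch
```

This only checks the API surface, not behavior, which is why the comment above also suggests executing a shared test suite.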
I'm going to reopen this to track it. Let me know if you're interested in such a suite.
Ah, I see.
Some more context:
This idea came to me when looking at `DataFrame.__setitem__`. In theory it is often supposed to work by falling back to Pandas. In practice, that was broken; my PR you just merged (thank you!) might have fixed it for some of the additional cases, but then again might not. So my guess is there are other places in Modin where e.g. the Pandas fallback doesn't work, just because the Pandas API is so massively huge and complex.
I am working on Modin as consulting work on behalf of an organization (G-Research) that is interested in speeding up Pandas code, so they want (A) to have things in good working order and fast for things they use and (B) have good relationship+knowledge for when there's bug fixes needed. I understand you talked to Alex Scammon a month ago or so?
So, yes, they are interested in having me work on such a suite, and more broadly on a personal level I think it would be great if all the Pandas-like libraries benefited from it.
As a first pass, my thought was that the Pandas test suite would be a reasonable starting point. So as a prototype I'm going to try to take some pandas test modules, and see if they can be generalized to run against multiple libraries, Pandas and Modin to begin with. If that seems workable we can discuss next steps (versioning and change over time seems like one big issue; Dask support seems like another, although I imagine it's not your particular concern, unless/until Modin ever publicly exposes a lazy API).
Does that seem reasonable? Do you have other ideas for approaching this?
> I understand you talked to Alex Scammon a month ago or so? ... So, yes, they are interested in having me work on such a suite, and more broadly on a personal level I think it would be great if all the Pandas-like libraries benefited from it.
Yes. I think that it would be good for the overall community as well. There's a lot of opaqueness to users coming from Pandas, and asking "will it run?" should be a simple answer in an ideal world.
> Does that seem reasonable? Do you have other ideas for approaching this?
The Pandas test suite tests very small dataframes. The only reason I haven't tried this for Modin CI is because we need to test partitioning and communication on larger dataframes; otherwise the computation is mostly trivial. For this use case (API compliance), it makes more sense. Since the tests are so small, the overheads of IPC and RPC will dominate, and Modin will take a lot longer on those tests since there are so many of them.
Modin will likely not expose a publicly lazy API for a while at least. Dask support is interesting to me from a functional coverage comparison perspective, but it does have some different semantics in terms of ordering, which might make correctness evaluation difficult.
I guess large datasets are more likely to encounter Modin bugs, so definitely useful to have too, but perhaps that's phase 2 or 3 or 5. Presumably that is at least somewhat covered by the Modin test suite, as opposed to API coverage.
So: I will start prototyping with just Pandas and Modin, using some bit of the Pandas test suite as a basis, and see how it goes.
Started work at https://github.com/pythonspeed/pandas-compliance-suite-prototype, initially just writing a design-as-verb document.
OK, I have expanded it with some initial implementation ideas. I am now leaning towards using type annotation analysis, rather than the Pandas test suite. Your feedback would be very welcome!
Here is my specific proposal:
By highly specific, I mean e.g. using `@overload` to specify multiple different input/output pairs, using types like `Index[bool]` if a boolean Index is special, etc.
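For concreteness, `typing.overload` lets a single function declare multiple input/output pairs that a type checker can verify callers against. The example below is a hypothetical toy loosely mimicking `DataFrame.__getitem__`, where a single column label yields a "Series" (here a list) and a list of labels yields a "DataFrame" (here a dict); the real pandas annotations would be far more involved:

```python
from typing import Union, overload

# Two declared input/output pairs for the type checker...
@overload
def select(df: dict, key: str) -> list: ...
@overload
def select(df: dict, key: list) -> dict: ...

# ...backed by a single runtime implementation.
def select(df: dict, key: Union[str, list]) -> Union[list, dict]:
    if isinstance(key, str):
        return df[key]
    return {k: df[k] for k in key}

df = {"a": [1, 2], "b": [3, 4]}
column = select(df, "a")        # checker knows this is a list
subframe = select(df, ["a"])    # checker knows this is a dict
```

With annotations like these, passing the "wrong" key type becomes a static error rather than a runtime surprise.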
This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Pandas. So seems like an easy sell.
This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Modin.
Simply by having items 1 and 2, switching the import from Pandas to Modin will allow type checking whether the APIs are compatible.
No additional work needed by maintainers.
Static analysis may be difficult for some users.
So using e.g. `typeguard`, maintainers of Modin etc. can enable runtime checking with some sort of API flag or environment variable, so there's no cost by default.
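The opt-in runtime checking described above can be illustrated with a hand-rolled decorator; this is only a stand-in for what a library like typeguard actually provides, and the `CHECK_TYPES` environment variable is a hypothetical name for the opt-in flag:

```python
import functools
import inspect
import os
import typing

def maybe_typechecked(func):
    """Check annotations at call time, but only if the user opted in."""
    if not os.environ.get("CHECK_TYPES"):  # hypothetical opt-in flag
        return func  # no wrapper, so zero cost by default
    hints = typing.get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Simplified: only handles plain classes, not generics.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(f"{name} should be {expected.__name__}")
        return func(*args, **kwargs)
    return wrapper

os.environ["CHECK_TYPES"] = "1"  # opt in for this demo

@maybe_typechecked
def head(rows: list, n: int):
    return rows[:n]
```

The key property is that when the flag is unset, the original function is returned untouched, so opted-out users pay nothing.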
This is a small amount of work, probably.
This addresses MAINTAINER-GOAL-ADDRESS-INCOMPAT. It will require some software development, but seems like a nicely scoped project.
@itamarst That seems like a very useful utility. There is a lot of undocumented behavior in pandas and Modin that just floats around in maintainer heads. This would either force the documentation of a particular behavior or force its removal.
So I guess the next step is to socialize this idea with the Pandas devs, since we would need some buy-in from them to add and maintain the type annotations. Do you have some good relationships there? Or do you think the data APIs consortium would be a good starting point? Otherwise I can check what contacts G-Research has.
I do have a good relationship with them, I can reach out to a couple of them. The type annotations should not add any burden to the maintenance of pandas, so hopefully it shouldn't be too much of an issue. I will see if someone there can comment here.
Hey, have you heard anything back re Pandas type annotations?
Yes, there's no overall issue for type hints. I pointed @TomAugspurger here, I think he will comment when he has the chance.
They do want to add type hints, the closest issue for this in the pandas issue tracker is here: pandas-dev/pandas#28142
@TomAugspurger I'd be happy to do a quick video call about this too, since it's higher bandwidth.
Pandas has an ongoing effort to add types to the public API, though I'm not following it closely. In general, the idea of matching types seems sensible. It's like a stricter version of verifying that the function signatures match.
One question: pandas' types return `-> pandas.DataFrame`, whereas other libraries would return their own `DataFrame` implementation. Would the new tool mentioned in https://github.com/modin-project/modin/issues/1915#issuecomment-680878942 handle that translation? Or would pandas' types need to return some sort of protocol type?
@TomAugspurger the idea is that as a user, I just change my import from `import pandas as pd` to `from modin import pandas as pd`, and now the type checker is using Modin's types, which are a semantically compatible subset. Whether it's a protocol or not is somewhat orthogonal, hopefully, because constructors will always return the import-specific type.
After some research, it seems like type annotations, while definitely worth pursuing, are probably not a good short-term answer. So after some discussion, I wrote up another approach that might be helpful in the short term (and also helpful to pure-Pandas users).
Thanks @itamarst, it seems much more doable.
A couple of questions: 1.) Would the utility in this proposal live in Modin? 2.) How will this handle weird edge cases in pandas, e.g. #2239? I don't know if it will be possible to check those statically, because they rely on consistent behavior in pandas.
I'm going to work on it separately, because it has additional use cases, e.g. "can I upgrade from Pandas X to Pandas Y". Started dumping more detailed design notes into https://github.com/G-Research/bearcat/blob/main/DESIGN.md.
Beyond a certain point, there's going to have to be custom comparison logic, and room for human discretion. For example, minor floating point differences might not be something a user cares about, or as you said cases where Pandas is just strange or wrong. So the goal isn't really pass/fail, more like giving the user the information they need to make follow-up decisions ("close enough, we can switch" or "I should use a different API for better compat" or "something is wrong with new results" or even "something is wrong with old results").
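The "information, not pass/fail" idea above can be sketched as a comparison that classifies differences rather than rejecting them outright. This is a minimal stand-alone sketch, not bearcat's actual logic, and the category names are illustrative:

```python
import math

def compare_results(old, new, float_rel_tol=1e-6):
    """Return a per-index report of differences instead of a verdict,
    separating float noise from genuine behavioral differences."""
    report = []
    for i, (a, b) in enumerate(zip(old, new)):
        if a == b:
            continue
        if (isinstance(a, float) and isinstance(b, float)
                and math.isclose(a, b, rel_tol=float_rel_tol)):
            report.append((i, "float-noise", a, b))
        else:
            report.append((i, "real-difference", a, b))
    return report

# Results from the "old" and "new" library for the same operation:
old_results = [1.0, 2.0, 3.0]
new_results = [1.0, 2.0000000001, 4.0]
report = compare_results(old_results, new_results)
```

A human can then decide that the float noise is acceptable while the real difference needs investigation, which is exactly the follow-up decision the comment describes.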
https://github.com/g-research/bearcat now has a tracing-based prototype, demonstrating some of the many, many edge cases that need to be handled.
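To make the tracing idea concrete: record every method call made against one backend, then replay the same sequence against another and compare. This toy proxy (all names illustrative, not bearcat's actual implementation, and ignoring the many edge cases the prototype deals with) shows the core mechanism:

```python
class TracingProxy:
    """Wrap an object and record every method call made through it."""

    def __init__(self, target, trace):
        self._target = target
        self._trace = trace

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def recorded(*args, **kwargs):
            self._trace.append((name, args, kwargs))
            return attr(*args, **kwargs)
        return recorded

def replay(trace, target):
    """Apply a recorded call sequence to a different backend."""
    return [getattr(target, name)(*args, **kwargs)
            for name, args, kwargs in trace]

# Record calls against one "backend" (a plain list stands in here)...
trace = []
recorder = TracingProxy([3, 1, 2], trace)
recorder.sort()
recorder.append(9)

# ...then replay the identical sequence against a second backend.
other_backend = [3, 1, 2]
replay(trace, other_backend)
```

Comparing the two backends' states after replay is then exactly the result-comparison problem discussed above.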
There are clear issues with this (e.g. bugs in Pandas—how would they be handled?) but I suspect they would be fixable.
Is this something Modin would be interested in? Any idea if anyone has discussed this before?