modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.82k stars 652 forks source link

Has anyone brought up the idea of a compliance suite for Pandas? #1915

Open itamarst opened 4 years ago

itamarst commented 4 years ago
  1. There are now five different projects with Pandas-like APIs (maybe more?): Pandas, Modin, Dask, cuDF, Koalas.
  2. The Pandas API is huge.
  3. Much of the existing Pandas test suite could probably be adapted so that it could run against different implementations. This would allow checking compliance of all the different libraries.

There are clear issues with this (e.g. bugs in Pandas—how would they be handled?) but I suspect they would be fixable.

Is this something Modin would be interested in? Any idea if anyone has discussed this before?

itamarst commented 4 years ago

Indeed they have: https://data-apis.org/blog/announcing_the_consortium/

devin-petersohn commented 4 years ago

@itamarst I am in that consortium and the goal is slightly different than what you describe. At the current iteration, it is more about library consumers of dataframes rather than end users. There is a vast difference in the capabilities for Dask and Modin, for example, so in the context of the consortium, we wanted to create a way for libraries to accept all of the types without having to individually case each one.

I think what you describe would be beneficial, because I often see many people say "such and such library does cover the full pandas API, but it gets pretty close". It would be good to have something more precise that can be evaluated instead of having to take anyone's word for it.

We have something that evaluated Modin's API against pandas using dir and then checking each parameter and parameter default against pandas, that would be easy to adapt to more libraries (though it's quite hacky at the moment). Beyond that it would be good to execute a test suite to evaluate the results are identical (and in the right order).

I'm going to reopen this to track it. Let me know if you're interested in such a suite.

itamarst commented 4 years ago

Ah, I see.

Some more context:

This idea came to me when looking at DataFrame.__setitem__. In theory it often is supposed to work by falling back to Pandas. In practice, that was broken; my PR you just merged (thank you!) might have fixed it for some of the additional cases, but then again might not. So my guess is there are other places in Modin where e.g. the Pandas fallback doesn't work, just because the Pandas API is so massively huge and complex.

I am working on Modin as consulting work on behalf of an organization (G-Research) that is interested in speeding up Pandas code, so they want (A) to have things in good working order and fast for things they use and (B) have good relationship+knowledge for when there's bug fixes needed. I understand you talked to Alex Scammon a month ago or so?

So, yes, they are interested in having me work on such a suite, and more broadly on a personal level I think it would be great if all the Pandas-like libraries benefited from it.

As a first pass, my thought was that the Pandas test suite would be a reasonable starting point. So as a prototype I'm going to try to take some pandas test modules, and see if they can be generalized to run multiple libraries, Pandas and Modin to begin with. If that seems workable we can discuss next steps (versioning and change over time seems like one big issue; Dask support seems like another, although I imagine it's not your particular concern, unless/until Modin ever exposes an publicly lazy API).

Does that seem reasonable? Do you have other ideas for approaching this?

devin-petersohn commented 4 years ago

I understand you talked to Alex Scammon a month ago or so? ... So, yes, they are interested in having me work on such a suite, and more broadly on a personal level I think it would be great if all the Pandas-like libraries benefited from it.

Yes. I think that it would be good for the overall community as well. There's a lot of opaqueness to users coming from Pandas, and asking "will it run?" should be a simple answer in an ideal world.

Does that seem reasonable? Do you have other ideas for approaching this?

The Pandas test suite tests very small dataframes. The only reason I haven't tried this for Modin CI is because we need to test partitioning and communication on larger dataframes, otherwise the computation is mostly trivial. For this use-case, it makes more sense. Since the tests are so small, the overheads of IPC and RPCs will dominate and Modin will take a lot longer on those tests since there are so many.

Modin will likely not expose a publicly lazy API for a while at least. Dask support is interesting to me from a functional coverage comparison perspective, but it does have some different semantics in terms of ordering, which might make correctness evaluation difficult.

itamarst commented 4 years ago

I guess large datasets are more likely to encounter Modin bugs, so definitely useful to have too, but perhaps that's phase 2 or 3 or 5. Presumably that is at least somewhat covered by the Modin test suite, as opposed to API coverage.

So: I will start prototyping with just Pandas and Modin, using some bit of the Pandas test suite as a basis, and see how it goes.

itamarst commented 4 years ago

Started work at https://github.com/pythonspeed/pandas-compliance-suite-prototype, initially just writing a design-as-verb document.

itamarst commented 4 years ago

OK, I have expanded it with some initial implementation ideas. I am now leaning towards using type annotation analysis, rather than the the Pandas test suite. Your feedback would be very welcome!

itamarst commented 4 years ago

Here is my specific proposal:

Proposal

1. Pandas adds highly-specific full-coverage type annotations

By highly-specific, I mean e.g. using @override to specify multiple different input/output pairs, using types like Index[bool] if a boolean Index is special, etc..

This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Pandas. So seems like an easy sell.

2. Modin etc. add highly-specific full-coverage type annotations

  1. Modeled on Pandas.
  2. Careful to only add type annotations for things that are actually tested.

This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Modin.

3. Users of Pandas can use static analysis (mypy etc.) to validate that switching to Modin will work

Simply by having items 1 and 2, switching import from Pandas to Modin will allow type checking if APIs are compatible.

No additional work needed by maintainers.

4. Modin etc. can optionally have a runtime checking mode for users attempting to switch from Pandas to Modin.

Static analysis may be difficult for some users.

So using e.g. typeguard, maintainers of Modin etc. can enable runtime checking with some sort of API flag or environment variable, so there's no cost by default.

This is a small amount of work, probably.

5. A new tool can generate diffs between type annotations for Pandas and type annotations for Modin etc., for documentation purposes and for MAINTAINER-GOAL-ADDRESS-INCOMPAT.

This will require some software development, but seems like a nicely scoped project.

devin-petersohn commented 4 years ago

@itamarst That seems like a very useful utility. There is a lot of undocumented behavior in pandas and Modin that just floats around in maintainer heads. This would either force the documentation of a particular behavior or force its removal.

itamarst commented 4 years ago

So I guess next step is socialize this idea with the Pandas devs, since would need some buy-in from them to add/keep maintained the type annotations. Do you have some good relationships there? Or do you think the data APIs consortium would be a good starting point? Otherwise I can see how G-Research is on contacts.

devin-petersohn commented 4 years ago

I do have a good relationship with them, I can reach out to a couple of them. The type annotations should not add any burden to the maintenance of pandas, so hopefully it should be too much of an issue. I will see if someone there can comment here.

itamarst commented 4 years ago

Hey, have you heard anything back re Pandas type annotations?

devin-petersohn commented 4 years ago

Yes, there's no overall issue for type hints. I pointed @TomAugspurger here, I think he will comment when he has the chance.

They do want to add type hints, the closest issue for this in the pandas issue tracker is here: pandas-dev/pandas#28142

itamarst commented 4 years ago

@TomAugspurger I'd be happy to do a quick video call about this too, since it's higher bandwidth.

TomAugspurger commented 4 years ago

Pandas has an ongoing effort to add types to the public API, though I'm not following it closely. In general, the idea of matching types seems sensible. It's like a stricter version of verifying that the function signatures match.

One question: pandas types return -> pandas.DataFrame, whereas other libraries would return their own DataFrame implementation. Would the new tool mentioned in https://github.com/modin-project/modin/issues/1915#issuecomment-680878942 handle that translation? Or would pandas' types need to return some sort of protocol type?

itamarst commented 3 years ago

@TomAugspurger the idea is that as a user, I just change my import from import pandas as pd to from modin import pandas as pd, and now the type checker is using Modin's types, which are a semantically compatible subset. Whether it's protocol or not is somewhat orthogonal, hopefully, because constructors will always return the import-specific type.

itamarst commented 3 years ago

So after some research it seems like type annotations, while definitely worth pursuing, are probably not a good short-term answer. So after some discussion, I wrote up another approach that might be helpful in the short term (and also helpful to pure Pandas-only users).

devin-petersohn commented 3 years ago

Thanks @itamarst, it seems much more doable.

A couple of questions: 1.) Would the utility in this proposal be in Modin? 2.) How will this solve weird edge cases for pandas? e.g. #2239, these are where I don't know if it will be possible to statically check because they rely on consistent behavior in pandas.

itamarst commented 3 years ago
  1. I'm going to work on it separately, because it has additional use cases, e.g. "can I upgrade from Pandas X to Pandas Y". Started dumping more detailed design notes into https://github.com/G-Research/bearcat/blob/main/DESIGN.md.

  2. Beyond a certain point, there's going to have to be custom comparison logic, and room for human discretion. For example, minor floating point differences might not be something a user cares about, or as you said cases where Pandas is just strange or wrong. So the goal isn't really pass/fail, more like giving the user the information they need to make follow-up decisions ("close enough, we can switch" or "I should use a different API for better compat" or "something is wrong with new results" or even "something is wrong with old results").

itamarst commented 3 years ago

https://github.com/g-research/bearcat now has a tracing-based prototype, demonstrating some of the many many edge cases that need to be handled.