h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

Support for apache arrow. #1232

Closed AnthonyJacob closed 3 years ago

AnthonyJacob commented 6 years ago

Is there any reason why you did not go with apache arrow format from the beginning?

It would at least be nice if you allowed to_arrow_table and from_arrow_table conversions.

st-pasha commented 6 years ago

Yes, we seriously considered Arrow (formerly Feather) as the main backend format for storing data on-disk. However, after careful consideration, we had to conclude that the format could not meet our needs:

In terms of saving datatable Frames into or loading from feather files, this feature was included in our original roadmap (#290), and it still remains there.

AnthonyJacob commented 6 years ago

No support for arrays with ≥2^31 elements, which means that datasets with over 2B rows cannot be stored contiguously.

Isn't that actually a feature? You usually won't be able to fit such a large dataset into RAM anyway, and it also makes some operations faster, since it's easier to update 1 of 10 2B arrays than a whole 20B one.

Most importantly, you should want to make array methods run in constant memory (similar to spark). This specification makes it obvious, so you don't have to repeat pandas mistakes.

No support for in-place NA values (R-style). Without this we would not be able to memmap data from disk directly, and lose the ability to work with datasets that do not fit into RAM.

That depends on where you are getting the data from. If it's your own pipeline that saves the data from sensors, then you can save it to Apache Arrow directly. If it's a pandas dataframe / Spark / Parquet, which use Apache Arrow, then you again already have an Apache Arrow dataframe.

If it's another source, then you truly have to preprocess the data manually, but I am not able to estimate how relevant this scenario is.

jangorecki commented 6 years ago

Uncompressed, 2B double-precision elements take < 15GB. So on a high-end machine with 2TB of RAM you can fit more than 270B elements. OK, not everyone has 2TB, but it is better to be future-proof than to re-implement this in a year or two.
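The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch, and it assumes 8-byte doubles and binary units (GiB and TiB), which is the reading under which both numbers in the comment hold.

```python
# Back-of-the-envelope sizes, assuming 8-byte doubles and binary units.
BYTES_PER_DOUBLE = 8
GIB = 2**30
TIB = 2**40

two_billion = 2 * 10**9

# Memory needed for 2B uncompressed doubles, in GiB.
size_gib = two_billion * BYTES_PER_DOUBLE / GIB

# Number of doubles that fit into 2 TiB.
elements_in_2tib = 2 * TIB // BYTES_PER_DOUBLE

print(f"{size_gib:.1f} GiB for 2B doubles")
print(f"{elements_in_2tib / 1e9:.0f}B doubles fit in 2 TiB")
```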

By asking to update 1 of 10 2B arrays you are basically asking for partitioning, which could be implemented anyway. If you do a vector scan of all 10 arrays, I believe it will be slower than a single 20B vector scan, though possibly easier to parallelize.

st-pasha commented 6 years ago

@AnthonyJacob Suppose I have an array of 10B doubles. That's 80GB of memory, more than my laptop can handle. But if that array lives on a hard drive in a Jay file, I simply memory-map the entire file, and now it behaves as if it were loaded into RAM. But it wasn't. The data is still on disk, and the RAM is free. If you need to update a single value somewhere in the middle of that array, the operating system will load just 1 page (4KB) of data into memory, update that value, and then save it back to disk. (This is all assuming we opened the file in read-write mode).
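The memory-mapping behaviour described above can be sketched with Python's standard library. This is an illustration only: the file is a plain binary file of doubles, not a real Jay file, and the array is scaled down to one million elements so the example runs quickly. The names N, ITEM and index are made up for the sketch.

```python
import mmap
import os
import struct
import tempfile

N = 1_000_000                   # small stand-in for the 10B-element array
ITEM = struct.calcsize("<d")    # 8 bytes per double

# Create a file holding N zeroed doubles (plain binary, not the Jay format).
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.truncate(N * ITEM)

# Map the whole file read-write; the OS loads pages lazily on access.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        index = N // 2
        # Writing one double only touches the page that holds it.
        struct.pack_into("<d", mm, index * ITEM, 3.14)
        mm.flush()

# Read the value back through ordinary file I/O to confirm it was persisted.
with open(path, "rb") as f:
    f.seek(index * ITEM)
    (value,) = struct.unpack("<d", f.read(ITEM))

os.remove(path)
print(value)
```

The same pattern applies to a file larger than RAM, since only the pages that are actually touched need to be resident.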

wesm commented 6 years ago

No support for arrays with ≥2^31 elements, which means that datasets with over 2B rows cannot be stored contiguously.

We support arrays > 2^31 elements. This was changed a pretty long time ago. There might be a document here or there which still says that lengths are limited to int32_t, we just need to fix that.

No support for in-place NA values (R-style). Without this we would not be able to memmap data from disk directly, and lose the ability to work with datasets that do not fit into RAM.

This isn't true -- memory mapping still works fine, but the null bits and the values are in different memory regions.

wesm commented 6 years ago

For what it's worth, I have reviewed pydatatable's bespoke columnar in-memory format and don't see reasons why you could not use Arrow as your primary runtime memory format (in-memory and memory-mapped on-disk). There are some minor things like string values with > 2B length (there is an open issue about adding a "large binary" type with 64-bit offsets). It's a bit too bad as we are developing a lot of software that could be utilized natively in this project without need for any data marshalling.

This would be a valuable discussion to have as an open source community; if there's something that the Arrow community could build to be of service, it would be good to know.

st-pasha commented 6 years ago

@wesm Hi Wes, thanks for taking interest in this discussion. You (and the OP) raise several questions, which are rather similar to each other, but the answers are quite different.

1. Why did we not go with Apache Arrow from the beginning? The beginning was more than a year ago. Arrow was called Feather back then, it was younger and had fewer features (at least fewer than we needed). So we decided to create our own, extremely simple format (called NFF). Several months ago we came to the conclusion that NFF is not really good enough, and needs an upgrade. At that point, we reviewed Arrow once again. It was much more mature and feature-rich, but still couldn't quite fulfill all our needs. So we reshaped NFF into Jay, and that's what we use right now.

2. Could we use Arrow as the primary in-memory and on-disk format? It is great to know that the 32-bit limitation on array length in Arrow is long gone, and it is merely a documentation bug that it states otherwise. And I'm sure the similar limit on the total size of string data in a column will also be resolved soon. However, the major remaining obstacle is the storage of NAs. In datatable we use "R-style" NAs: a certain specific value is designated as NA (e.g. for int32 it is -2**31); in Arrow, you use the "masked array style" where a separate bitmask controls how a value is to be treated. Both approaches have their advantages and downsides. But clearly, in order for Arrow to become the primary storage format, one of two things needs to happen:

3. Could we at least allow conversions from/to Arrow Table? Absolutely! We already support conversions from/to pandas DataFrames and numpy arrays, so supporting arrow Tables is only logical. This may even make the effort of (2) unnecessary: if an arrow.Table could be effortlessly converted into a datatable.Frame, then any differences in internal storage format will be irrelevant to the end-user.

wesm commented 6 years ago

hi @st-pasha,

  1. Data representation

You wrote:

Arrow was called Feather back then

This isn't correct. Here is the blog post introducing the Feather format 2.5 years ago

https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/

Critically, I wrote:

Around this time, the open source community had just started the new Apache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data. In discussing Arrow in the context of Python and R, we wanted to see if we could design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.

Feather is, very clearly from its first announcement, a minimalist storage format that uses the Arrow format, and a compact subset of it. Feather was a 2 week experiment to socialize the idea of interoperable data frames with the R and Python communities -- a success, I think, but I guess that it's led to some misconceptions about the Arrow project in general.

  2. NAs

I would be interested in exploring the NA analysis in more depth with you. My guess (which we can verify through benchmarking) is that the performance difference between bitmasks and sentinel values is not going to drive the bottom-line performance of these projects.

Bitmasks have other significant benefits:

One of the main operations where performance may be somewhat degraded is reductions like x.sum(); from my analysis the performance loss is not very significant.
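To make the comparison concrete, here is a minimal pure-Python sketch of a sum reduction under both NA conventions. It is not code from either library: the function names are hypothetical, the sentinel is the int32 value mentioned earlier in the thread (-2**31), and the bitmask follows Arrow's validity-bitmap convention (least-significant bit first, 1 means valid).

```python
from array import array

INT32_NA = -2**31   # sentinel the thread gives for an int32 NA

def sum_sentinel(values):
    """Sum a column, skipping entries equal to the sentinel NA."""
    return sum(v for v in values if v != INT32_NA)

def sum_bitmask(values, validity):
    """Sum a column, skipping entries whose validity bit is 0.

    Bit i lives in byte i // 8, counted from the least-significant bit.
    """
    total = 0
    for i, v in enumerate(values):
        if (validity[i >> 3] >> (i & 7)) & 1:
            total += v
    return total

# Logical column in both cases: [10, NA, 30, 40, NA]
sentinel_col = array("i", [10, INT32_NA, 30, 40, INT32_NA])
masked_values = array("i", [10, 0, 30, 40, 0])   # slots under a null may hold anything
masked_validity = bytes([0b00001101])            # bits 0, 2 and 3 set: valid

print(sum_sentinel(sentinel_col), sum_bitmask(masked_values, masked_validity))
```

The sentinel version needs one comparison per element, while the bitmask version needs an extra lookup in a second buffer. Whether that difference matters in practice is exactly what the benchmarking suggested above would have to show.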

On

arrow format is extended to support "R-style" NAs.

There is nothing to stop you from annotating a type with custom metadata to indicate that you are using sentinel NA values. The downside is that a foreign Arrow consumer won't necessarily know about your special values, but I believe that's an acceptable compromise.

  3. Conversions

Since you don't support nested data, converting from flat Arrow tables to the tables in this library is pretty simple, so even if you never move to Arrow natively internally, you can still benefit from the Arrow ecosystem with an extra marshal step.
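The extra marshal step for a flat int32 column could look roughly like the sketch below. It is a pure-Python illustration under stated assumptions, not the API of pyarrow or datatable: both function names are hypothetical, the column is modelled as a values buffer plus an Arrow-style validity bitmap (least-significant bit first, 1 means valid), and the NA sentinel is the -2**31 value mentioned earlier in the thread.

```python
from array import array

INT32_NA = -2**31   # sentinel NA for int32, as described earlier in the thread

def masked_to_sentinel(values, validity):
    """Copy a bitmask-style column into a sentinel-style column."""
    out = array("i", values)
    for i in range(len(out)):
        if not (validity[i >> 3] >> (i & 7)) & 1:
            out[i] = INT32_NA          # overwrite null slots with the sentinel
    return out

def sentinel_to_masked(values):
    """Copy a sentinel-style column into values plus a validity bitmap."""
    validity = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v != INT32_NA:
            validity[i >> 3] |= 1 << (i & 7)   # mark slot i as valid
    return array("i", values), bytes(validity)

# Logical column: [1, NA, 3]
col = masked_to_sentinel(array("i", [1, 99, 3]), bytes([0b101]))
round_trip_values, round_trip_validity = sentinel_to_masked(col)
print(list(col), round_trip_validity)
```

One consequence of the sentinel convention is visible here: a genuine value of -2**31 cannot survive the round trip, because it is indistinguishable from NA.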

wesm commented 6 years ago

Another matter to keep in mind: MPL 2.0-licensed code cannot be reused in projects (except in binary form) in the Apache Software Foundation.

https://www.apache.org/legal/resolved.html#category-b

st-pasha commented 6 years ago

This isn't correct. Here is the blog post introducing the Feather format 2.5 years ago

Ah, my bad. Being the author, you, of course, know better. I guess what I meant was that, purely from a user's perspective, there was a package called "feather(-format)". And that package was used to create ".feather" files. The Arrow, as a technology, did not have a standalone presence in python. Today, the feather package has been merged into the pyarrow package.

MPL 2.0-licensed code cannot be reused in projects (except in binary form) in the Apache Software Foundation

Generally, MPL-2-licensed code can be included in source form into other projects, including into the Apache-v2 projects, provided the following condition is met: The included source files retain their MPL-2 license. Your project will still be Apache-v2, but somewhere inside there will be an island of MPL-licensed code. Your project can still be used freely, for any purpose, except that the license of that small MPL piece cannot be altered. This licensing peculiarity really only becomes relevant if someone were to take your project, modify it, then relicense it under a different license (e.g. GPL or a closed-source commercial license), and finally distribute it -- then, and only then, will they be forced to publish whatever changes they made to the MPL piece.

Now, it is possible that Apache Software Foundation puts additional restrictions on which licenses can or cannot be used, beyond what is allowed by Apache-v2 license. However, I do not know how to change their mind.

wesm commented 6 years ago

The Arrow, as a technology, did not have a standalone presence in python.

I don't know what you're saying; factually speaking this is not correct either.

It's true that the libraries did not have a huge install base at the time that this repository's history commenced, but to say that an active Arrow-for-Python project did not exist is not accurate. You can have a look at the source tree for yourself as of 2/2017 https://github.com/apache/arrow/tree/f6924ad83bc95741f003830892ad4815ca3b70fd/python.

Here are some blog posts written then or earlier:

However, I do not know how to change their mind.

Reciprocity at the source level is not something the ASF is going to change its mind about. Being able to use, but not modify, a piece of code practically makes that code useless in the context of an Apache-2 project IMHO. Even outside of an ASF project (e.g. in pandas or another project) I would not be willing to take on weak copyleft code because of the risk of IP contamination.

But perhaps this is venturing into open source ideology. I would advise you to consider changing your license to Apache-2 to be more compatible with the Python ecosystem, which is BSD/Apache-2-centric.

I would have liked to have had the opportunity to discuss these things with you 18 months ago but I only recently became aware of this project's existence and I'm reaching out now to see if we can find a way to help each other going forward. Each developer is entitled to their own decisions, but it is a shame that our work is not more mutually beneficial at the moment. To summarize the points I've made:

st-pasha commented 6 years ago

But perhaps this is venturing into open source ideology. I would advise you to consider changing your license to Apache-2 to be more compatible with the Python ecosystem, which is BSD/Apache-2-centric.

I agree, it would have been nice to have Apache-2 license for the project. I'm not entirely free in this choice. Still, the current situation is much better than if it was GPL. I do not know whether it will be feasible to switch to Apache in the future, but I'd be looking into that direction. At the very least, I believe I can dual-license certain parts of the code that might be useful for other projects.

Speaking of GPL. There is a package called plotnine, which is an awesome visualization library, based on R's ggplot2. Being a derived work, it carries a GPL-2 license; however, the author of plotnine claims that he'd be much happier if he could license his work under BSD-2. Which means if Hadley -- the main author of ggplot2 -- was kind enough to dual-license the "data transformation pipeline" part of his library as GPL/BSD -- then the Python community will be enriched with another great free-to-use package.

wesm commented 6 years ago

How would Python programmers be able to reuse an R package? I don't think there's anything wrong with releasing GPL packages in R because the whole ecosystem is heavily copyleft. In the Python data ecosystem, copyleft is unusual and will generally result in people either not using or not contributing to a project. I certainly would not.

mattdowle commented 6 years ago

@wesm Are you aware of data.table's change of license from GPL to MPL? The PR was https://github.com/Rdatatable/data.table/pull/2456. I'm interested in your thoughts on my thoughts expressed in some detail there.

wesm commented 6 years ago

I can read that in more detail and comment in more depth when I have some time. In the R world, GPL vs. MPL doesn't seem to make a practical difference since distributing applications written in R will generally carry a transitive GPL dependency. There are situations where MPL would be better though.

Personally, my preference is for all contributors to an open source project that I am involved with to have equal, unconstrained freedoms. This includes the freedom to create and distribute a closed source derivative work. In our modern times it is incredibly difficult to sustain open source projects, and to place restrictions on the reuse of works contributed to a project does not seem fair to me. So if I contribute a patch to a project, I want to be able to reuse my patch as I please. And I think it is fair that everyone else that I ask to contribute to the same project have that same freedom. I don't think people who feel differently are "bad" or "wrong", it's just my preference about what kinds of projects I will choose to involve myself with.

I understand there are OSS contributors who do not want to allow others to create closed source derivative works, and maybe they won't contribute to an Apache2/MIT-licensed project. That's OK with me; at that point we are basically having a religious debate and I think each person is entitled to their opinion. I don't think there's a license available that will satisfy all parties.

mattdowle commented 6 years ago

@wesm Great - yes please do read the PR in full when you've time. I'm hoping you will appreciate that at least I did try to find middle ground.

wesm commented 6 years ago

hi @mattdowle, I read through your PR. I appreciate the time and thoughtfulness you put into it. I think it articulates the general copyleft sentiment and the use of MPL allows closed source products (like the ones of your employer) to use the open source project in unmodified form without IP contamination.

When I spoke about open source ideology above, here is where we diverge

What we intended was that we contributed free software in the open under one condition: that if you improve data.table itself and distribute your improved data.table as a product, then that product must be open-source too.

and

We do not like those licenses for data.table because we think them unfair to the contributors.

I think this is the irreconcilable difference of copyleft adherents and permissive license adherents.

I do not feel that permissive licenses are unfair to contributors; quite the opposite: I think that strong copyleft and weak copyleft licenses are unfair to contributors. Why? Well, because, by act of contributing your IP to a project, that IP has basically stopped being yours anymore (because it generally only has value in the context of where it is contributed). It now belongs to the Project, and you cannot use it as you wish. The copyright holders can all agree to change the license later, but if you change to Apache/MIT, as you point out, there's no going back.

In permissively-licensed projects, the IP never stops being yours. In fact, the IP belongs to everyone. Your copyright is preserved, and you agree (in the case of Apache-2) that, for any software patents you hold (in the United States), you are permitting users of that software an irrevocable right to use it as they please.

Personally, if someone creates "pandas pro" and makes money off it, it'd be perfectly fine with me. That's the deal I've made with the open source community. By the same token, I can customize the software for my own purposes and ship it in a closed source project. I can understand the desire to prevent any of these things from happening.

As one small point: one head-scratching bit about MPL 2.0 is that it seems to be a bit of a No-Man's-Land between GPL-like licenses and permissive licenses. It's too liberal to satisfy Stallmanites, and too restricted (because of the inability to reuse code in permissive projects) to attract permissively-minded contributors. I will be interested to see what kinds of individuals are attracted to contributing to these projects.

As another point: what license makes most sense is heavily project dependent. I mostly work on "data tool" projects where I have relied on a great deal of code reuse from other projects and have been able to make a lot of rapid progress that way, for the benefit of the open source world. For building "end user applications", where reuse of code might not be as interesting, I could see myself using GPL or contributing to a GPL project. But not for something related to data science or data manipulation. That's just my preference.

In any case, I look forward to finding some ways to collaborate when it comes to the Apache Arrow ecosystem -- there's a lot of exciting stuff going on or being planned for the coming years, and I'd like as many projects and ecosystems to benefit from the fruits of our labors as possible. I dream of a less fragmented and more efficient and productive world, and that's what drives a lot of what I do. Thanks for reading

wesm commented 6 years ago

By the way, one of the stipulations of projects in the Apache Software Foundation is that derivative works are not allowed to use the project trademark to promote their product, except by the limited "Powered by" use. So you could sell "Data Power Tool powered by Apache Arrow" (OK) but not "Apache Arrow Pro" (enforceable trademark violation)

mattdowle commented 6 years ago

Thanks @wesm for reading and your thoughts. I believe we are going to meet for the first time at DSC Stanford on Monday/Tuesday so perhaps we can discuss further there. I look forward to meeting you.

If pydatatable was Apache-2 copyright H2O, would you contribute/use it? Or need it be Apache-2 copyright Apache Foundation before you would contribute/use?

I think that strong copyleft and weak copyleft licenses are unfair to contributors. Why? Well, because, by act of contributing your IP to a project, that IP has basically stopped being yours anymore (because it generally only has value in the context of where it is contributed). It now belongs to the Project, and you cannot use it as you wish.

If I understand correctly, this isn't true. When r-datatable contributors contribute, they still own the IP for their changes. When I changed the license of r-datatable I couldn't do that without asking every single past contributor to the project, which I did. If the IP didn't still belong to the contributors, I wouldn't have had to ask them. They can take their own contributions and do with them as they wish. Contrast this to projects which are owned/copyright by a company, specific person or a foundation. There I find it easier to see how the contributor is giving up their IP to the project owner.

Personally, if someone creates "pandas pro" and makes money off it, it'd be perfectly fine with me.

I find this hard to believe! Really? Perhaps you mean that you don't think that will happen, no? Let's say it was called grizzly to avoid the trademark violation and had the same API as pandas so everyone switched to it easily by paying for it.

I dream of a less fragmented and more efficient and productive world, and that's what drives a lot of what I do.

I agree and this is admirable. I look forward to meeting you on Monday/Tuesday where perhaps we can discuss further.

wesm commented 6 years ago

Yes, let's definitely talk more in person next week -- looking forward. A couple small points to your comments

If pydatatable was Apache-2 copyright H2O, would you contribute/use it? Or need it be Apache-2 copyright Apache Foundation before you would contribute/use?

H2O copyright would be fine. We acknowledge the copyright holders for third party code that has been reused in part or whole in Apache Arrow: see bottom of https://github.com/apache/arrow/blob/master/LICENSE.txt

If I understand correctly, this isn't true. When r-datatable contributors contribute, they still own the IP for their changes.

From a legal standpoint, you are correct. From a practical standpoint what I'm saying is that the IP may have significantly less value to you because it is usually only useful in the context of the project where the IP was contributed (that's what I meant by "because it generally only has value in the context of where it is contributed"). I imagine that a large fraction of data.table patches are so specific to data.table that they would have little practical value outside of the project, or have to be largely rewritten to be used elsewhere.

When you contribute to a permissive project, the value of your IP is not diminished as you can use it freely along with the code that it requires to be useful (the rest of the project and the IP created by others). Regardless of your ideology (copyleft-leaning vs not) I think you must concede that this is the case (restrictions on reuse vs. no restrictions on reuse).

I find this hard to believe! Really? Perhaps you mean that you don't think that will happen, no? Let's say it was called grizzly to avoid the trademark violation and had the same API as pandas so everyone switched to it easily by paying for it.

Yep, I am 100% serious (assuming that a trademark violation or other marketing abuse did not occur).

Personally, the value I derive from open source is the community and the free exchange of ideas and code. Someone who goes closed source and starts selling something is closing themselves off to the OSS community in a lot of cases.

I'm a bit of an anarchist in this regard -- if my IP is used to create a product that is successful or that is disruptive to the status quo in some way, that is a win in my book even if the win for me is not monetary in nature. In the best case scenario (which turns out to not be all that rare), the for-profit closed source project may serve as a source of funding and maintenance support for the project.

To play devil's advocate, a certain Major Tech Company has recently garnered a reputation for creating for-profit service offerings based on permissively licensed OSS projects, while giving little or nothing back to the OSS projects. This has definitely tested the faith of some. Personally I think it's a short-term-greedy-long-term-foolish thing -- that strategy can't be run forever if the community-oriented OSS world burns out and begins to fade away. Here's a talk I gave about this recently: https://blog.dominodatalab.com/importance-community-led-open-source/

srisatish commented 6 years ago

Good discussion folks! From an open source standpoint, H2O.ai (the company supporting the amazing team of developers on datatable) is boldly committed to open source and in fact supports or prefers picking the most permissive model: Apache / BSD-type / MIT by default. However, as strong supporters of our maker culture, we anoint our makers with the freedom to choose the licensing model that best represents their thinking for the project. In this case @mattdowle and @st-pasha, the lead makers on the project, proposed & chose MPL. They might be persuaded along with the rest of the developers on both the projects, perhaps :)

I personally think support for Apache Arrow will bring a wider audience for this work. And the possibility of bringing the robust communities of data science in R and Python closer would be good, sans fear of ownership and license pollution.

Our open source philosophy is one that allows for maximum use and feedback, even from forces that seem competitive in the near term but help make big ideas succeed in the long run. Community building, feedback and freedom are at the core of our open source philosophy. Open source for us is not a business model. It is what we believe is the right thing to do for software, and makers seeking big impact with their works should be encouraged to choose the most viral license.

yohplala commented 3 years ago

Duplicate of #1819?

Hoeze commented 3 years ago

* No support for arrays with ≥2^31 elements, which means that datasets with over 2B rows cannot be stored contiguously. This limitation also applies to the total length of all items in an array-valued column, which means for example that a string column cannot have over 2GB of data in it.

@st-pasha I think this is also not true any more:

Array lengths

Array lengths are represented in the Arrow metadata as a 64-bit signed integer. An implementation of Arrow is considered valid even if it only supports lengths up to the maximum 32-bit signed integer, though. If using Arrow in a multi-language environment, we recommend limiting lengths to 2^31 - 1 elements or less. Larger data sets can be represented using multiple array chunks.

Source: https://arrow.apache.org/docs/format/Columnar.html#array-lengths
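The "multiple array chunks" idea from the quoted passage can be sketched as follows. This is a minimal illustration, not Arrow's implementation: chunk_lengths and locate are hypothetical helper names, and the 10B-row total is just the example size used earlier in the thread.

```python
MAX_CHUNK = 2**31 - 1   # largest length recommended for multi-language use

def chunk_lengths(total, max_chunk=MAX_CHUNK):
    """Split a logical length into chunk lengths of at most max_chunk."""
    full, rest = divmod(total, max_chunk)
    return [max_chunk] * full + ([rest] if rest else [])

def locate(index, lengths):
    """Map a global row index to (chunk number, offset within that chunk)."""
    for chunk, n in enumerate(lengths):
        if index < n:
            return chunk, index
        index -= n
    raise IndexError("row index out of range")

lengths = chunk_lengths(10_000_000_000)
print(len(lengths), locate(2**31 - 1, lengths))
```

A linear scan over the chunk list is fine for a handful of chunks. With many chunks, a cumulative-offset table and binary search would be the more usual choice.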

st-pasha commented 3 years ago

This is true. Since Arrow became more accommodating, we are slowly shifting certain details of internal representations to conform to Arrow's format.

oleksiyskononenko commented 3 years ago

It would at least be nice if you allowed to_arrow_table and from_arrow_table conversions.

This has been implemented recently:

Closing this issue, feel free to reopen if needed.