scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License
215 stars, 39 forks

Request for .frompandas() function #215

Closed NumesSanguis closed 4 years ago

NumesSanguis commented 4 years ago

In your documentation you often mention awkward.topandas(), but how about the other way, an awkward.frompandas()?

I looked in the Python file where .topandas() was defined: https://github.com/scikit-hep/awkward-array/blob/d942fb8d4fae5e1dec35c70938e24c05207b3f31/awkward/util.py#L213 , but nothing about loading DataFrames there.

I also tried with some code, but this failed:

import awkward
import pandas as pd

df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
print(df.head())
#    foo  bar
# 0    2  0.3
# 1    8 -0.9

af = awkward.fromiter(df)
print(af)
# ['foo' 'bar']

df_awk = awkward.topandas(af, flatten=True)
print(type(df_awk))
# <class 'pandas.core.series.Series'>
print(df_awk.head())
# 0    foo
# 1    bar
# dtype: object

Applying .fromiter() only gets the column names.

TL;DR How to convert a Pandas DataFrame to an awkward-array and vice-versa?
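
A note on why only the column names come back: iterating over a pandas DataFrame yields its column labels, not its rows, so anything that consumes the DataFrame as a plain iterable (as awkward.fromiter appears to do here) only ever sees 'foo' and 'bar'. A minimal pandas-only sketch of that behavior:

```python
import pandas as pd

df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})

# Iterating a DataFrame walks over the column labels, not the rows.
labels = list(df)

# Rows have to be requested explicitly, e.g. as plain tuples.
rows = list(df.itertuples(index=False, name=None))

print(labels)
print(rows)
```

So the result above is probably not an awkward bug as such; the DataFrame simply never hands its values over when treated as a generic iterable.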

jpivarski commented 4 years ago

If you don't have any MultiIndex indexes or columns to convert into JaggedArrays or Tables, respectively, then you can get each flat column out to a flat array with

df["foo"].values
df["bar"].values

If you want to wrap these up in an awkward Table, pass them to the Table constructor.

Let me know if that's what you need. I didn't think a frompandas function would be necessary because (unless we're talking about interpreting MultiIndex in a special way) it's easy to do this manually. If this doesn't work or is insufficient for you, go ahead and reopen this issue.

NumesSanguis commented 4 years ago

@jpivarski Could you give me a working example with the DataFrame above? Because these are not working for me:

1:

import awkward
import pandas as pd

df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})
aw_table = awkward.Table(df)
awkward.topandas(aw_table, flatten=True)  # or False
# True: ValueError: this array has unflattenable substructure: [0, 2) -> float64
# False: ValueError: If using all scalar values, you must pass an index

2:

aw_table = awkward.Table(awkward.fromiter(df))
awkward.topandas(aw_table, flatten=True)  # or False
#      0
# 0  foo
# 1  bar

3:

aw_table = awkward.Table([df["foo"].values, df["bar"].values])
awkward.topandas(aw_table, flatten=True)  # or False
# True: ValueError: this array has unflattenable substructure: [0, 2) -> float64
# False: ValueError: If using all scalar values, you must pass an index

4:

aw_table = awkward.Table(awkward.fromiter([df["foo"].values, df["bar"].values]))
awkward.topandas(aw_table, flatten=True)  # or False
# True: 
#        0
# 0 0  2.0
#   1  8.0
# 1 0  0.3
#   1 -0.9

# False:
#              0
# 0      [2. 8.]
# 1  [ 0.3 -0.9]

In all 4 attempts the original DataFrame was not correctly reconstructed. What am I doing wrong?

I think converting from and to DataFrames would be a quite standard operation for people in the field of Machine Learning. So if it is indeed me not properly understanding, I think it would be good to include an example like this in the Documentation.

An argument for .frompandas(), purely from a user perspective: if they know .topandas() exists, they might try .frompandas(), because it would be natural for that function to exist. Also, with v1.0.0, some logic for converting Pandas DataFrames might change. Manual conversion code might break, but a .frompandas() can be changed without breaking users' code.

EDIT: Something like df["foo"].values requires that the inside of the DataFrame is known at code time. An assumption which is not necessarily True.


About closing issues: Non-contributors cannot reopen it.

From other issues here, you seem very keen on closing issues quickly, probably to only keep meaningful issues in the tracker? Understandable, however, this can give a feeling of non-appreciation to the user who just wants to help this project. In the end it's the author of the issue who knows best when an issue is resolved, and they feel appreciated if they have ownership of this.

This page gives a good overview of best practices for open source projects: http://zguide.zeromq.org/page:chapter6

The user who created an issue SHOULD close the issue after checking the patch is successful.

When one person opens an issue, and another works on it, it's best to allow the original person to close the issue. That acts as a double-check that the issue was properly resolved.

More reading on building online communities (if you're interested): https://hintjens.gitbooks.io/social-architecture/content/

If you still want to keep the issue tracker clean, I would recommend using a Stale bot (https://github.com/apps/stale). After some inactivity, it will mark it as stale, and eventually automatically close it.

jpivarski commented 4 years ago

Sorry—I didn't realize you couldn't open it. My only reason for closing it was so that I would have a better idea of which ones I need to worry about (i.e. the open ones). The volume of questions (not just from GitHub Issues) is getting to be a difficult management problem. Whenever I've been closing them, I've included some text to explain that it's not final—I've been considering them "done for now"—but that was based on the assumption that you could reopen them. I won't do that anymore. (Maybe I'll have to find a label or something, but I can't set labels on my phone.)

What happens when you call df["foo"].values? You get a one-dimensional NumPy array, right? The awkward.Table constructor takes arrays as arguments (i.e. no awkward.fromiter involved).

Does

awkward.Table(foo=df["foo"].values, bar=df["bar"].values)

do what you want?

jpivarski commented 4 years ago

I just got a chance to try this out on a computer and it works. More generally,

awkward.Table({name: df[name].values for name in df.columns})

for all your DataFrame's columns.
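
The one-liner above could be wrapped in a small helper. This is only a sketch: the name columns_to_arrays is hypothetical (not part of awkward), and it assumes a DataFrame with flat columns and no MultiIndex. It uses only pandas; the awkward.Table call is left as a comment because it needs Awkward 0.x installed.

```python
import pandas as pd


def columns_to_arrays(df):
    """Hypothetical helper: map each column name to its flat NumPy array."""
    # Column names are discovered at run time, so nothing about the
    # DataFrame's layout has to be known when the code is written.
    return {name: df[name].values for name in df.columns}


df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})
columns = columns_to_arrays(df)

# With Awkward 0.x installed, this dict is what goes into the constructor:
#     aw_table = awkward.Table(columns)

# Sanity check: the arrays alone are enough to rebuild the DataFrame.
rebuilt = pd.DataFrame(columns)
print(rebuilt.equals(df))
```

Because the column names come from df.columns, this also covers the case where the contents of the DataFrame are not known when the code is written.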

Back to the question of closing issues: it is very rare for the original poster to close the issue. The issues usually lie open for weeks after I think I've answered the question but the users don't follow up. I end up closing old issues in sweeps. I just did a sweep recently (more on uproot than awkward, I think) and resolved to start closing early. Each close had a message to try to avoid sounding dismissive.

I don't have any centralized tracking, but I should figure out how to do that. I'm getting bug reports, usage questions, and feature requests from GitHub Issues (where the bug reports and usage questions belong), StackOverflow (where I'm trying to redirect the usage questions), Slack, Skype, and email (where I'd rather not get any, since they're not public and they get mixed in with a lot of other conversations). I'm reading the GitBook you sent. Thanks!

NumesSanguis commented 4 years ago

Thank you! It seems to mostly work. Shouldn't flatten=True on a DataFrame without MultiIndex return the exact Pandas DataFrame as before?:

import awkward
import pandas as pd

df = pd.DataFrame({"foo": [2, 8], "bar": [0.3, -0.9]})
print(df.foo)  # works

aw_table = awkward.Table({name: df[name].values for name in df.columns})
# no MultiIndex in this DataFrame, so flatten=True should work
aw_pd = awkward.topandas(aw_table, flatten=True)  # False works correctly though
print(aw_pd.foo)

Expected with flatten=True: Same DataFrame as before conversion to awkward.Table()
Reality: TypeError: only integer scalar arrays can be converted to a scalar index


I would still argue for a .frompandas() function.

  1. While it might seem obvious to the developer of this library that awkward.Table({name: df[name].values for name in df.columns}) should work, to someone new to this library (like me or a colleague that has never used Awkward), this is still a long line of code that only tells me "it does something with a Pandas DataFrame". Like, it might only load a part of it, it might only partly support Pandas DFs and therefore we have to create a line of code for it. .frompandas(), however, would give a sense of full support.
  2. Googling Awkward from Pandas or Read Pandas DF into Awkward won't point to any of your documentation (because it doesn't exist). Maybe a StackOverflow question for this search query is on top, if the searcher is lucky (which might be outdated when 1.0.0 comes along). It might also help with finding relevant tutorials.
  3. If an Awkward Table is converted with awkward.topandas(array, flatten=False), and a column has dtype awkward, like in your documentation:
    df.x
    # Name: x, dtype: awkward

    there should be a faster way of loading it than a for loop through columns, right? I mean, the DataFrame will already be in an Awkward-compatible format, so something faster than for should be possible, right? That could be captured with a .frompandas().

If this still is not convincing, I give up and stop ranting about this ^-^ I just think .frompandas() will signal full support of Pandas, which would help the uptake of this library among the ML / Data Analyst / Data Scientist community.


Off-topic: About Awkward (You can make the above line in Markdown with ----- + newline)

I seriously think you're doing a great job! This library is very powerful and useful (as far as I've tested it), so I can totally understand that you're swamped by questions and such, and it's hard to deal with all that (mostly alone?). I'm actually surprised that this library is not more popular among the Machine Learning community. I, for example, have long been looking for a way to deal with audio of varying length. I need to apply functions to each row, slice among them and store it in a format like HDF5. With Numpy I was limited to for-loops. I guess it is not that popular yet among ML, because images are the most popular type of data to build models with, and those are square, which means Numpy is enough.

If e.g. the PyTorch community, and specifically the Audio branch of it (torchaudio: https://pytorch.org/blog/pytorch-1.2-and-domain-api-release/#torchaudio-03-with-kaldi-compatibility-new-transforms), or maybe Text (which is now using stop-tokens to give all sentences the same length), gets on board with this library, I think it will give it a huge boost in terms of community. I've tried the first promotion of your library already ^-^: https://github.com/pytorch/pytorch/issues/22169#issuecomment-555813390 The other major Deep Learning library, TensorFlow, already has a way to deal with irregularly sized arrays: https://www.tensorflow.org/guide/ragged_tensor

I also see potential in better integration with Pandas. Now Pandas is limited to single values per column-row cell. With this library I think it's possible for a single cell to have a whole Array (image, audio, etc). Meaning, that you can have 1 reference, that is searchable, for data and labels.

jpivarski commented 4 years ago

I'll make sure that Awkward 1.0 has a frompandas that is symmetric with topandas. There's no reason to expect it to be faster, but it would be a more user-friendly interface.

Also, the explicit structure classes like JaggedArray and Table are becoming internal details, precisely because I saw that people were having trouble with them when I presented them in tutorials. There will be a single awkward.Array class for all types, and therefore "use the Table constructor" won't be something I can recommend. There will have to be "to" and "from" functions for all the external libraries we link to. (I guess I could have suggested using pyarrow to convert the DataFrame into Arrow, then use awkward.fromarrow, which already exists...)

You're right that the only Awkward 0.x documentation is on the README. When it became clear that Awkward 0.x was headed for a brick wall of maintainability and I had to fix the technical debt with a redesign, I was left with the problem that Awkward 0.x had no documentation. There's a trade-off between adding more to Awkward 0.x and getting the Awkward 1.0 sprint done, so I compromised by writing that really long README.

Awkward 1.0 will have good reference documentation, but I still need to learn how to write "how to" documentation. The biggest problem for me is to figure out what problems people need "how tos" for. That's why I'm trying to encourage the use of StackOverflow—it will give me more feedback on where the desire paths are.

I'm currently working alone on the maintenance of uproot and awkward while trying to get Awkward 1.0 into a usable state. Having summer students generally means developing more features, rather than easing the load, because the students' projects have to be somewhat separate to be well-defined.

I really like that book you pointed me to, since it addresses exactly my problem—how to scale up a project beyond the individual developer level. (Before uproot, I never had enough of a userbase that maintenance was hard to keep up with.) Incidentally, the author of that book said that ZeroMQ had a no-feature-request policy: they grew a community of developers by only accepting contributed features, not requests. :) But I think a project needs to reach a certain maturity level before that's possible, so that user-developers can see how a contribution fits in.

NumesSanguis commented 4 years ago

Thanks for the explanation :)

The README is pretty good for something that has just been added as a temporary solution while bridging the transition period to 1.0 ^-^ Most Python open source projects host their documentation on Read the Docs (free). You can set up a hook to GitHub, which will then automatically generate the documentation from a docs folder in your master branch, and optionally others (keeps the documentation in sync with the actual code version). Here is how I did it: https://github.com/NumesSanguis/FACSvatar and the related Read the Docs: https://facsvatar.readthedocs.io/en/latest/ The documentation can be written in either .rst or Markdown format (last time I checked). While .rst is more advanced (why I chose it), I would recommend Markdown, as it is more generally well known. If I were choosing now, I would choose Markdown. Please don't take my documentation as an example though. It is quite neglected O:)

For your documentation structure, maybe it would be clearer if you separated it into 4(+) topics: concepts, beginner examples, interoperability with other libraries out there, and how Awkward works under the hood, in that order.

Maybe this fits better in your Awkward 1.0 docs? Feel free to copy it.

That's true about students for a project like this indeed.


Glad you like the book ^-^ It's true about asking for .frompandas(). I should have named it: "Should we add a .frompandas() and how should we implement it?". Now that I understand it is done with awkward.Table({name: df[name].values for name in df.columns}), I could write a PR for it :) Should I (if yes, please open this issue again)?

jpivarski commented 4 years ago

Uproot uses readthedocs, but it also has docstrings, which can be automatically rendered there. In my experience, users overwhelmingly read only the README, because a lot of their questions would have been answered if they had gone to the readthedocs. It might have something to do with it being two sites: scrolling is easier/more discoverable than clicking?

That's a good breakdown for the documentation, though the problem I need to solve is, "Which 'how to' articles to write?" My guesses about what people need to know have been a little off, which is why I want to crowdsource it.

And, of course, all of this takes time!

You're free to write a PR for frompandas, and that would be especially good if you think others are going to hit that problem before Awkward 1.0 is ready in the spring (the goal is "ready without uproot" by March, "ready with uproot" by May).

You also found a bug in topandas: the flatten=True apparently doesn't work if the data are flat. That is the simpler case. Note that flatten=True is an entirely different code path than flatten=False; I'll be harmonizing them in the new version, but right now, flatten=False puts awkward arrays inside Pandas columns with a new dtype and flatten=True fully converts the awkward data into plain arrays before passing to Pandas.
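
To make the flatten=True layout concrete, here is a pandas-only sketch that mimics the MultiIndex output from attempt 4 above. It is not awkward's implementation: flatten_jagged is a hypothetical name, and it assumes the input is a plain, non-empty list of variable-length lists.

```python
import pandas as pd


def flatten_jagged(rows):
    """Hypothetical sketch: one output row per inner element, indexed by
    (outer position, inner position)."""
    index = []
    values = []
    for outer, row in enumerate(rows):
        for inner, value in enumerate(row):
            index.append((outer, inner))
            values.append(value)
    # Assumes at least one element overall; an empty index list would
    # not let pandas infer the number of levels.
    return pd.DataFrame({0: values}, index=pd.MultiIndex.from_tuples(index))


flat = flatten_jagged([[2.0, 8.0], [0.3, -0.9]])
print(flat)
```

In this picture, already-flat data is the degenerate case with no inner level at all, which is the case the bug above concerns.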

NumesSanguis commented 4 years ago

Sorry it took so long. I created a function based on your insights. Please review it :)

jpivarski commented 4 years ago

This issue would probably automatically close when I merge the pull request, but since you've done all the work and it will be merged (because I approve), I'll close this now, just in case.