pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Bloomberg Hackathon #8323

Closed jreback closed 9 years ago

jreback commented 10 years ago

Contributing Guidelines / Help: https://github.com/pydata/pandas/wiki

Dev Docs http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

Docs:

Perf:

Tests:

Bugs:

Enhancements:

IO:

Excel Oriented:

SQL:

More advanced:

Collaborative Efforts:

@jorisvandenbossche @cpcloud @TomAugspurger @hayd cc @shoyer cc @immerrr

jreback commented 10 years ago

cc @seth-p cc @rockg

jreback commented 10 years ago

most of these are doc/testing things. I looked through the 'Good as first PR' label. Anyone have any issues to add that are not on that list?

jtratner commented 10 years ago

More customization of Excel input/output could be great, i.e. making it easier to specify per-column colors/formatting, float formats, etc. The code base isn't too complicated there (just a mixture of the formatter and the ExcelWriter stuff) and you could make rapid progress because it's really easy to test and create samples. I think the result would be very immediately rewarding (better looking things, easier to make reports, etc.). Plus for #4679 and #8272 you'd get a better sense of pandas internals too.
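For a flavour of what this looks like today, here is a rough sketch of the usual workaround: write with to_excel, then drop down to the engine's workbook/worksheet objects. It assumes the xlsxwriter engine is installed; the file name, colour, and column letters below are just placeholders.

import pandas as pd

df = pd.DataFrame({"price": [1.2345, 2.5], "qty": [10, 3]})

with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Sheet1")
    workbook = writer.book
    worksheet = writer.sheets["Sheet1"]
    # per-column number format and background colour via the engine's own API
    money = workbook.add_format({"num_format": "0.00", "bg_color": "#FFEECC"})
    worksheet.set_column("B:B", 12, money)  # column B holds 'price'; column A is the index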

List of PRs (ordered from most interesting/most impact to least interesting):

jreback commented 10 years ago

@jtratner thanks! i'll update!

jtratner commented 10 years ago

One other great (but self-contained) project would be to convert a pandas DataFrame into a new BigQuery table when writing. I've been working with BigQuery quite a bit; it would be pretty simple to do and would be a nice way to dig into dealing with column metadata. I'll put up an issue right now with more details.
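For reference, a rough sketch of the interface that eventually landed as DataFrame.to_gbq (now provided by the pandas-gbq package; see #8325). The project and table names are placeholders, and Google Cloud credentials are assumed to be configured.

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# requires pandas-gbq; 'my-project' and 'my_dataset.my_table' are placeholders
df.to_gbq("my_dataset.my_table", project_id="my-project", if_exists="replace")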

jreback commented 10 years ago

@jtratner thanks! that would be great!

TomAugspurger commented 10 years ago
jtratner commented 10 years ago

@jreback - I put it up, should be pretty simple to implement, except for how to handle int columns that gain NaN values. #8325
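As a quick illustration of the int-column problem mentioned above (not part of the original thread): NumPy-backed integer columns cannot hold NaN, so pandas upcasts them to float as soon as a missing value appears.

import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])  # dtype: int64
s2 = s.reindex(["a", "b", "c", "d"])             # 'd' has no value -> NaN
print(s2.dtype)                                  # float64: the ints were upcast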

rockg commented 10 years ago

These are things that I would like to see:

Allow reindex to work without passing a completely new MultiIndex (i.e., reindexing a level copies the other levels): #7895
Some HDFStore enhancements which should be straightforward: #6857

jreback commented 10 years ago

thanks @TomAugspurger @rockg @jtratner

jorisvandenbossche commented 10 years ago

For the doc issues, maybe also https://github.com/pydata/pandas/issues/3705 and https://github.com/pydata/pandas/issues/1967?

jreback commented 10 years ago

@jorisvandenbossche thanks!

jorisvandenbossche commented 10 years ago

I was also thinking: some utility function that can 'read' the printed output of a DataFrame back in would be nice. For simple situations you can use read_csv or, better, read_fwf, but for more complex things (index names, multi-indexes, ...) I think this no longer works.

jreback commented 10 years ago

@jorisvandenbossche not sure what you mean. Except for the column index losing its name (not a multi-index though), csv round-tripping preserves everything.

jorisvandenbossche commented 10 years ago

but I do not mean csv roundtripping, I mean console print roundtripping

jorisvandenbossche commented 10 years ago

Is there an easy way to read this in (the output as a string)?

In [1]: df = pd.DataFrame(np.random.randn(4,4), index=pd.MultiIndex.from_product([['a', 'b'],[0,1]], names=['l1', 'l2']), columns=['a', 'b', 'c', 'd'])

In [2]: df
Out[2]: 
              a         b         c         d
l1 l2                                        
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500

Dealing with the multi-index, dealing with the sparse index, index names, ... (or, to start with, at least not choking on those)

jreback commented 10 years ago

I think the clipboard is pretty robust (it's just read_csv underneath). It needs various options specified, but csv is not a completely fungible format anyhow (unlike, say, HDF5, where you CAN store the meta-data).

In [25]: df.to_clipboard()

In [26]: pd.read_clipboard()
Out[26]: 
  l1  l2         a         b         c         d
0  a   0 -0.114687 -0.111372  1.116020 -1.127915
1  a   1  1.493011 -0.208416 -0.129818 -0.023854
2  b   0  0.904737 -0.213157 -0.214423  0.300431
3  b   1  0.043716 -0.027796 -0.462323  0.298288

In [29]: pd.read_clipboard(index_col=[0,1])
Out[29]: 
              a         b         c         d
l1 l2                                        
a  0  -0.114687 -0.111372  1.116020 -1.127915
   1   1.493011 -0.208416 -0.129818 -0.023854
b  0   0.904737 -0.213157 -0.214423  0.300431
   1   0.043716 -0.027796 -0.462323  0.298288
jorisvandenbossche commented 10 years ago

Yes, but what I mean is: if you have this output as a string, or you can copy it (e.g. from an example in the docs, from a question on Stack Overflow, ...), can you convert it easily to a DataFrame in a new session? And using read_clipboard on my example above gives CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6

jreback commented 10 years ago

@jorisvandenbossche hmm works for me on master.

I usually just copy-paste from a question and do this:

data = """

here is the copied data exactly.....

"""
df = read_csv(StringIO(data))

FYI, I tried making this work from just a string (e.g. passing it straight to read_csv); it's a bit non-trivial to figure this out, actually.

jorisvandenbossche commented 10 years ago

Yep, that is what I also do, but still, I mostly have to adapt something in the original data to get it working. It would be nice if there were some utility that could read all such output.

data = """              a         b         c         d
l1 l2                                        
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500"""

pd.read_csv(StringIO(data), sep=r'\s+')

Can you read this in with read_csv without tweaking something? (but we're deviating a bit from the original issue here ...)

jreback commented 10 years ago

@jorisvandenbossche I suppose you could have a wrapper that 'tries' various things, but it's non-trivial to simply guess. Well, you can, but there are so many edge cases that it's MUCH easier to just have the user specify it.

jorisvandenbossche commented 10 years ago

Are there that many edge cases? The output of the pandas __repr__ is rather well defined, isn't it?

jreback commented 10 years ago

ahh, you are proposing a pd.read_csv(data, repr=True) (so that we don't have ANOTHER top-level function!) that basically figures out the options. Hmm, interesting.

jreback commented 10 years ago

@jorisvandenbossche I updated in the Enhancements section.

jorisvandenbossche commented 10 years ago

Maybe it doesn't need to be top-level; another possibility is something like pd.util.read_repr. Seems like a nice little project for somebody to hack on that could be useful.
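As an illustration only, a rough sketch of what such a helper could look like for the MultiIndex example shown earlier. read_repr here is hypothetical (not an existing pandas function) and it only handles that particular layout.

from io import StringIO
import pandas as pd

def read_repr(text, n_index_levels=2):
    # hypothetical helper: parse the console repr of a frame with a
    # sparsely printed MultiIndex back into a DataFrame
    lines = text.splitlines()
    columns = lines[0].split()                       # data column names
    index_names = lines[1].split()[:n_index_levels]  # e.g. ['l1', 'l2']
    body = "\n".join(lines[2:])
    # read_fwf infers the fixed-width column boundaries from the data rows
    df = pd.read_fwf(StringIO(body), header=None, names=index_names + columns)
    # the repr leaves repeated outer index labels blank -> forward-fill them
    df[index_names] = df[index_names].ffill()
    return df.set_index(index_names)

# read_repr(repr(df)) would then rebuild the frame from its printed form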

jorisvandenbossche commented 10 years ago

#8336

TomAugspurger commented 10 years ago

Oh, https://github.com/pydata/pandas/issues/5563 would be a good one (Series HTML repr)

jreback commented 10 years ago

@TomAugspurger

nice posts you have here: tomaugspurger.github.io/blog/2014/09/04/practical-pandas-part-2-more-tidying-more-data-and-merging

About halfway down you pass method='table' to to_hdf, which is ignored (and means you get a perf warning); use format='table' and it will work.
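For clarity, the difference in a minimal sketch (the file name is a placeholder; PyTables must be installed):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# format='table' is the keyword that selects the queryable table format;
# method='table' is not a to_hdf option, which is why it was being ignored
df.to_hdf("store.h5", key="df", format="table")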

shoyer commented 10 years ago

Would #8162 (allowing the index to be referenced by name, like a column) be doable?

I would love to see something like this happen in SF!

cpcloud commented 10 years ago

wow this is great everyone bravo!

jreback commented 10 years ago

let's see how much gets done!

of course the point of this list was to get as much dev time as possible (at the expense of other projects of course) :)

cpcloud commented 10 years ago

in case a brave soul would like to venture into the land of numpy internals:

Missing data support in numpy: https://github.com/pydata/pandas/issues/8350

cpcloud commented 10 years ago

nice-to-have:

pandas + airspeed velocity

demo: http://mdboom.github.io/astropy-benchmark/

adding to the top list
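For anyone curious, asv benchmarks are plain Python classes; a minimal sketch of what one for pandas could look like (the class name and data sizes here are made up):

import numpy as np
import pandas as pd


class GroupBySum:
    # asv calls setup() before timing and discovers methods named time_*
    def setup(self):
        n = 100000
        self.df = pd.DataFrame({"key": np.random.randint(0, 100, n),
                                "val": np.random.randn(n)})

    def time_groupby_sum(self):
        self.df.groupby("key")["val"].sum()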

jreback commented 10 years ago

that looks sexy - can you create a new issue for asv? It's vbench-like.

cpcloud commented 10 years ago

yep. vbench was actually mentioned in the asv talk at scipy 2014

jankatins commented 10 years ago

Implementing a CategoricalIndex https://github.com/pydata/pandas/issues/7629?

jreback commented 10 years ago

@JanSchulz I think out of scope for a 1-day event

jasongrout commented 10 years ago

in case a brave soul would like to venture into the land of numpy internals:

I should mention that Mark Wiebe (who knows a lot of numpy internals) will be there. Additionally (re: airspeed velocity), Michael Droettboom will be there.

immerrr commented 10 years ago

Is there a summary of this hackathon available online?

jreback commented 10 years ago

no summary - a few issues worked on / closed

jorisvandenbossche commented 9 years ago

@ all: I am going to update this in the coming days for the upcoming Bloomberg Hackathon this weekend, 29-30 November. But if you have new things to add or updates on the list above, certainly post/edit!