cc @seth-p cc @rockg
Most of these are doc/testing things. I looked through the "Good as first PR" issues. Anyone have any issues to add that are not on that list?
More customization of Excel input/output could be great, e.g. making it easier to specify per-column colors/formatting, float formats, etc. The code base isn't too complicated there (just a mixture of the formatter and the ExcelWriter stuff), and you could make rapid progress because it's really easy to test and create samples. I think the result would be immediately rewarding (better-looking output, easier report generation, etc.). Plus, for #4679 and #8272 you'd get a better sense of pandas internals too.
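For a sense of what per-column formatting involves today, here is a rough sketch that reaches into the underlying engine (this assumes the xlsxwriter engine is installed; the column letter, width, and number format are arbitrary examples, not a pandas API for this):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.rand(5) * 100,
                   'qty': np.random.randint(1, 10, size=5)})

writer = pd.ExcelWriter('report.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')

# Per-column formatting currently means grabbing the engine's workbook/worksheet
# objects directly rather than going through a pandas option.
workbook = writer.book
money = workbook.add_format({'num_format': '$#,##0.00'})
writer.sheets['Sheet1'].set_column('B:B', 14, money)  # the 'price' column

writer.close()

The enhancement idea above is to expose this kind of thing without having to drop down to the engine.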
List of PRs (ordered from most interesting/highest impact to least interesting):
BytesIO output in ExcelWriter #7074
@jtratner thanks! I'll update!
One other great (but self-contained) project would be to convert a pandas DataFrame into a new BigQuery table when writing. I've been working with BigQuery quite a bit; it would be pretty simple to do and would be a nice way to dig into dealing with column metadata. I'll put up an issue right now with more details.
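To make the "column metadata" part concrete, a minimal sketch of deriving a BigQuery-style schema from DataFrame dtypes (the helper name and the dtype-to-type mapping are assumptions for illustration, not an existing pandas API):

import pandas as pd

# Map numpy dtype kinds to BigQuery field types; anything unknown falls back to STRING.
_TYPE_MAP = {'i': 'INTEGER', 'u': 'INTEGER', 'f': 'FLOAT',
             'b': 'BOOLEAN', 'M': 'TIMESTAMP'}

def dataframe_to_bq_schema(df):
    return {'fields': [{'name': col, 'type': _TYPE_MAP.get(dtype.kind, 'STRING')}
                       for col, dtype in df.dtypes.items()]}

df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})
print(dataframe_to_bq_schema(df))
# {'fields': [{'name': 'a', 'type': 'INTEGER'},
#             {'name': 'b', 'type': 'FLOAT'},
#             {'name': 'c', 'type': 'STRING'}]}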
@jtratner thanks! that would be great!
@jreback - I put it up; it should be pretty simple to implement, except for how to handle int columns that gain NaN values. #8325
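For context on why the int/NaN case is awkward, a quick illustration of the silent upcast (plain pandas behavior, nothing BigQuery-specific):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
print(df['a'].dtype)   # int64

# Introducing a missing row upcasts the column, because NumPy integers
# cannot represent NaN.
df2 = df.reindex([0, 1, 2, 3])
print(df2['a'].dtype)  # float64

So a column that starts out as INTEGER may need to be written as FLOAT (or the NaNs handled some other way) once missing values appear.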
These are things that I would like to see:
Allow reindex to work without passing a completely new MultiIndex (i.e., reindexing a level copies the other levels): #7895
Some HDFStore enhancements which should be straightforward: #6857
thanks @TomAugspurger @rockg @jtratner
For the doc issues, maybe also https://github.com/pydata/pandas/issues/3705 and https://github.com/pydata/pandas/issues/1967?
@jorisvandenbossche thanks!
I was also thinking, some utility function that can 'read' the output of a DataFrame back in would be something nice (for simple situations you can use read_csv, or better read_fwf, but for more complex things (index names, multi-indexes, ...) this does not work anymore, I think).
@jorisvandenbossche not sure what you mean. Except for the column index losing its name (not a multi-index though), csv round-tripping preserves everything.
But I do not mean csv round-tripping, I mean console print round-tripping.
Is there an easy way to read this in (the output as a string)?
In [1]: df = pd.DataFrame(np.random.randn(4,4), index=pd.MultiIndex.from_product([['a', 'b'],[0,1]], names=['l1', 'l2']), columns=['a', 'b', 'c', 'd'])
In [2]: df
Out[2]:
              a         b         c         d
l1 l2
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500
Dealing with the multi-index, dealing with the sparse index, index names, ... (or, to start with, at least not choking on those).
I think the clipboard is pretty robust (it's just read_csv underneath). It needs various options specified, but csv is not a completely fungible format anyhow (unlike, say, HDF5, where you CAN store the metadata).
In [25]: df.to_clipboard()
In [26]: pd.read_clipboard()
Out[26]:
  l1  l2         a         b         c         d
0  a   0 -0.114687 -0.111372  1.116020 -1.127915
1  a   1  1.493011 -0.208416 -0.129818 -0.023854
2  b   0  0.904737 -0.213157 -0.214423  0.300431
3  b   1  0.043716 -0.027796 -0.462323  0.298288
In [29]: pd.read_clipboard(index_col=[0,1])
Out[29]:
              a         b         c         d
l1 l2
a  0  -0.114687 -0.111372  1.116020 -1.127915
   1   1.493011 -0.208416 -0.129818 -0.023854
b  0   0.904737 -0.213157 -0.214423  0.300431
   1   0.043716 -0.027796 -0.462323  0.298288
Yes, but what I mean is: if you have this output as a string, or you can copy it (e.g. from an example in the docs, from a question on Stack Overflow, ...), can you convert it easily to a DataFrame in a new session? Using read_clipboard on my example above gives, e.g., CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
@jorisvandenbossche hmm works for me on master.
I usually just copy-paste from a question and do this:
from io import StringIO
from pandas import read_csv

data = """
here is the copied data exactly.....
"""
df = read_csv(StringIO(data))
FYI, I tried making this work from just a string (e.g. passing the string straight to read_csv); it's a bit non-trivial to figure out, actually.
Yep, that is what I also do, but still, I mostly have to adapt something in the original data to get it working. It would be nice if there were some utility that could read any such output.
data = """              a         b         c         d
l1 l2
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500"""
pd.read_csv(StringIO(data), sep='\s+')
Can you read this in with read_csv without tweaking something? (But we're deviating a bit from the original issue here ...)
@jorisvandenbossche I suppose you could have a wrapper that 'tries' various things, but it's non-trivial to simply guess; well, you can, but there are so many edge cases that it's MUCH easier to just have the user specify it.
Are there that many edge cases? The output of the pandas __repr__ is rather well defined, or not?
Ahh, you are proposing a pd.read_csv(data, repr=True) (so that we don't have ANOTHER top-level function!) that basically figures out the options. Hmm, interesting.
@jorisvandenbossche I updated the Enhancements section.
Maybe it doesn't need to be top-level; another possibility is something like pd.util.read_repr. Seems like a nice little project for somebody to hack on that could be useful.
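To make the read_repr idea concrete, a rough sketch of what the core parsing could look like for the two-level example above (read_repr is hypothetical, not an existing pandas function; it ignores truncated output, dtypes other than float, and deeper indexes):

import pandas as pd

def read_repr(text):
    # Parse the printed frame: first line holds the column labels, second line
    # the index names, remaining lines the (sparse) MultiIndex rows.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    columns = lines[0].split()
    index_names = lines[1].split()

    index, rows, outer = [], [], None
    for line in lines[2:]:
        parts = line.split()
        if len(parts) == len(columns) + 2:   # outer level is shown on this row
            outer, inner = parts[0], parts[1]
            values = parts[2:]
        else:                                # sparse repr: outer level omitted
            inner, values = parts[0], parts[1:]
        index.append((outer, inner))
        rows.append([float(v) for v in values])

    mi = pd.MultiIndex.from_tuples(index, names=index_names)
    return pd.DataFrame(rows, index=mi, columns=columns)

Generalizing this to arbitrary index depths, dtypes, and truncated reprs is where the real work (and the edge cases mentioned above) would be.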
Oh, https://github.com/pydata/pandas/issues/5563 would be a good one (Series HTML repr)
@TomAugspurger nice posts you have here: tomaugspurger.github.io/blog/2014/09/04/practical-pandas-part-2-more-tidying-more-data-and-merging
About halfway down you pass method='table' to to_hdf, which is ignored (and means you get a perf warning); use format='table' and it will work.
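For reference, a minimal sketch of the fix (assuming PyTables is installed and a throwaway store.h5 file):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

# method='table' is not a to_hdf keyword and gets ignored (the fixed format is
# used, hence the performance warning); the keyword is format.
df.to_hdf('store.h5', key='df', format='table')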
Would #8162 (allowing the index to be referenced by name, like a column) be doable?
I would love to see something like this happen in SF!
wow this is great everyone bravo!
let's see how much gets done!
of course the point of this list was to get as much dev time as possible (at the expense of other projects of course) :)
in case a brave soul would like to venture into the land of numpy internals:
Missing data support in numpy: https://github.com/pydata/pandas/issues/8350
nice-to-have:
pandas + airspeed velocity
demo: http://mdboom.github.io/astropy-benchmark/
adding to the top list
that looks sexy - can you create a new issue for asv? vbench-like
Yep. vbench was actually mentioned in the asv talk at SciPy 2014.
Implementing a CategoricalIndex: https://github.com/pydata/pandas/issues/7629?
@JanSchulz I think out of scope for a 1-day event
in case a brave soul would like to venture into the land of numpy internals:
I should mention that Mark Wiebe (who knows a lot of numpy internals) will be there. Additionally (re: airspeed velocity), Michael Droettboom will be there.
Is there a summary of this hackathon available online?
no summary - a few issues worked on / closed
@ all I am going to update this in the coming days for the upcoming Bloomberg Hackathon this weekend, 29-30 November. But if you have new things to add or updates to the list above, certainly post/edit!
Contributing Guidelines / Help: https://github.com/pydata/pandas/wiki
Dev Docs: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html
Docs:
Docs on ipython startup files: #5748
GA docs: https://github.com/pydata/pandas/issues/3508

Perf:
vbench on different group sizes: https://github.com/pydata/pandas/issues/6787

Tests:

Bugs:

Enhancements:
read_csv (or maybe read_repr/string) to allow round-tripping of the repr (can also serve as a basis for read_clipboard)
accept Period in DatetimeIndex for start/end: https://github.com/pydata/pandas/issues/6780
to_dict orient param: https://github.com/pydata/pandas/issues/7840
level kw to any/all: https://github.com/pydata/pandas/issues/8302
clean up code by removing core/array.py: https://github.com/pydata/pandas/issues/8359

IO:
to_latex work with multi-index: https://github.com/pydata/pandas/issues/8336
Series.to_html not working so well: https://github.com/pydata/pandas/issues/5563

Excel Oriented:

SQL:
to_sql per column: #8778

More advanced:
Collaborative Efforts:
@jorisvandenbossche @cpcloud @TomAugspurger @hayd cc @shoyer cc @immerrr