cc @seth-p cc @rockg
Most of these are doc/testing things. I looked through the "Good as first PR" issues. Anyone have any issues to add that are not on that list?
More customization of Excel input/output could be great, e.g. making it easier to specify per-column colors/formatting, float formats, etc. The code base isn't too complicated there (just a mixture of the formatter and the ExcelWriter stuff), and you could make rapid progress because it's really easy to test and create samples. I think the result would be immediately rewarding (better-looking output, easier report generation, etc.). Plus, for #4679 and #8272 you'd get a better sense of pandas internals too.
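For a sense of what per-column formatting involves today, here is a rough sketch that reaches into the underlying engine (this assumes the xlsxwriter engine is installed; the column letter, width, and number format are arbitrary examples, not a pandas API for this):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.rand(5) * 100,
                   'qty': np.random.randint(1, 10, size=5)})

writer = pd.ExcelWriter('report.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')

# Per-column formatting currently means grabbing the engine's workbook/worksheet
# objects directly rather than going through a pandas option.
workbook = writer.book
money = workbook.add_format({'num_format': '$#,##0.00'})
writer.sheets['Sheet1'].set_column('B:B', 14, money)  # the 'price' column

writer.close()

The enhancement idea above is to expose this kind of thing without having to drop down to the engine.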
List of PRs (ordered from most interesting/highest impact to least interesting):
BytesIO output in ExcelWriter #7074
@jtratner thanks! I'll update!
One other great (but self-contained) project would be to convert a pandas DataFrame into a new BigQuery table when writing. I've been working with BigQuery quite a bit; it would be pretty simple to do and would be a nice way to dig into dealing with column metadata. I'll put up an issue right now with more details.
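To make the "column metadata" part concrete, a minimal sketch of deriving a BigQuery-style schema from DataFrame dtypes (the helper name and the dtype-to-type mapping are assumptions for illustration, not an existing pandas API):

import pandas as pd

# Map numpy dtype kinds to BigQuery field types; anything unknown falls back to STRING.
_TYPE_MAP = {'i': 'INTEGER', 'u': 'INTEGER', 'f': 'FLOAT',
             'b': 'BOOLEAN', 'M': 'TIMESTAMP'}

def dataframe_to_bq_schema(df):
    return {'fields': [{'name': col, 'type': _TYPE_MAP.get(dtype.kind, 'STRING')}
                       for col, dtype in df.dtypes.items()]}

df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})
print(dataframe_to_bq_schema(df))
# {'fields': [{'name': 'a', 'type': 'INTEGER'},
#             {'name': 'b', 'type': 'FLOAT'},
#             {'name': 'c', 'type': 'STRING'}]}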
@jtratner thanks! that would be great!
@jreback - I put it up; it should be pretty simple to implement, except for how to handle int columns that gain NaN values. #8325
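For context on why the int/NaN case is awkward, a quick illustration of the silent upcast (plain pandas behavior, nothing BigQuery-specific):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
print(df['a'].dtype)   # int64

# Introducing a missing row upcasts the column, because NumPy integers
# cannot represent NaN.
df2 = df.reindex([0, 1, 2, 3])
print(df2['a'].dtype)  # float64

So a column that starts out as INTEGER may need to be written as FLOAT (or the NaNs handled some other way) once missing values appear.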
These are things that I would like to see:
Allow reindex to work without passing a completely new MultiIndex (i.e., reindexing a level copies the other levels): #7895
Some HDFStore enhancements which should be straightforward: #6857
thanks @TomAugspurger @rockg @jtratner
For the doc issues, maybe also https://github.com/pydata/pandas/issues/3705 and https://github.com/pydata/pandas/issues/1967?
@jorisvandenbossche thanks!
I was also thinking, some utility function that can 'read' the output of a DataFrame back in would be something nice (for simple situations you can use read_csv, or better read_fwf, but for more complex things (index names, multi-indexes, ...) this does not work anymore, I think).
@jorisvandenbossche not sure what you mean. Except for the column index losing its name (not a multi-index though), csv round-tripping preserves everything.
But I do not mean csv round-tripping, I mean console print round-tripping.
Is there an easy way to read this in (the output as a string)?
In [1]: df = pd.DataFrame(np.random.randn(4,4), index=pd.MultiIndex.from_product([['a', 'b'],[0,1]], names=['l1', 'l2']), columns=['a', 'b', 'c', 'd'])
In [2]: df
Out[2]:
              a         b         c         d
l1 l2
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500
Dealing with the multi-index, dealing with the sparse index, index names, ... (or, to start with, at least not choking on those).
I think the clipboard is pretty robust (it's just read_csv underneath). It needs various options specified, but csv is not a completely fungible format anyhow (unlike, say, HDF5, where you CAN store the metadata).
In [25]: df.to_clipboard()
In [26]: pd.read_clipboard()
Out[26]:
  l1  l2         a         b         c         d
0  a   0 -0.114687 -0.111372  1.116020 -1.127915
1  a   1  1.493011 -0.208416 -0.129818 -0.023854
2  b   0  0.904737 -0.213157 -0.214423  0.300431
3  b   1  0.043716 -0.027796 -0.462323  0.298288
In [29]: pd.read_clipboard(index_col=[0,1])
Out[29]:
              a         b         c         d
l1 l2
a  0  -0.114687 -0.111372  1.116020 -1.127915
   1   1.493011 -0.208416 -0.129818 -0.023854
b  0   0.904737 -0.213157 -0.214423  0.300431
   1   0.043716 -0.027796 -0.462323  0.298288
Yes, but what I mean is: if you have this output as a string, or you can copy it (e.g. from an example in the docs, from a question on Stack Overflow, ...), can you convert it easily to a DataFrame in a new session? Using read_clipboard on my example above gives, e.g., CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
@jorisvandenbossche hmm works for me on master.
I usually just copy-paste from a question and do this:
from io import StringIO
from pandas import read_csv

data = """
here is the copied data exactly.....
"""
df = read_csv(StringIO(data))
FYI, I tried making this work from just a string (e.g. passing the string straight to read_csv); it's a bit non-trivial to figure out, actually.
Yep, that is what I also do, but still, I mostly have to adapt something in the original data to get it working. It would be nice if there were some utility that could read any such output.
data = """              a         b         c         d
l1 l2
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500"""
pd.read_csv(StringIO(data), sep='\s+')
Can you read this in with read_csv without tweaking something? (But we're deviating a bit from the original issue here ...)
@jorisvandenbossche I suppose you could have a wrapper that 'tries' various things, but it's non-trivial to simply guess; well, you can, but there are so many edge cases that it's MUCH easier to just have the user specify it.
Are there that many edge cases? The output of the pandas __repr__ is rather well defined, or not?
Ahh, you are proposing a pd.read_csv(data, repr=True) (so that we don't have ANOTHER top-level function!) that basically figures out the options. Hmm, interesting.
@jorisvandenbossche I updated the Enhancements section.
Maybe it doesn't need to be top-level; another possibility is something like pd.util.read_repr. Seems like a nice little project for somebody to hack on that could be useful.
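To make the read_repr idea concrete, a rough sketch of what the core parsing could look like for the two-level example above (read_repr is hypothetical, not an existing pandas function; it ignores truncated output, dtypes other than float, and deeper indexes):

import pandas as pd

def read_repr(text):
    # Parse the printed frame: first line holds the column labels, second line
    # the index names, remaining lines the (sparse) MultiIndex rows.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    columns = lines[0].split()
    index_names = lines[1].split()

    index, rows, outer = [], [], None
    for line in lines[2:]:
        parts = line.split()
        if len(parts) == len(columns) + 2:   # outer level is shown on this row
            outer, inner = parts[0], parts[1]
            values = parts[2:]
        else:                                # sparse repr: outer level omitted
            inner, values = parts[0], parts[1:]
        index.append((outer, inner))
        rows.append([float(v) for v in values])

    mi = pd.MultiIndex.from_tuples(index, names=index_names)
    return pd.DataFrame(rows, index=mi, columns=columns)

Generalizing this to arbitrary index depths, dtypes, and truncated reprs is where the real work (and the edge cases mentioned above) would be.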
Oh, https://github.com/pydata/pandas/issues/5563 would be a good one (Series HTML repr)
@TomAugspurger nice posts you have here: tomaugspurger.github.io/blog/2014/09/04/practical-pandas-part-2-more-tidying-more-data-and-merging
About halfway down you pass method='table' to to_hdf, which is ignored (and means you get a perf warning); use format='table' and it will work.
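For reference, a minimal sketch of the fix (assuming PyTables is installed and a throwaway store.h5 file):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

# method='table' is not a to_hdf keyword and gets ignored (the fixed format is
# used, hence the performance warning); the keyword is format.
df.to_hdf('store.h5', key='df', format='table')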
Would #8162 (allowing the index to be referenced by name, like a column) be doable?
I would love to see something like this happen in SF!
wow this is great everyone bravo!
let's see how much gets done!
of course the point of this list was to get as much dev time as possible (at the expense of other projects of course) :)
in case a brave soul would like to venture into the land of numpy internals:
Missing data support in numpy: https://github.com/pydata/pandas/issues/8350
nice-to-have:
pandas + airspeed velocity
demo: http://mdboom.github.io/astropy-benchmark/
adding to the top list
that looks sexy - can you create a new issue for asv? vbench-like
Yep. vbench was actually mentioned in the asv talk at SciPy 2014.
Implementing a CategoricalIndex: https://github.com/pydata/pandas/issues/7629?
@JanSchulz I think out of scope for a 1-day event
in case a brave soul would like to venture into the land of numpy internals:
I should mention that Mark Wiebe (who knows a lot of numpy internals) will be there. Additionally (re: airspeed velocity), Michael Droettboom will be there.
Is there a summary of this hackathon available online?
no summary - a few issues worked on / closed
@ all I am going to update this in the coming days for the upcoming Bloomberg Hackathon this weekend, 29-30 November. But if you have new things to add or updates to the list above, certainly post/edit!
Contributing Guidelines / Help: https://github.com/pydata/pandas/wiki
Dev Docs: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html
Docs:
Docs on ipython startup files: #5748
GA docs: https://github.com/pydata/pandas/issues/3508

Perf:
vbench on different group sizes: https://github.com/pydata/pandas/issues/6787

Tests:

Bugs:

Enhancements:
read_csv (or maybe read_repr/string) to allow round-tripping of the repr (can also serve as a basis for read_clipboard)
accept Period in DatetimeIndex for start/end: https://github.com/pydata/pandas/issues/6780
to_dict orient param: https://github.com/pydata/pandas/issues/7840
level kw to any/all: https://github.com/pydata/pandas/issues/8302
clean up code by removing core/array.py: https://github.com/pydata/pandas/issues/8359

IO:
to_latex work with multi-index: https://github.com/pydata/pandas/issues/8336
Series.to_html not working so well: https://github.com/pydata/pandas/issues/5563

Excel Oriented:

SQL:
to_sql per column: #8778

More advanced:
Collaborative Efforts:
@jorisvandenbossche @cpcloud @TomAugspurger @hayd cc @shoyer cc @immerrr