pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.56k stars 17.89k forks source link

pd.io.gspread.read_frame - from Google Spreadsheet to Pandas DataFrame #5017

Closed c0indev3l closed 10 years ago

c0indev3l commented 11 years ago

Google Spreadsheet is an online spreadsheet. Google Document can be use to generate survey and results of survey will be stored as a Google Spreadsheet document.

Maybe Pandas should provide a pd.io.gspread.read_frame function that will read a given Google Spreadsheet document (using email, password, url or name of document and range) and return a DataFrame.

pd.io.gspread.read_frame(email, password, filename, sheet, cell_range)

gspread package could help http://burnash.github.io/gspread/

I wrote a little bit of code for that... but it could probably be improve and add into Pandas.

email = '...@...'
password = '...'
cell_range = 'A1:R20'

gc = gspread.login(email, password)
wks = gc.open(filename).sheet1

cell_list = wks.range('')

# Build a NumPy array
(row, col) = (cell_list[-1].row, cell_list[-1].col)
data = np.empty((row-1,col), dtype=object)
data[:] = np.nan

k = 0
cols = []
for i in range(row):
    for j in range(col):
        val = cell_list[k].value
        if i==0:
            if val != None:
                if val not in cols:
                    cols.append(val)
                else: # add a number if colname ever exists
                    ii = 1
                    while True:
                        new_val = val + '_' + str(ii)
                        if new_val not in cols:
                            break
                        ii += 1
                    cols.append(new_val)
            else:
                cols.append('col_'+str(j))
                #cols.append(j)
        else:
            if val != None:
                data[i-1, j] = val
        k += 1
jtratner commented 11 years ago

Interesting. I'm not totally clear on what you're trying to do above - why are you special casing row 0 (instead of just doing something with row 0 first and then changing the line to: for i in range(1, row): instead? It's also unclear why you're using k and j in a for loop, but maybe I'm missing something.

Based on a quick read of gspread, the easiest thing to do would be something like:

values = wks.get_all_values()
header, rows = values[0], values[1:]
df = DataFrame(rows, columns=header)

That said, the above is particularly inefficient because it first stores all of the cells in Cell objects, creates a defaultdict, stores everything, then uses a double list comprehension to get all the data.

1 big question: Why would you use gspread over the Google Data API? Are their particular advantages?

c0indev3l commented 11 years ago

Thanks for improving this script

I'm using other index to have unique column name. Because type(df[colname]) is Pandas DataFrame (instead of Pandas Series)

I don't know Google Data API so I can't say.

cpcloud commented 11 years ago

Strong +1 on this! Couple of things:

  1. It would be nice to support OAuth authentication, if possible, so that users don't have to pass plain text passwords around. This also enables users to not have to log in every time they want to use this. I would not want to store my password in a script so that I could reuse this as part of some larger code base. If this is already supported, then maybe some documentation about it is in order. If I missed that documentation, my apologies.
  2. You can probably use pandas.io.parsers.TextReader to parse the text. You just have to get the values into a list of lists of strings.
jtratner commented 11 years ago

@cpcloud - I'm guessing that oauth and separate authentication probably require using the Google Data API bindings (though gspread looks to have a decent amount of activity). The good part is that it should be relatively easy to test because we could set up a Google Apps account (and I wonder if it's possible to set up private Travis credentials). I was thinking the same thing about passing through the text parser.

cpcloud commented 11 years ago

@c0indev3l This type(df[colname]) should not return DataFrame. You're not using a Panel are you?

Also, minor point: should probably call this pandas.io.read_drive (as in Google Drive).

Related: #4093

jtratner commented 11 years ago

@cpcloud so it's Google Drive because we'd only read spreadsheets?

cpcloud commented 11 years ago

not necessarily, but that's the name of the service

cpcloud commented 11 years ago

not stuck on that name...let's get a working prototype b4 we start bikeshedding about names :)

jtratner commented 11 years ago

more wondering if it's the same as the older drive issue?

cpcloud commented 11 years ago

i think it is, close the other? this one is further along

c0indev3l commented 11 years ago

Google Drive also provide Google Table Fusion http://www.google.com/drive/apps.html#fusiontables

cpcloud commented 11 years ago

hm okay...well maybe read_google.... still think we should just get something working...

c0indev3l commented 11 years ago

Maybe pd.io.gdrive.read_spreadsheet for reading spreadsheet, pd.io.gdrive.read_table for reading a Table Fusion

but that's right that giving a name is probably not so important

What is important in my mind is also to wonder what kind of API Pandas will provide.

For example we can get a document using either its name or its url.

Does Pandas should "recognize" if what is given is a url ? Or user should explicitly send parameter as url='....' or name='...'

cpcloud commented 11 years ago

we have some code that attempts to detect urls in pandas/io/common.py

c0indev3l commented 11 years ago

What will happen if the name of my document "looks like" an url ?

Try to name your spreadsheet document http://www.google.com/

cpcloud commented 11 years ago

dealing with spreadsheet names should happen way after dealing with the url of the workbook

c0indev3l commented 11 years ago

So you won't be able to access to a spreadsheet with a name which will looks like an url.

An other question is what will happen if a spreadsheet will have columns with same name. Should the name be automatically incremented ? Or does Pandas provides a function to do this ?

jtratner commented 11 years ago

@c0indev3l instead of hashing these details out now, can I suggest a process for this?

  1. Assume we have the correct worksheet (i.e., either through gspread or the Google Data API), set up the processing for that specific worksheet.
  2. Add in the ability to pass all the text parser kwargs that are appropriate through the function.
  3. Add the ability to get the correct worksheet given a spreadsheet in gspread.

Then, after we have (1) - (3) done, we can do whatever we want with deciding how to get the appropriate worksheet, since the worksheet parsing part won't have to care about it.

cpcloud commented 11 years ago

@c0indev3l echoing @jtratner: Work out the parsing first (since that's the core feature being introduced here) then worry about the ancillary details (these are important too, but without correct text parsing they don't matter).

c0indev3l commented 11 years ago

An other useful feature will be to automatically rename columns to have unique column name (like with pd.io.excel.read_excel). I asked about it on StackOverflow http://stackoverflow.com/questions/19071622/automatically-rename-columns-to-ensure-they-are-unique and I get this piece of code

df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']

def uniquify(df_columns):
    seen = set()

    for item in df_columns:
        fudge = 1
        newitem = item

        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)

        yield newitem
       seen.add(newitem)

list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']

But Pandas probably ever provide a way to "uniquify" columns name (I should have a look at read_excel code to understand that part of Pandas)

jreback commented 10 years ago

I am not sure of the scope of this issue?

@cpcloud

cpcloud commented 10 years ago

i think feature creep for us, but interesting for an external library/module that does ... this requires authentication handling and setting up keys and such for an app ... would be nice for a lib that provide the automatiion for whatever can be automated and then uses pandas to parse the json or html (can't remember how these are delivered in API)

spex66 commented 9 years ago

Anything happened since this last post around panda / Google Spreadsheet connectivity? I'm really curious about a OAuth2 supported Spreadsheet alternative to Pandas Excel interfaces.

maybelinot commented 9 years ago

Here is library that provides possibilities for uploading/downloading between spreadsheets and pandas DataFrame, for those who tries to find solutions in this issue.

spex66 commented 9 years ago

@maybelinot thx for the reply, very appreciated, I try to give it a run soon!

kenhui521 commented 8 years ago

Alternative solution - using gspread library:

gc = gspread.authorize(credentials) wks = gc.open('Spread Sheet Title') df = pd.DataFrame(wks.sheet1.get_all_records())

palewire commented 7 years ago

I think this is a great idea that could hugely help pandas users, and even attract new ones.

akshitasood63 commented 6 years ago

@kenhui521 What would the credentials include ?