Closed c0indev3l closed 10 years ago
Interesting. I'm not totally clear on what you're trying to do above - why are you special casing row 0 (instead of just doing something with row 0 first and then changing the line to: for i in range(1, row):
instead? It's also unclear why you're using k and j in a for loop, but maybe I'm missing something.
Based on a quick read of gspread, the easiest thing to do would be something like:
values = wks.get_all_values()
header, rows = values[0], values[1:]
df = DataFrame(rows, columns=header)
That said, the above is particularly inefficient because it first stores all of the cells in Cell
objects, creates a defaultdict, stores everything, then uses a double list comprehension to get all the data.
1 big question: Why would you use gspread over the Google Data API? Are their particular advantages?
Thanks for improving this script
I'm using other index to have unique column name. Because type(df[colname]) is Pandas DataFrame (instead of Pandas Series)
I don't know Google Data API so I can't say.
Strong +1 on this! Couple of things:
pandas.io.parsers.TextReader
to parse the text. You just have to get the values into a list of lists of strings.@cpcloud - I'm guessing that oauth and separate authentication probably require using the Google Data API bindings (though gspread looks to have a decent amount of activity). The good part is that it should be relatively easy to test because we could set up a Google Apps account (and I wonder if it's possible to set up private Travis credentials). I was thinking the same thing about passing through the text parser.
@c0indev3l This type(df[colname])
should not return DataFrame
. You're not using a Panel
are you?
Also, minor point: should probably call this pandas.io.read_drive
(as in Google Drive).
Related: #4093
@cpcloud so it's Google Drive because we'd only read spreadsheets?
not necessarily, but that's the name of the service
not stuck on that name...let's get a working prototype b4 we start bikeshedding about names :)
more wondering if it's the same as the older drive issue?
i think it is, close the other? this one is further along
Google Drive also provide Google Table Fusion http://www.google.com/drive/apps.html#fusiontables
hm okay...well maybe read_google
.... still think we should just get something working...
Maybe
pd.io.gdrive.read_spreadsheet
for reading spreadsheet,
pd.io.gdrive.read_table
for reading a Table Fusion
but that's right that giving a name is probably not so important
What is important in my mind is also to wonder what kind of API Pandas will provide.
For example we can get a document using either its name or its url.
Does Pandas should "recognize" if what is given is a url ? Or user should explicitly send parameter as url='....' or name='...'
we have some code that attempts to detect urls in pandas/io/common.py
What will happen if the name of my document "looks like" an url ?
Try to name your spreadsheet document http://www.google.com/
dealing with spreadsheet names should happen way after dealing with the url of the workbook
So you won't be able to access to a spreadsheet with a name which will looks like an url.
An other question is what will happen if a spreadsheet will have columns with same name. Should the name be automatically incremented ? Or does Pandas provides a function to do this ?
@c0indev3l instead of hashing these details out now, can I suggest a process for this?
Then, after we have (1) - (3) done, we can do whatever we want with deciding how to get the appropriate worksheet, since the worksheet parsing part won't have to care about it.
@c0indev3l echoing @jtratner: Work out the parsing first (since that's the core feature being introduced here) then worry about the ancillary details (these are important too, but without correct text parsing they don't matter).
An other useful feature will be to automatically rename columns to have unique column name (like with pd.io.excel.read_excel). I asked about it on StackOverflow http://stackoverflow.com/questions/19071622/automatically-rename-columns-to-ensure-they-are-unique and I get this piece of code
df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']
def uniquify(df_columns):
seen = set()
for item in df_columns:
fudge = 1
newitem = item
while newitem in seen:
fudge += 1
newitem = "{}_{}".format(item, fudge)
yield newitem
seen.add(newitem)
list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']
But Pandas probably ever provide a way to "uniquify" columns name (I should have a look at read_excel
code to understand that part of Pandas)
I am not sure of the scope of this issue?
@cpcloud
i think feature creep for us, but interesting for an external library/module that does ... this requires authentication handling and setting up keys and such for an app ... would be nice for a lib that provide the automatiion for whatever can be automated and then uses pandas to parse the json or html (can't remember how these are delivered in API)
Anything happened since this last post around panda / Google Spreadsheet connectivity? I'm really curious about a OAuth2 supported Spreadsheet alternative to Pandas Excel interfaces.
Here is library that provides possibilities for uploading/downloading between spreadsheets and pandas DataFrame, for those who tries to find solutions in this issue.
@maybelinot thx for the reply, very appreciated, I try to give it a run soon!
Alternative solution - using gspread library:
gc = gspread.authorize(credentials) wks = gc.open('Spread Sheet Title') df = pd.DataFrame(wks.sheet1.get_all_records())
I think this is a great idea that could hugely help pandas users, and even attract new ones.
@kenhui521 What would the credentials include ?
Google Spreadsheet is an online spreadsheet. Google Document can be use to generate survey and results of survey will be stored as a Google Spreadsheet document.
Maybe Pandas should provide a
pd.io.gspread.read_frame
function that will read a given Google Spreadsheet document (using email, password, url or name of document and range) and return a DataFrame.pd.io.gspread.read_frame(email, password, filename, sheet, cell_range)
gspread
package could help http://burnash.github.io/gspread/I wrote a little bit of code for that... but it could probably be improve and add into Pandas.