
DISCUSS: move online data reader subpackages to separate project? #8961

Closed: jorisvandenbossche closed this issue 9 years ago

jorisvandenbossche commented 9 years ago

Opening an issue to discuss this: should we think about moving the functionality to read online data sources to a separate package?

This was mentioned recently here https://github.com/pydata/pandas/issues/8842#issuecomment-63456052 and https://github.com/pydata/pandas/pull/8631#issuecomment-61805926

Some reasons why we would want to move it:

Some questions that come to mind:

Pinging some of the people who have worked on these subpackages (feel free to add others if you know them): @dstephens99 @MichaelWS @kdiether @cpcloud @vincentarelbundock @jnmclarty @jreback @immerrr @TomAugspurger @hayd @jtratner

immerrr commented 9 years ago

I was trying to think through the workflow of such a migration, and it hit me that there's one thing all those subprojects to be moved away get for free under the umbrella of pandas: the utilities, aka the boring stuff, like setup scripts, docs, tests, benchmarks & CI infrastructure. Maybe we could think of a way to move the crucial parts of those into a subrepo(-s), so that a fix in one place can be easily propagated everywhere.

Also, as food for thought, I think there are a lot of non-remote IO packages that could benefit from moving away and being provided as plugins, e.g. the fresh and unstable stata, clipboard, maybe excel. If you ask me, pretty much every module that works through a third-party dependency could be worth separating out and maintaining closer to that dependency package, by a person with experience with it. One organizational bonus of such a separation would be that pandas would have far fewer optional dependencies.

jreback commented 9 years ago

I am not in favor of moving any core IO packages, e.g. stata, clipboard, excel.

One of the main points of pandas is that it provides a consistent interface for IO. I'm not sure having these as separate packages will help that; it would just add another dependency, and except for excel these don't actually have any deps.

immerrr commented 9 years ago

FTR, clipboard does have deps, they are simply hidden under a layer of pandas.util.clipboard:

$ grep import pandas/util/clipboard.py 
#   import pyperclip
import platform, os
    import ctypes
                import gtk
                    import PyQt4 as qt4
                    import PyQt4.QtCore
                    import PyQt4.QtGui
                        import PySide as qt4
                        import PySide.QtCore
                        import PySide.QtGui
$ grep "os\.\(popen\|system\)" pandas/util/clipboard.py 
    outf = os.popen('pbcopy', 'w')
    outf = os.popen('pbpaste', 'r')
    outf = os.popen('xclip -selection c', 'w')
    outf = os.popen('xclip -selection c -o', 'r')
    outf = os.popen('xsel -i', 'w')
    outf = os.popen('xsel -o', 'r')
    xclipExists = os.system('which xclip > /dev/null') == 0
        xselExists = os.system('which xsel > /dev/null') == 0
jreback commented 9 years ago

@immerrr hah.. I knew you were going to say that. However, these are system-level for the most part and generally included on linux/mac/windows (except the PyQt). So not that big of a deal.

immerrr commented 9 years ago

Also, it depends on what you call "core". I always thought of the pandas core as being containers + time utils + split/apply/combine + high-perf queries/computations. The rest could (should) just feed data in or out via a high-throughput API.

That being said, I agree that a unified API is good, so it should be sorted out before starting to pull the pandas codebase apart. And yes, for practicality reasons, the most frequently used IO modules could be considered core, too; I just don't have the data to figure out which of them are most popular.

jorisvandenbossche commented 9 years ago

Maybe we should limit the discussion to the remote data readers first? The other IO modules are indeed much more fundamental -> those would require a more general discussion on the scope of pandas as the core project. The remote data readers could also be a good exercise to see how smoothly this goes / what obstacles there are (things that would also come up in a broader discussion), and that will already be difficult enough to sort out now:

jreback commented 9 years ago

I think absolutely these sub-projects should use the existing tests (and pandas testing infrastructure). Doc building could actually be done independently via readthedocs (no cython in these projects). Pandas could maintain linkage from its docs to the sub-projects. Though I think they should physically be separate.

immerrr commented 9 years ago

I think absolutely these sub-projects should use the existing tests

It's not obvious from the phrasing: I hope you mean that they should re-use the infrastructure, e.g. the build matrix, but that they shouldn't run the tests of core pandas, nor should core pandas run the tests of its "plugin" packages. Which brings me to another issue, which I'll describe in a separate GH issue.

Speaking of the build matrix, I'd expect these packages to be tested against pandas master and at least the last major pandas release, or maybe the last two. Testing the previous release would ensure that a fix rushed out for a last-minute incompatible change in a newly released core version doesn't break the major release from several days before, forcing users to upgrade both pandas-io-foo and pandas in lockstep. But then again, if one wants to use a bleeding-edge version of a pandas-io-foo package, they should be able to do the same for pandas itself.

Another issue is that I don't know of a way to include one YAML file inside another, which is compounded by the fact that .travis.yml must reside in the repository root. I'm not yet sure how to make travis/appveyor.yml generic enough to require minimal (ideally no) attention after a project is bootstrapped, yet still rely on some common submodule so that exact versions can be changed with a single commit to a "shared" repo and a single submodule update. Symlinking .travis.yml to a submodule might do, but I'm not sure that works on Windows.
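For concreteness, a minimal sketch of the kind of build matrix discussed here, as a subproject's .travis.yml (entirely illustrative: the version pins, the PANDAS env variable, and the install steps are assumptions, not an actual pandas config):

language: python
python:
  - "2.7"
  - "3.4"
# test against pandas master plus the last two released majors
env:
  - PANDAS=master
  - PANDAS=0.15.2
  - PANDAS=0.14.1
install:
  - if [ "$PANDAS" = "master" ]; then pip install git+https://github.com/pydata/pandas.git; else pip install pandas==$PANDAS; fi
  - pip install -e .
script:
  - nosetests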

femtotrader commented 9 years ago

Hello,

I've been working on a more object-oriented approach to DataReader than the current code,

so it can use either urlopen or requests (for now I'm only using requests, as my goal was to use requests-cache; see https://github.com/pydata/pandas/issues/8713).

See https://github.com/femtotrader/pandas_datareaders. Sorry, I wasn't aware of this issue.

This is just a friendly "fork", please excuse me.

I would be very pleased to have your feedback about this.

hayd commented 9 years ago

+1, it sucks that someone (stuck in the dark ages) using 0.10 or something can't use the data readers (if something's changed in the API since then) without upgrading and potentially breaking/changing lots of their code.

I suspect for the most part the datafeed codebase isn't using any wild/deprecated commands that aren't tested elsewhere in pandas, so IMO testing against master isn't actually that important (and not doing so makes things much easier): just add the pandas version - e.g. 0.16.rc1 - to travis for the package right before a pandas release and see if anything breaks. We could potentially provide a way to test all these packages from within pandas (for a while we should keep the tests around in pandas), but tbh I don't think this is so important: a failure right before a release would be a big edge case (that's why we have RCs), and all you'd have to do is fix up the package and release a new version.

For backwards compat, these have to be dependencies of (and depend on*) pandas, right?

I think the easiest is to add this right here in the pydata umbrella group, or create another group here; that way it can seem more "official".

*though it may be interesting to drop that in one direction, and have pandas as a soft dependency (like https://github.com/changhiskhan/poseidon does).

Does this make sense?


@femtotrader did you copy and paste the classes and tests from pandas, or is this something different? (I worry a little about changing to a requests dependency at the same time as migrating, but +1 on using requests.)

femtotrader commented 9 years ago

@hayd this is something different: in data.py (https://github.com/pydata/pandas/blob/master/pandas/io/data.py) almost everything is a function, not a class (except Yahoo Options). Some people consider that we shouldn't depend on requests. I think that using these classes we can either depend on requests if we want, or use urlopen (from urllib or urllib2).

jnmclarty commented 9 years ago

One important consideration is the effect on recruiting newer contributors. I have points on each of the three options, speaking as a rookie myself. I comment on the docs as well, since @jreback pulled that into this thread.

Carve out Documentation
My recommendation: if anything gets carved out, documentation should be the highest priority.
Carve out Data Sources
My recommendation: add a note to the docs advising that maintainers are wanted. As they step up, the sources can be carved out one by one. For instance, I could likely learn, and volunteer, to be a maintainer for the World Bank stuff, but I wouldn't necessarily want to maintain the other stuff.
Carve out IO
My recommendation: it has its benefits, but I think the costs outweigh them. I think the "dark ages" reference above has merit, but it is trumped by how much more complicated pandas would become.
hayd commented 9 years ago

@jnmclarty The problem with carving out docs is IMO that docs and code are "hard-linked" (changes in the code API mean changing the docs at the same time). I think @jreback is talking about docs for these sub-projects (not pandas docs in general/core).

Note: you can change the docs without compiling; you can even do it within GitHub!

I envision users of older pandas (dark ages) being able to monkey-patch (depending on their code):

# import pandas.io.data as web  # previous way (still works in newer pandas, but backed by the latest data package)
import pandas_data_readers as web  # becomes this, to work in older pandas with the latest data package

This reminds me of @jreback's presentation, "pandas is pydata middleware".

Ripping it out and seeing what happens: https://github.com/hayd/pandas_data_readers (it works without change for pandas 0.13-0.15; there are a few compat import issues pre-0.13, see https://travis-ci.org/hayd/pandas_data_readers/builds/43218562).

Edit: it may be easy to also break apart the compat module - that would make fixing earlier versions easy - not sure if that's feasible though.

jnmclarty commented 9 years ago

@hayd Fair point on the docs. I admit I'm relatively new to open-source planning concepts like this. I do find it all pretty interesting.

bashtage commented 9 years ago

I don't think stata should be considered unstable. It is heavily tested, written against the Stata file format spec, and hasn't had a show-stopper bug* in a number of releases (mostly small things, like increasing file size in a read/write cycle).

*Excluding big-endian issues, which are probably only important for users of big-endian systems, assuming these exist.

femtotrader commented 9 years ago

For doc "Read the docs" is very convenient https://readthedocs.org/. With webhook you can compile doc on server side when you commit changes.

jreback commented 9 years ago

@femtotrader and for a datareader package this will probably work. Readthedocs won't handle a full build that requires things like compiling (that's the reason pandas does not host its docs this way).

hayd commented 9 years ago

@jreback how would readthedocs for a datareader package sit in the pandas docs (URL-wise)? Tbh we could just keep the docs as they are on the pandas side; mostly, updates to the datareaders are just fixes for external API changes?

From the (pretty trivial) exercise I did splitting out the datareaders, I think we should split it - it means people can use "old" pandas with up-to-date (working) datareaders. I think we should do the same for pandas_compat. Happy to put a PR together and see if it just works™.

Not sure how this fits with @femtotrader's thoughts?

femtotrader commented 9 years ago

You really want my opinion ;-)

My DataReader version is better because it supports caching, and "that's useful for me"™ ;-) (but also for some other people)

but yes, I think we should split the datareaders into a separate GitHub project, with a separate pip package and a separate testing and continuous integration process, with an easy doc build using readthedocs, so that updating this fast-moving part will be easier, and have a link in the pandas docs for the datareaders

Maybe we should do this in 2 steps.

First move and keep exactly the same code. Then add my wonderful cache features in a second step ;-) @twiecki seems interested for zipline; see https://github.com/quantopian/zipline/pull/398#issuecomment-69718174

That's just my opinion.

PS: I can grant access to either http://pandas-datareaders.readthedocs.org/en/latest/ or https://github.com/femtotrader/pandas_datareaders; that's not a problem

jreback commented 9 years ago

ok, let's create another package under pydata.

femtotrader commented 9 years ago

I also forgot to say that I can also grant access to https://pypi.python.org/pypi/pandas_datareaders (even ownership, to @wesm and @jreback, because they are the pandas PyPI package owners). I just want this part of pandas to be improved.

Please just tell me what to do now. I will be able to help Friday and part of next week.

Maybe we could move the docs first.

hayd commented 9 years ago

@femtotrader what's the backwards-compat situation with using pandas_datareaders vs pandas?

That is: what we want IMO is to just drop pandas_datareaders (or whatever) into pandas.io, so that we don't break users' code (while it's independent, for now, they must replace from pandas.io import ... with from pandas_datareader import ... and it should "just work"). If we do that, does your lib pass the tests?

Make sense?

(Also, does your lib work on 3.x? You should add that to travis.)

IMO docs can come last, since if we were to change this it could remain completely behind the scenes. I guess I don't know how much of the API you've changed/improved, and so what a good migration strategy is.

wesm commented 9 years ago

FWIW, I have often thought about the fact that pandas is turning into a monolithic "kitchen-sink" package, and it might make sense to componentize a bit (data readers/writers is an obvious one), so that folks can get new data sources without having to upgrade their code bases to account for little API changes (typically where they were accidentally relying on some undocumented, unspecified, or untested behavior that breaks their code). That's happening less and less now. And there's nothing to stop you from having all the DataFrame.to_* methods, though perhaps that's starting to get a bit busy =)
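As an aside, an external package can attach such a method itself; a toy illustration (the to_foo writer is hypothetical, not a real pandas API):

import pandas as pd

def to_foo(self, path):
    # hypothetical writer: here it just delegates to the built-in CSV writer
    self.to_csv(path)

# attach it so df.to_foo(...) works, the way a split-out IO package could
pd.DataFrame.to_foo = to_foo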

femtotrader commented 9 years ago

@hayd my lib won't pass the tests, because in my lib DataReader is an object with a get method. I've never tested it with Python 3; see for example https://github.com/femtotrader/pandas_datareaders/blob/master/tests/test_datareader_google_finance_daily.py. So we should forget my lib for now. My idea was to have a DataReader object which could have an expire_after parameter defined at __init__.

Another example of my DataReader object is here: https://github.com/femtotrader/pandas_datareaders/blob/master/pandas_datareaders/datareaders/openexchangerates.py and https://github.com/femtotrader/pandas_datareaders/blob/master/tests/test_datareader_openexchangerates.py. There is a get method to fetch the currency exchange rate matrix, but also a convert method to convert currencies.
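To make that design concrete, here is a rough sketch of such a class; the names and structure are illustrative guesses, not the actual pandas_datareaders API:

import requests
import requests_cache

class DataReader(object):
    def __init__(self, expire_after=None):
        # With expire_after set, responses are cached in a local sqlite file;
        # otherwise a plain (uncached) session is used.
        if expire_after is not None:
            self.session = requests_cache.CachedSession(
                cache_name='cache', backend='sqlite', expire_after=expire_after)
        else:
            self.session = requests.Session()

    def get(self, *args, **kwargs):
        # Each data source (Google Finance, OpenExchangeRates, ...) would
        # subclass DataReader and implement get() to return a DataFrame.
        raise NotImplementedError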

Maybe that wasn't a good idea... maybe we should keep DataReader as a function (and leave all the current code unmodified). We should just move the current pandas code and docs first.

In a second step, we will be able to add a cache mechanism anyway by passing a session parameter to the DataReader function (defaulting to None), so it should be backward-compatible. For example (the session argument being the proposed addition):

import requests_cache
import pandas.io.data as web  # the session parameter below is the proposed addition

session = requests_cache.CachedSession(cache_name='cache', backend='sqlite', expire_after=60*60)
df = web.DataReader('GOOG', 'google', '2014-01-01', '2015-01-01', session)

What is your opinion about it?

hayd commented 9 years ago

Ok, let's do this. There have been 2 patches since I migrated it a month ago, so I will do the following.

Then @femtotrader can add in what you've learned making pandas_datareaders (hopefully with minimal/no breaking API changes). Caching + sessions sounds great!

I'm not sure we can use pandas_datareaders as the PyPI name, as that conflicts with your package (and will break code for people... including yours!). What about just pandas_data or pd_datareaders?

femtotrader commented 9 years ago

You can take the project name on PyPI; it doesn't matter to me.

We just need to use a version number > 0.0.2 for the official pandas_datareader.

My (very few) users will have to use version 0.0.2 of my DataReader or wait for the official pandas datareader 0.2 (because there will be no cache in 0.1).

femtotrader commented 9 years ago

I've just added @wesm and @jreback as owners of https://pypi.python.org/pypi/pandas_datareaders

jorisvandenbossche commented 9 years ago

On the name, I like "pandas-datareader" more than "pandas_datareader", with e.g. just datareader as the import. But, of course, that's just a matter of taste...

jorisvandenbossche commented 9 years ago

@hayd What do you mean with "make similar lib for compat"?

And with "Replace on pandas"? -> I think we jsut leave it as it is for now, and if everyting is in place just put a deprecation warning on it? (and then remove in a more future release)

On "Get datareaders working on older pandas...", I think up to 0.11 (it is now?) is already very nice (if it would be much effort)

hayd commented 9 years ago

@jorisvandenbossche sorry, the current failure is on 0.12 and earlier (see https://travis-ci.org/hayd/pandas_data_readers/builds/43218562); it's basically just pandas.compat not having certain functions back then - if we made pandas-compat a separate package we wouldn't have that issue (and I think it would work on 0.10+ for free).

I don't see the reason not to just include it; what's the value in deprecate/remove?

jreback commented 9 years ago

I agree with joris here.

I will set up pandas-datareader later in pydata.

Then I would suggest that the first version just be the exact version that pandas has.

If someone wants to modify it later on, no problem.

A PR to replace the current pandas calls with the new pandas-datareader can be done in 0.16, after the new repo is up, tested, and released on pypi/conda.

hayd commented 9 years ago

@jreback the snapshot I did back in December is exactly that; the work is done (modulo a name change)... so I think we're all in agreement here.

As mentioned, there have been three "datareader" commits since then: https://github.com/pydata/pandas/commit/f88f15b60d4b50391b8a188cf05bbc4643d4f494, https://github.com/pydata/pandas/commit/84b4c2d192c1692b8d037dbb85190fbecd6faba2 and https://github.com/pydata/pandas/commit/b65839d8081ac6830cc09f44f697178817a356d3.

jreback commented 9 years ago

I created this new repo: https://github.com/pydata/pandas-datareader and invited everyone on this issue (and all of the pandas-dev members). You should all now have push/pull access. I'll patch in the travis hook.

jnmclarty commented 9 years ago

because it supports cache

@femtotrader Uncanny timing.

hayd commented 9 years ago

I'm going to force-push my repo to master; hope no-one complains too much... I can correct it (whatever the correct thing to do is) if it's an issue.

@jreback Could you look over the setup.py, change the author etc., and push to PyPI? (Also, please can you turn on travis? Edit: you already did!)

I've fixed the tests and updated with the recent commits, except https://github.com/pydata/pandas/commit/f88f15b60d4b50391b8a188cf05bbc4643d4f494, as it's a (small) API change so it wasn't clear what to do there. We should probably just apply it.

femtotrader commented 9 years ago

I renamed my package pandas_datareaders_unofficial. I will delete the old one on PyPI (my very few users will have to install from source), but I'm getting an Amiga-style error (Error 503 backend read error - Guru Meditation).

hayd commented 9 years ago

Needs some readme + docs/ readthedocs love...

I totally missed ga.py and gbq.py - should these be included? Should anything else?

jorisvandenbossche commented 9 years ago

I can look at the docs if you want.

But some other things to discuss:

hayd commented 9 years ago

@jorisvandenbossche No, I was thinking of it as a dependency, e.g. have pandas.io.data = pandas_datareader.data; these can/would be updated by the user. That way there's no need to deprecate or break code (but we still remove the code from pandas).

These are the parts of pandas which periodically break due to external API changes; I kind of think they (data, wb, ga, gbq) belong together (though I share the concern about making it too big - maybe they can be separate; I don't think it matters too much, they're pretty stable atm).

Similarly I think extracting out pandas.compat could be useful for other packages (which depend on pandas).

Edit: thinking about it, I agree gbq doesn't belong in pandas_datareader. But I think it could easily live in a separate package.

femtotrader commented 9 years ago

In my mind ga.py and gbq.py should also be part of pandas-datareader.

I also think that pandas.io.data = pandas_datareader.data, with the datareader as a dependency, is a good idea, so it won't break any code for now (but I'm not a specialist)

jreback commented 9 years ago

All of the modules moved to pandas_datareader are pulling from named web sources. I could buy including ga.py, as it is similar, though it pulls from "private" sources.

gbq.py OTOH is much more like sql/hdf/excel, where it provides a 2-way connection to the data. Yes, it is web-based, but that is incidental.

jorisvandenbossche commented 9 years ago

Agree with @jreback on gbq.py, which is what I tried to say above. There is a clear difference between reading in existing remote data sources and writing/reading to/from a storage place (even if it is remote).

About having pandas.io.data = pandas_datareader.data and then having pandas_datareader as a dependency: I am not fully convinced of this; some concerns:

jreback commented 9 years ago

See the issue here: https://github.com/pydata/pandas-datareader/issues/15

In a nutshell: I think it's important that the first release (say 0.1) of pandas-datareader and 0.16.0 be basically identical. Then pandas-datareader can update at its own pace. And at some point (say 0.17.0) pandas will not ship the data reader stuff and will instead show a nice error message saying to import pandas-datareader.

hayd commented 9 years ago

@jreback any reason for pandas not to do the import for you going forward?

pandas.io.web = data_reader.web

Either as a dependency or as a soft dependency (e.g. throw an error that you need to pip install pandas-datareader to use pandas.io.web) - something like the sketch below? (One aim of pandas-datareader is to be backwards-compatible, and to work on older versions of pandas.)
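For illustration, a minimal sketch of what such a soft-dependency shim inside pandas might look like (a hypothetical pandas/io/data.py; not what pandas actually ships):

# hypothetical pandas/io/data.py: delegate to pandas_datareader if it is
# installed, otherwise fail with a hint on how to get it
try:
    from pandas_datareader.data import DataReader  # re-export the public API
except ImportError:
    raise ImportError("The remote data readers have moved to the "
                      "pandas-datareader package; install it with "
                      "`pip install pandas-datareader`.")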

jreback commented 9 years ago

I think a soft dep is fine, but IIRC the idea was to basically leave the code completely as-is till say 0.17, then change it?

hayd commented 9 years ago

@jreback sounds good!

jorisvandenbossche commented 9 years ago

Another question: what about the docs? (I started on that in https://github.com/pydata/pandas-datareader/pull/18, but was not fully sure about the path to follow.)

Should they also stay in the pandas docs, or refer to the pandas-datareader docs? Which import do we use there - the "real" pandas_datareader one or the pandas.io.data one? In any case, pandas-datareader itself should have docs, but having a duplicate version in pandas is not a good way to go, I think. I still think that keeping pandas.io.data as the import path in the long run (even when the code is removed from the pandas codebase itself) will only cause confusion.

jreback commented 9 years ago

closed by #10870