Feature Request: Crazy Idea to gain responsiveness, distribute scrapes, and ease load on Yahoo servers

KIC commented 1 year ago

Describe the problem With all these recent issues like spam blocks and encryption changes, I would like to start a discussion about a solution that eases the load on the Yahoo servers, even if it is maybe a bit crazy.

Describe the solution I am not quite sure yet about the implementation details, but I would like to start a discussion about the following idea. I would assume that, statistically, the same tickers are queried over and over again by thousands of different users. What if we could build a decentralized cache? Whenever someone calls an API, we first check the cache, then download only the missing data and update the shared cache. This would ease the load on the Yahoo servers by a lot, and as a community we would gain a fast, queryable data source. On top of that, we might even end up with a more or less complete financial database, which is ideal if you need to build a screener.

Please don't shoot me. I know this is a crazy idea that goes far beyond the current yfinance library, and of course this would be a rather big project. However, isn't that the dream of all of us, having a low-latency, full local copy? As I don't want to start such a big project all by myself, and I have no other idea where to start such a discussion, I am brave enough to post this here. I assume there is a lot of peer-to-peer software which could be reused (like p2p-python). Distributing the requests among thousands of IPs could help us not get blocked, and a local cache would ease the load on the Yahoo servers and allow us to get full sets of data blazing fast.

The rough idea would be that a daemon is needed as a p2p node which:

runs a local SQLite database
proxies every client request through the local node
answers queries from the SQLite database where the data is available locally
fetches missing data from Yahoo, adds it to the database, and sends it as a message to all the peers (which update their databases)
sends the combined local and remote data back to the client

And since SQLite implements the MySQL "REPLACE INTO" syntax, we don't need to worry about race conditions, at least not for a prototype. Maybe we also add filters for the countries or symbols you are interested in, to reduce the local database size. But as I mentioned, the implementation details are not clear yet.
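To make the node idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `CacheNode` class, the `quotes` table layout, and the `fetch` callback (a stand-in for whatever actually downloads from Yahoo) are hypothetical, and peers are plain in-process objects rather than real p2p connections. It only shows the flow from the list above: query locally, fetch what is missing, store it with REPLACE INTO, and push it to the peers.

```python
import sqlite3
from typing import Callable

Row = tuple[str, str, float]  # (symbol, date, close) -- hypothetical schema


class CacheNode:
    """Hypothetical local node: a SQLite cache plus a list of peers."""

    def __init__(self, fetch: Callable[[str, list[str]], list[Row]],
                 db_path: str = ":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS quotes ("
            "symbol TEXT, date TEXT, close REAL, "
            "PRIMARY KEY (symbol, date))"
        )
        self.fetch = fetch                  # stand-in for the real download
        self.peers: list["CacheNode"] = []  # stand-in for real p2p peers

    def store(self, rows: list[Row]) -> None:
        # REPLACE INTO overwrites any row with the same primary key, so
        # receiving the same data twice (locally or from a peer) is harmless.
        with self.db:
            self.db.executemany("REPLACE INTO quotes VALUES (?, ?, ?)", rows)

    def get(self, symbol: str, dates: list[str]) -> list[Row]:
        marks = ",".join("?" * len(dates))
        cached = self.db.execute(
            "SELECT symbol, date, close FROM quotes "
            f"WHERE symbol = ? AND date IN ({marks})",
            [symbol, *dates],
        ).fetchall()
        have = {row[1] for row in cached}
        missing = [d for d in dates if d not in have]
        if missing:
            fresh = self.fetch(symbol, missing)  # only the gaps are downloaded
            self.store(fresh)
            for peer in self.peers:              # "message" to all the peers
                peer.store(fresh)
            cached += fresh
        return sorted(cached)


# Usage with a fake download function that records what was requested.
calls: list[tuple[str, list[str]]] = []

def fake_fetch(symbol: str, dates: list[str]) -> list[Row]:
    calls.append((symbol, list(dates)))
    return [(symbol, d, 100.0) for d in dates]

node_a = CacheNode(fake_fetch)
node_b = CacheNode(fake_fetch)
node_a.peers.append(node_b)
node_a.get("MSFT", ["2023-01-03", "2023-01-04"])  # downloads both dates
node_b.get("MSFT", ["2023-01-03"])                # served from node_b's cache
```

Note that REPLACE INTO here only gives "last write wins" semantics per (symbol, date) key. That is probably good enough for a prototype, but it does not decide which of two conflicting values from different peers is the correct one.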

ValueRaider commented 1 year ago

Yahoo probably won't like others distributing their data, to put it mildly.

Rolling your own personal caching isn't too difficult, and some already do.

KIC commented 1 year ago

I do as well, but what is the difference between me and thousands of others each downloading everything and storing it, and all of us downloading it together and storing it? At least all the metadata like "info", "profile", etc. is explicitly allowed to be scraped in the robots.txt and sitemap.xml (from which you could get (almost?) all possible symbols, by the way). This would at least leave us with a metadata database, from which everybody would need to fetch the quotes themselves.

ValueRaider commented 1 year ago

The difference is legal not technological.

you must not reproduce, modify, rent, lease, sell, trade, distribute, transmit, broadcast, publicly perform, create derivative works based on, or exploit for any commercial purposes, any portion or use of, or access to, the Services (including content, advertisements, APIs, and software).

williamc1998 commented 1 year ago

The difference is legal not technological.

you must not reproduce, modify, rent, lease, sell, trade, distribute, transmit, broadcast, publicly perform, create derivative works based on, or exploit for any commercial purposes, any portion or use of, or access to, the Services (including content, advertisements, APIs, and software).

Quite worrisome. Does this mean my personal, non-commercial pet project can't be hosted on the web? It's technically "broadcasting" Yahoo data. I'm only really showing share prices, though.

KIC commented 1 year ago

And that is why somehow everybody does it. You can find lots of Yahoo data on sites like dolthub or quandl (which is now owned by Nasdaq). As long as you don't monetize it, you will usually be left alone.

But yes, maybe this is the wrong way to set this up. Maybe we first need a general-purpose p2p cache, and then leave it to every individual whether to fill that cache with yf data.

ValueRaider commented 1 year ago

Unless you have explicit written permission, ...

Quandl talk about acquiring access/distribution rights so probably have an agreement with Yahoo. Don't know about dolthub, but I'm not seeing much data hosted there.

I think this GitHub project is the wrong place to discuss your idea, best to separate the two.

williamc1998 commented 1 year ago

Unless you have explicit written permission, ...

Quandl talk about acquiring access/distribution rights so probably have an agreement with Yahoo. Don't know about dolthub, but I'm not seeing much data hosted there.

I think this GitHub project is the wrong place to discuss your idea, best to separate the two.

Is there any chance of yfinance contacting Yahoo about this? Companies always want to look good supporting open source software. They can save face with the commercial use groups by telling the truth - this is like 5x slower.

ranaroussi / yfinance

Feature Request: Crazy Idea to gain responsiveness, distribute scrapes, and ease load on Yahoo servers #1439