pgaref / HTTP_Request_Randomizer

Proxying Python Requests
http://pgaref.com/blog/python-proxy/
MIT License

How to use with Pandas Datareader? #53

Closed windowshopr closed 5 years ago

windowshopr commented 5 years ago

I'd love some input on how to use this library with pandas-datareader, specifically when pulling stock data from the Yahoo Finance API through DataReader. I can request stock data with DataReader once, but every request after that fails with an error until the next day. My code is simple:

import pandas as pd
import pandas_datareader.data as web
from datetime import datetime
from time import sleep
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

# Tickers whose price history should be downloaded
stocks = [

'USO',
'FXI',
'EEM'

]

start = datetime(2018, 9, 19)
end = datetime.now()

for i in stocks:
    sleep(5)

    try:
        data = web.DataReader(i, 'yahoo', start, end)
        # DataReader returns 'Date' as the index, not a column, so it is not selected here
        df = data[['Open','High','Low','Close','Adj Close','Volume']]
        # Round some of the results to 2 decimal places, then save .csv file
        df = df.round({"Open":2, "High":2, "Low":2, "Close":2, "Adj Close":2})
        df.to_csv(str(i) + '.csv')
        print('Successfully downloaded ' + str(i))
        continue

    except Exception as exc:
        # Print the underlying error instead of silently swallowing it
        print('Failed to download ' + str(i) + ': ' + str(exc))
        continue

So how could one integrate the HTTP randomizer into that? I tried playing around with it a bit but couldn't figure it out. Would it mean somehow replacing the request that DataReader makes with a proxied one?

windowshopr commented 5 years ago

As a follow-up, I found this, which might help someone who knows how to implement it:

https://stackoverflow.com/questions/53946083/setting-a-proxy-for-pandas-datareader

pgaref commented 5 years ago

Hello @windowshopr, thanks for opening the issue. What you are describing seems pretty doable. I implemented #55 as a POC that creates proxied sessions, rather than individual requests, so they can be used directly with pandas. Let me know if that's what you had in mind.
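For readers following along, here is a minimal sketch of the general idea, not the code from #55. It assumes you already have a proxy address as a "host:port" string (for example, one picked from the randomizer's proxy list); `make_proxied_session` and `download` are hypothetical helper names. It relies on the `session` argument that pandas-datareader's `DataReader` accepts.

```python
import requests


def make_proxied_session(proxy_address, user_agent=None):
    """Return a requests.Session that routes its traffic through one proxy.

    proxy_address is an assumed "host:port" string supplied by the caller.
    """
    session = requests.Session()
    proxy_url = "http://" + proxy_address
    # requests chooses the proxy entry by the scheme of the target URL.
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    if user_agent:
        session.headers["User-Agent"] = user_agent
    return session


def download(ticker, start, end, session):
    """Fetch one ticker from Yahoo through the given session."""
    # Imported lazily so the session helper works without pandas-datareader.
    import pandas_datareader.data as web

    # DataReader uses the supplied session for its HTTP calls.
    return web.DataReader(ticker, "yahoo", start, end, session=session)
```

In the loop from the original question, this would mean building a session once per proxy and calling `download(i, start, end, session)` in place of the direct `web.DataReader` call. Whether that avoids the rate limit depends on the proxies being reachable.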

windowshopr commented 5 years ago

@pgaref That looks exactly like what I'm after!

I am running into an issue though with the line:

self.userAgent = UserAgentManager(file=os.path.join(os.path.dirname('__file__'), './user_agents.txt'))

...which gives me an error:

TypeError: __init__() got an unexpected keyword argument 'file'

...so should I just take that argument out and call UserAgentManager() with no arguments? Thanks!

pgaref commented 5 years ago

Hey @windowshopr -- the path should be relative so it should be something like: https://github.com/pgaref/HTTP_Request_Randomizer/blob/78b305a3440f33cfd0caddb8ddf41b5eea974c68/http_request_randomizer/requests/proxy/requestProxy.py#L38
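As a rough illustration of building a path relative to a module (a sketch only; `data_file_path` is a hypothetical helper, and the file names are placeholders): note that quoting `'__file__'` turns it into an ordinary string whose directory part is empty, so the result is resolved against the current working directory instead of the module's folder.

```python
import os


def data_file_path(module_file, filename):
    """Join filename onto the directory that contains module_file."""
    return os.path.join(os.path.dirname(module_file), filename)


# Inside a module you would pass the real __file__ variable:
#     data_file_path(__file__, "user_agents.txt")
# Passing the literal string "__file__" gives an empty directory part:
quoted = data_file_path("__file__", "user_agents.txt")
unquoted = data_file_path(os.path.join("pkg", "mod.py"), "user_agents.txt")
```

Here `quoted` is just the bare file name, while `unquoted` points into the module's directory.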

If you use the default (empty) UserAgentManager constructor, it will fall back to the fake-useragent library, which is also fine (you will notice some log.warn messages).

Other than that, let me know if you face any other issues, and I can add the pandas-session functionality with some tests in the next release.

windowshopr commented 5 years ago

Right on! I was able to get it working. Thanks a lot for the help!