lvxhnat / pyetfdb-scraper

Scrape ETF data for free - the unofficial python wrapper for ETFDB
https://pypi.org/project/pyetfdb-scraper/
GNU General Public License v3.0
47 stars · 5 forks

403 Error when attempting to use spy = ETF('SPY') #12

Open Cerebex opened 2 months ago

Cerebex commented 2 months ago

Describe the bug
403 Error when attempting to use spy = ETF('SPY')

To Reproduce
Steps to reproduce the behavior:

```python
from pyetfdb_scraper.etf import ETF, load_etfs
spy = ETF('SPY')
```

Expected behavior
The ETF information should be pulled properly.

Additional context
The request times out, and waiting 15 minutes before retrying does not fix the issue.

lvxhnat commented 2 months ago

Hi @Cerebex, I had a deeper look into this issue, and it seems like VettaFi is now using JavaScript-based checks to verify that requests are not coming from bots. This can probably be solved by using Selenium to retrieve the page source, but I am quite busy these few days, so it will take me a while to get to it.
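
A rough, untested sketch of that Selenium approach, in case it helps as a stopgap (this assumes Selenium 4 and a local Chrome install; the SPY URL is just an example):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run without opening a visible browser window
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://etfdb.com/etf/SPY/")
    html = driver.page_source  # rendered HTML, which can then be handed to BeautifulSoup
finally:
    driver.quit()
```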

Will keep you posted when a fix is pushed.

Cerebex commented 2 months ago

Really appreciate it. I found I could get it to work with Selenium, as you stated, but only when I physically opened the browser, which is not ideal.

lvxhnat commented 2 months ago

@Cerebex If I am not wrong, you can run Selenium in headless mode. Are you saying it doesn't work when you do that? Either way, it would be great if you could share that code to help fix this issue. It would be a great help to get a load off my back! :)

lvxhnat commented 1 month ago

Update: I will get this solved sometime in November/December. Apologies to anyone using this package, but I do not have the time to work on it right now.

GentlemanXR commented 1 month ago

Using this as a guide: https://www.zenrows.com/blog/403-web-scraping#set-fake-user-agent

This is not my area of expertise, so I'm not sure whether it's a permanent fix, but it seems to work for me so far.

```python
# etf_scraper.py
# (only __init__ shown; load_user_agents and __request_ticker come from the
# rest of the module)
class ETFScraper(object):

    def __init__(
        self,
        ticker: str,
        user_agent: str = None,
    ):
        self.ticker = ticker
        self.base_url: str = "https://etfdb.com/etf"

        self.user_agents = load_user_agents()
        self.request_headers: dict = {
            ######
            # "User-Agent": user_agent if not user_agent else random.choice(self.user_agents),    <------- replace this line with the line below
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
            "Referer": "https://etfdb.com/etfs/QQQ",
        }
        self.scrape_url: str = f"{self.base_url}/{ticker}"

        soup = self.__request_ticker()

        self.etf_ticker_body_soup = soup.find("div", {"id": "etf-ticker-body"})
```
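
For a quick sanity check of the header change outside the package, the same idea with plain requests might look like the sketch below (the ticker URL is just an example, and there's no guarantee it keeps working if VettaFi changes its checks):

```python
import requests
from bs4 import BeautifulSoup

# Fixed browser-like User-Agent plus a Referer, per the guide linked above.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36"
    ),
    "Referer": "https://etfdb.com/etfs/QQQ",
}

response = requests.get("https://etfdb.com/etf/SPY/", headers=headers, timeout=30)
response.raise_for_status()  # raises if ETFDB still answers with a 403

soup = BeautifulSoup(response.text, "html.parser")
print(soup.find("div", {"id": "etf-ticker-body"}) is not None)
```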