[Bug]: SimpleWebPageReader unable to read ".asp" web pages

tituslhy commented 2 weeks ago

Bug Description

I tried using LlamaIndex's SimpleWebPageReader to get documents from Investopedia, but the documents returned by ".load_data()" were incorrect.

Version

10.50.52

Steps to Reproduce

from llama_index.readers.web import SimpleWebPageReader

links = [
    "https://www.investopedia.com/terms/s/stockmarket.asp",
    "https://www.investopedia.com/ask/answers/difference-between-options-and-futures/",
    "https://www.investopedia.com/financial-edge/0411/5-essential-things-you-need-to-know-about-every-stock-you-buy.aspx",
    "https://www.investopedia.com/articles/fundamental/04/063004.asp",
    "https://www.investopedia.com/terms/t/technicalanalysis.asp"  
]

docs = SimpleWebPageReader(
    html_to_text = True
).load_data(urls=links)

Relevant Logs/Tracebacks

If you print the docs, you get this (notice that 'text' is either empty or 'Signal - Not Acceptable' instead of the page content):

[Document(id_='https://www.investopedia.com/terms/s/stockmarket.asp', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='https://www.investopedia.com/ask/answers/difference-between-options-and-futures/', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Signal - Not Acceptable', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='https://www.investopedia.com/financial-edge/0411/5-essential-things-you-need-to-know-about-every-stock-you-buy.aspx', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Signal - Not Acceptable', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='https://www.investopedia.com/articles/fundamental/04/063004.asp', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Signal - Not Acceptable', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='https://www.investopedia.com/terms/t/technicalanalysis.asp', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Signal - Not Acceptable', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]

dosubot[bot] commented 2 weeks ago

The issue with acquiring documents from Investopedia using LlamaIndex's SimpleWebPageReader is likely due to missing headers in the underlying requests.get call. Many websites, apparently including Investopedia, expect a browser-like User-Agent header and return an error page or an empty body when it is absent.
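
As a rough way to see what the server receives, the sketch below prepares a request without sending it and reads back the User-Agent that requests would attach. The helper name `outgoing_user_agent` and the example.com URL are illustrative assumptions, not part of LlamaIndex.

```python
import requests


def outgoing_user_agent(headers=None):
    """Return the User-Agent requests would send; no network call is made."""
    session = requests.Session()
    # Hypothetical placeholder URL: the request is only prepared, never sent.
    request = requests.Request("GET", "https://example.com/", headers=headers)
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


# Without custom headers, requests identifies itself as "python-requests/<version>".
default_ua = outgoing_user_agent()
# With a custom header, the browser-like value replaces the default.
custom_ua = outgoing_user_agent({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

print(default_ua)
print(custom_ua)
```

A site that filters on this header can tell the default value apart from a browser at a glance, which would be consistent with the 'Signal - Not Acceptable' text in the output above.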

To fix this, you can subclass SimpleWebPageReader and override its load_data method so that each request includes appropriate headers. Here is an example of how you can do this:

import requests
from llama_index.core import Document  # needed to build the returned documents
from llama_index.readers.web import SimpleWebPageReader

class CustomWebPageReader(SimpleWebPageReader):
    def load_data(self, urls):
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }
        for url in urls:
            response = requests.get(url, headers=headers).text
            if self.html_to_text:
                import html2text
                response = html2text.html2text(response)

            metadata = None
            if self._metadata_fn is not None:
                metadata = self._metadata_fn(url)

            documents.append(Document(text=response, id_=url, metadata=metadata or {}))

        return documents

links = [
    "https://www.investopedia.com/terms/s/stockmarket.asp",
    "https://www.investopedia.com/ask/answers/difference-between-options-and-futures/",
    "https://www.investopedia.com/financial-edge/0411/5-essential-things-you-need-to-know-about-every-stock-you-buy.aspx",
    "https://www.investopedia.com/articles/fundamental/04/063004.asp",
    "https://www.investopedia.com/terms/t/technicalanalysis.asp"  
]

docs = CustomWebPageReader(
    html_to_text=True
).load_data(urls=links)

This modification sets the User-Agent header, which should help in retrieving the correct content from Investopedia ^[1].
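
To confirm whether the header actually helped, a quick sanity check on the extracted text can flag pages that still came back empty or as a short refusal message. This is only a heuristic sketch: the helper name `find_suspicious` and the 200-character threshold are assumptions chosen for illustration, based on the 'Signal - Not Acceptable' output reported in this issue.

```python
def find_suspicious(text_by_url, min_chars=200):
    """Return URLs whose extracted text is empty or suspiciously short.

    min_chars is an arbitrary, assumed threshold: a real article is usually
    thousands of characters, while a refusal message is a few dozen.
    """
    suspicious = []
    for url, text in text_by_url.items():
        if len(text.strip()) < min_chars:
            suspicious.append(url)
    return suspicious


# Sample values mirror the texts shown in the issue's output.
sample = {
    "https://www.investopedia.com/terms/s/stockmarket.asp": "",
    "https://www.investopedia.com/terms/t/technicalanalysis.asp": "Signal - Not Acceptable",
    "https://example.com/long-article": "word " * 500,  # hypothetical healthy page
}
flagged = find_suspicious(sample)
print(flagged)
```

With the reader's output, the mapping could be built as `{doc.id_: doc.text for doc in docs}`.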

To continue talking to Dosu, mention @dosu.

tituslhy commented 2 weeks ago

@dosu This approach yields something, but there is a lot of unnecessary information, such as:

(https://www.investopedia.com/personal-loans-4689729)\n    * [ Insurance ](https://www.investopedia.com/insurance-4427716)\n    * [ Mortgages ](https://www.investopedia.com/mortgage-4689703)\n    * [ Credit and Debt ](https://www.investopedia.com/credit-and-debt-4689724)\n    * [ Student Loans ](https://www.investopedia.com/student-loans-4689727)\n    * [ Taxes ](https://www.investopedia.com/taxes-4427724)\n    * [ Credit Cards ](https://www.investopedia.com/credit-card-4689721)\n    * [ Financial Literacy ](https://www.investopedia.com/financial-literacy-resource-center-7151950)\n

Is there a way to remove all the [text](URL) links when ingesting the pages?

dosubot[bot] commented 2 weeks ago

Yes, you can strip these link patterns by adding a _remove_links helper to a SimpleWebPageReader subclass and calling it from load_data. Here is an updated version that uses a regular expression to match and remove them:

import re

import requests
from llama_index.core import Document  # needed to build the returned documents
from llama_index.readers.web import SimpleWebPageReader

class CustomWebPageReader(SimpleWebPageReader):
    def _remove_links(self, string) -> str:
        """Removes Markdown links of the form [text](URL) and bare URLs from a string."""
        # html2text renders anchors as [text](URL); also match a leading
        # list bullet so that removed menu entries do not leave empty bullets
        link_pattern = r"(?:^[ \t]*\*[ \t]+)?\[[^\]]*\]\(\s*https?://[^)]*\)"
        # Remove the matched links
        string = re.sub(link_pattern, "", string, flags=re.MULTILINE)

        # Remove any bare URLs that remain
        url_pattern = r"https?://\S+"
        return re.sub(url_pattern, "", string)

    def load_data(self, urls):
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }
        for url in urls:
            response = requests.get(url, headers=headers).text
            if self.html_to_text:
                import html2text
                response = html2text.html2text(response)
                response = self._remove_links(response)

            metadata = None
            if self._metadata_fn is not None:
                metadata = self._metadata_fn(url)

            documents.append(Document(text=response, id_=url, metadata=metadata or {}))

        return documents

links = [
    "https://www.investopedia.com/terms/s/stockmarket.asp",
    "https://www.investopedia.com/ask/answers/difference-between-options-and-futures/",
    "https://www.investopedia.com/financial-edge/0411/5-essential-things-you-need-to-know-about-every-stock-you-buy.aspx",
    "https://www.investopedia.com/articles/fundamental/04/063004.asp",
    "https://www.investopedia.com/terms/t/technicalanalysis.asp"  
]

docs = CustomWebPageReader(
    html_to_text=True
).load_data(urls=links)

This modification is intended to remove the link patterns from the content before it is processed further ^[1].

run-llama / llama_index