Closed tituslhy closed 2 weeks ago
The issue with acquiring documents from Investopedia using LlamaIndex's SimpleWebPageReader is likely due to the absence of specific headers in the requests.get call. Many websites, including Investopedia, expect headers such as User-Agent to be set in the request before they return the correct content.
To fix this, you can override the load_data method in a subclass of SimpleWebPageReader so that it sends appropriate headers. Here is an example of how you can do this:
import requests
from llama_index.core import Document
from llama_index.readers.web import SimpleWebPageReader


class CustomWebPageReader(SimpleWebPageReader):
    def load_data(self, urls):
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        # Browser-like User-Agent so the site serves the full page
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }
        for url in urls:
            response = requests.get(url, headers=headers).text
            if self.html_to_text:
                import html2text

                response = html2text.html2text(response)
            metadata = None
            if self._metadata_fn is not None:
                metadata = self._metadata_fn(url)
            documents.append(Document(text=response, id_=url, metadata=metadata or {}))
        return documents


links = [
    "https://www.investopedia.com/terms/s/stockmarket.asp",
    "https://www.investopedia.com/ask/answers/difference-between-options-and-futures/",
    "https://www.investopedia.com/financial-edge/0411/5-essential-things-you-need-to-know-about-every-stock-you-buy.aspx",
    "https://www.investopedia.com/articles/fundamental/04/063004.asp",
    "https://www.investopedia.com/terms/t/technicalanalysis.asp"
]
docs = CustomWebPageReader(html_to_text=True).load_data(urls=links)
This modification sets the User-Agent header, which should help in retrieving the correct content from Investopedia [1].
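To see how a User-Agent header is attached to a request without touching the network, here is a minimal sketch that swaps requests for the standard library's urllib. The helper name build_request and the USER_AGENT constant are hypothetical and only for illustration; nothing here is part of LlamaIndex.

```python
from urllib.request import Request

# Assumption: the same browser-like User-Agent string used in the reply above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
)


def build_request(url: str) -> Request:
    """Build a GET request carrying the User-Agent header (nothing is sent here)."""
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("https://www.investopedia.com/terms/s/stockmarket.asp")
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(req.get_method(), req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen with a timeout would perform the actual fetch; whether a given site then returns the full page is not guaranteed.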
To continue talking to Dosu, mention @dosu.
@dosu This approach yields something, but there is a lot of unnecessary information, such as
(https://www.investopedia.com/personal-loans-4689729)\n * [ Insurance ](https://www.investopedia.com/insurance-4427716)\n * [ Mortgages ](https://www.investopedia.com/mortgage-4689703)\n * [ Credit and Debt ](https://www.investopedia.com/credit-and-debt-4689724)\n * [ Student Loans ](https://www.investopedia.com/student-loans-4689727)\n * [ Taxes ](https://www.investopedia.com/taxes-4427724)\n * [ Credit Cards ](https://www.investopedia.com/credit-card-4689721)\n * [ Financial Literacy ](https://www.investopedia.com/financial-literacy-resource-center-7151950)\n
Is there a way to remove all the [text](URL) links when ingesting the page?
Yes, you can remove the [text](URL) link patterns when ingesting the page by adding a _remove_links helper to the CustomWebPageReader subclass and calling it from load_data. Here is an updated version that uses regular expressions to replace each link with its text and strip any remaining bare URLs:
import re

import requests
from llama_index.core import Document
from llama_index.readers.web import SimpleWebPageReader


class CustomWebPageReader(SimpleWebPageReader):
    def _remove_links(self, string) -> str:
        """Replaces [text](URL) links with their text and removes bare URLs."""
        # html2text emits links as [ text ](URL); keep only the link text
        link_pattern = r"!?\[([^\]]*)\]\(\s*https?://[^)]*\)"
        string = re.sub(link_pattern, lambda m: m.group(1).strip(), string)
        # Remove any bare URLs that remain
        url_pattern = r"https?://\S+"
        return re.sub(url_pattern, "", string)

    def load_data(self, urls):
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        # Browser-like User-Agent so the site serves the full page
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }
        for url in urls:
            response = requests.get(url, headers=headers).text
            if self.html_to_text:
                import html2text

                response = html2text.html2text(response)
            # Strip links after the HTML has been converted to text
            response = self._remove_links(response)
            metadata = None
            if self._metadata_fn is not None:
                metadata = self._metadata_fn(url)
            documents.append(Document(text=response, id_=url, metadata=metadata or {}))
        return documents


links = [
    "https://www.investopedia.com/terms/s/stockmarket.asp",
    "https://www.investopedia.com/ask/answers/difference-between-options-and-futures/",
    "https://www.investopedia.com/financial-edge/0411/5-essential-things-you-need-to-know-about-every-stock-you-buy.aspx",
    "https://www.investopedia.com/articles/fundamental/04/063004.asp",
    "https://www.investopedia.com/terms/t/technicalanalysis.asp"
]
docs = CustomWebPageReader(html_to_text=True).load_data(urls=links)
This modification ensures that the [text](URL) link patterns are stripped from the content before it is processed further [1].
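Link-stripping logic can be checked offline against text shaped like the navigation snippet quoted earlier in this thread. The sketch below is only an illustration under stated assumptions: it assumes html2text renders anchors as [ text ](URL), the function names strip_markdown_links and drop_link_only_bullets are hypothetical, and the first sentence of the sample is made up for the demonstration.

```python
import re

# Assumption: links appear as "[ text ](URL)", optionally prefixed with "!" for images.
MD_LINK = re.compile(r"!?\[([^\]]*)\]\(\s*https?://[^)]*\)")
# A bullet line whose entire content is a single link (typical of navigation menus).
LINK_ONLY_BULLET = re.compile(r"^\s*[*+-]\s*\[[^\]]*\]\(\s*https?://[^)]*\)\s*$")


def drop_link_only_bullets(text: str) -> str:
    """Remove bullet lines that contain nothing but one link."""
    kept = [line for line in text.splitlines() if not LINK_ONLY_BULLET.match(line)]
    return "\n".join(kept)


def strip_markdown_links(text: str) -> str:
    """Replace each [text](URL) link with its whitespace-stripped text."""
    return MD_LINK.sub(lambda m: m.group(1).strip(), text)


sample = (
    "Read about [ Taxes ](https://www.investopedia.com/taxes-4427724) here.\n"
    " * [ Insurance ](https://www.investopedia.com/insurance-4427716)\n"
    " * [ Mortgages ](https://www.investopedia.com/mortgage-4689703)\n"
)
cleaned = strip_markdown_links(drop_link_only_bullets(sample))
print(cleaned)
```

Dropping link-only bullets first removes menu entries entirely, while links inside sentences keep their visible text. Whether that trade-off suits a given page is a judgment call, since some article bodies also contain bulleted link lists worth keeping.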
Bug Description
I tried using LlamaIndex's SimpleWebPageReader to get documents from Investopedia, but the documents returned by ".load_data()" were incorrect.
Version
10.50.52
Steps to Reproduce
Relevant Logs/Tracebacks