run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License

Substack scraper not working #60

Closed: kektobiologist closed this issue 1 year ago

kektobiologist commented 1 year ago

Following the example given here https://llamahub.ai/l/web-beautiful_soup_web, when I try to use the _substack_reader method to parse a substack:

from llama_index import download_loader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=['https://ijeomaoluo.substack.com/p/healing-isnt-easy'], custom_hostname="substack.com")

I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".modules/web-beautiful_soup_web.py", line 136, in load_data
    data, metadata = self.website_extractor[hostname](soup, url)
TypeError: _substack_reader() takes 1 positional argument but 2 were given

This looks like a regression introduced in commit https://github.com/emptycrown/llama-hub/commit/23ae4928cb00934e789dce44eba2159322518c18, so the fix should just be to change the method signature to take 2 arguments instead of 1.
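To illustrate the mismatch, here is a minimal, self-contained sketch. It is not the loader's actual source: `broken_substack_reader`, `fixed_substack_reader`, and `dispatch` are hypothetical stand-ins that only mirror the call shape from the traceback, where the extractor for a hostname is called with `(soup, url)`.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-in for an extractor that still accepts only the parsed page.
def broken_substack_reader(soup: Any) -> Tuple[str, Dict[str, Any]]:
    return str(soup), {}

# Hypothetical stand-in with the second parameter the caller now passes.
def fixed_substack_reader(soup: Any, url: str) -> Tuple[str, Dict[str, Any]]:
    return str(soup), {"url": url}

def dispatch(
    extractors: Dict[str, Callable[..., Tuple[str, Dict[str, Any]]]],
    hostname: str,
    soup: Any,
    url: str,
) -> Tuple[str, Dict[str, Any]]:
    # Same call shape as in the traceback: extractor(soup, url).
    return extractors[hostname](soup, url)

page = "<html>post body</html>"  # placeholder for a parsed page
post_url = "https://example.substack.com/p/some-post"  # placeholder URL

# The one-argument extractor fails as soon as it is called with two arguments.
try:
    dispatch({"substack.com": broken_substack_reader}, "substack.com", page, post_url)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# The two-argument extractor works with the same call.
text, metadata = dispatch(
    {"substack.com": fixed_substack_reader}, "substack.com", page, post_url
)
print(error_message)
print(metadata)
```

The first call raises the same kind of `TypeError` as in the traceback above, and the second succeeds once the extractor accepts both `soup` and `url`.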

emptycrown commented 1 year ago

Good call! I’ll push a fix in a bit.


emptycrown commented 1 year ago

Fixed here.