unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
16.58k stars 1.23k forks

Error: name 'CustomHTML2Text' is not defined #259

Closed Jeferson100 closed 1 week ago

Jeferson100 commented 1 week ago

I was testing the notebook from the video on my machine and encountered the following error: [ERROR] 🚫 Failed to crawl https://www.eu-startups.com/directory/, error: name 'CustomHTML2Text' is not defined

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy

# Warm up the crawler (load necessary models)
crawler.warmup()

[LOG] 🌤️  Warming up the WebCrawler
[LOG] 🌞 WebCrawler is ready to crawl

# Run the crawler on a URL
result = crawler.run(
    url = "https://www.eu-startups.com/directory/"
)

# Print the extracted content
print(result.markdown)

[LOG] 🚀 Crawling done for https://www.eu-startups.com/directory/, success: True, time taken: 4.21 seconds
[ERROR] 🚫 Failed to crawl https://www.eu-startups.com/directory/, error: name 'CustomHTML2Text' is not defined
None
Python version: 3.10.11
System: Windows
Machine: AMD64
Platform: Windows-10-10.0.22631-SP0
Processor: Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
Python build: ('tags/v3.10.11:7d4cc5a', 'Apr  5 2023 00:38:17')
Package                   Version
------------------------- -----------
aiohappyeyeballs          2.4.3
aiohttp                   3.10.10
aiosignal                 1.3.1
aiosqlite                 0.20.0
altair                    5.4.1
annotated-types           0.7.0
anyio                     4.6.2.post1
asgiref                   3.8.1
asttokens                 2.4.1
async-timeout             4.0.3
attrs                     24.2.0
Automat                   24.8.1
backoff                   2.2.1
beautifulsoup4            4.12.3
blinker                   1.8.2
Brotli                    1.1.0
cachetools                5.5.0
certifi                   2024.8.30
cffi                      1.17.1
chardet                   5.2.0
charset-normalizer        3.4.0
click                     8.1.7
colorama                  0.4.6
comm                      0.2.2
constantly                23.10.4
Crawl4AI                  0.3.73
cryptography              43.0.3
cssselect                 1.2.0
dataclasses-json          0.6.7
debugpy                   1.8.7
decorator                 5.1.1
defusedxml                0.7.1
deprecation               2.1.0
distro                    1.9.0
emoji                     2.14.0
eval_type_backport        0.2.0
exceptiongroup            1.2.2
executing                 2.1.0
faiss-cpu                 1.9.0
fake-http-header          0.3.5
fastapi                   0.115.4
filelock                  3.16.1
filetype                  1.2.0
Flask                     3.0.3
frozenlist                1.5.0
fsspec                    2024.10.0
gitdb                     4.0.11
GitPython                 3.1.43
greenlet                  3.0.3
groq                      0.11.0
h11                       0.14.0
html2text                 2024.2.26
html5lib                  1.1
httpcore                  1.0.6
httpx                     0.27.2
httpx-sse                 0.4.0
huggingface-hub           0.26.2
hyperlink                 21.0.0
idna                      3.10
importlib_metadata        8.5.0
incremental               24.7.2
iniconfig                 2.0.0
ipykernel                 6.29.5
ipython                   8.29.0
itemadapter               0.9.0
itemloaders               1.3.2
itsdangerous              2.2.0
jedi                      0.19.1
Jinja2                    3.1.4
jiter                     0.7.1
jmespath                  1.0.1
joblib                    1.4.2
jsonpatch                 1.33
jsonpath-python           1.0.6
jsonpointer               3.0.0
jsonschema                4.23.0
jsonschema-specifications 2024.10.1
jupyter_client            8.6.3
jupyter_core              5.7.2
kagglehub                 0.3.3
lancedb                   0.15.0
langchain                 0.3.7
langchain-community       0.3.5
langchain-core            0.3.15
langchain-groq            0.2.1
langchain-text-splitters  0.3.2
langdetect                1.0.9
langsmith                 0.1.140
litellm                   1.52.6
loguru                    0.7.2
lxml                      5.3.0
markdown-it-py            3.0.0
MarkupSafe                3.0.2
marshmallow               3.23.1
matplotlib-inline         0.1.7
mdurl                     0.1.2
mockito                   1.5.3
mpmath                    1.3.0
multidict                 6.1.0
mypy-extensions           1.0.0
narwhals                  1.13.2
nest-asyncio              1.6.0
networkx                  3.4.2
ngrok                     1.4.0
nltk                      3.9.1
numpy                     1.26.4
olefile                   0.47
openai                    1.54.4
orjson                    3.10.11
outcome                   1.3.0.post0
overrides                 7.7.0
packaging                 24.1
pandas                    2.2.3
parsel                    1.9.1
parso                     0.8.4
pillow                    10.4.0
pip                       23.0.1
platformdirs              4.3.6
playwright                1.47.0
playwright-stealth        1.0.6
pluggy                    1.5.0
prompt_toolkit            3.0.48
propcache                 0.2.0
Protego                   0.3.1
protobuf                  5.28.3
psutil                    6.1.0
pure_eval                 0.2.3
pyarrow                   18.0.0
pyasn1                    0.6.1
pyasn1_modules            0.4.1
pycparser                 2.22
pydantic                  2.9.2
pydantic_core             2.23.4
pydantic-settings         2.6.1
pydeck                    0.9.1
PyDispatcher              2.0.7
pyee                      12.0.0
Pygments                  2.18.0
pylance                   0.19.1
pyOpenSSL                 24.2.1
pypdf                     5.1.0
PySocks                   1.7.1
pytest                    8.3.3
pytest-mockito            0.0.4
python-dateutil           2.8.2
python-dotenv             1.0.1
python-iso639             2024.10.22
python-magic              0.4.27
python-oxmsg              0.0.1
pytz                      2024.2
pywin32                   308
PyYAML                    6.0.2
pyzmq                     26.2.0
queuelib                  1.7.0
RapidFuzz                 3.10.1
referencing               0.35.1
regex                     2024.9.11
requests                  2.32.3
requests-file             2.1.0
requests-toolbelt         1.0.0
rich                      13.9.4
rpds-py                   0.20.1
safetensors               0.4.5
scikit-learn              1.5.2
scipy                     1.14.1
Scrapy                    2.11.2
selenium                  4.26.1
sentence-transformers     3.2.1
service-identity          24.2.0
setuptools                65.5.0
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.1
sortedcontainers          2.4.0
soupsieve                 2.6
SQLAlchemy                2.0.35
stack-data                0.6.3
starlette                 0.41.2
streamlit                 1.39.0
sympy                     1.13.1
tavily-python             0.5.0
tenacity                  9.0.0
tf-playwright-stealth     1.0.3
threadpoolctl             3.5.0
tiktoken                  0.8.0
tldextract                5.1.3
tokenizers                0.20.3
toml                      0.10.2
tomli                     2.1.0
torch                     2.5.1
tornado                   6.4.1
tqdm                      4.67.0
traitlets                 5.14.3
transformers              4.46.2
trio                      0.27.0
trio-websocket            0.11.1
Twisted                   24.10.0
typing_extensions         4.12.2
typing-inspect            0.9.0
tzdata                    2024.2
unstructured              0.16.5
unstructured-client       0.27.0
urllib3                   2.2.3
uvicorn                   0.29.0
w3lib                     2.2.1
watchdog                  5.0.3
wcwidth                   0.2.13
webencodings              0.5.1
websocket-client          1.8.0
Werkzeug                  3.1.1
win32-setctime            1.1.0
wrapt                     1.16.0
wsproto                   1.2.0
yarl                      1.17.1
zipp                      3.21.0
zope.interface            7.1.1
cokecanpapi commented 1 week ago

I'm facing the same issue; hopefully it is resolved soon.

unclecode commented 1 week ago

@Jeferson100 @cokecanpapi Thanks for trying the library. The error occurs because you're using the synchronous version of Crawl4AI, which we no longer maintain. A temporary fix for those who still want to use the sync version will be available in version 0.3.74, which will be released today or tomorrow. That said, we no longer support the sync version and may deprecate it soon. Please switch to the asynchronous version, as it's much faster.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://www.nbcnews.com/business")

        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())
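
For anyone staying on the sync version for now, a quick way to tell whether the installed package already includes the fix is to compare its version against 0.3.74. The snippet below is only a rough sketch: the helper names (`parse_version`, `has_sync_fix`, `installed_crawl4ai_version`) are made up for illustration and are not part of Crawl4AI, and the comparison only looks at the leading numeric parts of the version string.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = "0.3.74"  # version named in the comment above


def parse_version(text):
    """Turn '0.3.73' into (0, 3, 73); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_sync_fix(installed, fixed_in=FIXED_IN):
    """True if the installed version is at least the version carrying the fix."""
    return parse_version(installed) >= parse_version(fixed_in)


def installed_crawl4ai_version():
    """Return the installed version string, or None if the package is missing."""
    try:
        return version("crawl4ai")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_crawl4ai_version()
    if current is None:
        print("crawl4ai is not installed")
    elif has_sync_fix(current):
        print(f"crawl4ai {current} should include the sync fix")
    else:
        print(f"crawl4ai {current} predates {FIXED_IN}; upgrade or use AsyncWebCrawler")
```

With the 0.3.73 install from the original report, this would take the last branch and suggest upgrading or switching to `AsyncWebCrawler`.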
Jeferson100 commented 1 week ago

Thank you, it worked.

cokecanpapi commented 1 week ago

Thank you!

unclecode commented 6 days ago

You're all welcome. By the way, I assume the sync version is also fixed.