unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

Docker Image #155

Open · vikaskookna opened this issue 1 week ago

vikaskookna commented 1 week ago

I created an AWS Lambda Docker image, and it fails on this line: from crawl4ai import AsyncWebCrawler


  "errorMessage": "[Errno 30] Read-only file system: '/home/sbx_user1051'",
  "errorType": "OSError",
  "requestId": "",
  "stackTrace": [
    "  File \"/var/lang/lib/python3.12/importlib/__init__.py\", line 90, in import_module\n    return _bootstrap._gcd_import(name[level:], package, level)\n",
    "  File \"<frozen importlib._bootstrap>\", line 1387, in _gcd_import\n",
    "  File \"<frozen importlib._bootstrap>\", line 1360, in _find_and_load\n",
    "  File \"<frozen importlib._bootstrap>\", line 1331, in _find_and_load_unlocked\n",
    "  File \"<frozen importlib._bootstrap>\", line 935, in _load_unlocked\n",
    "  File \"<frozen importlib._bootstrap_external>\", line 995, in exec_module\n",
    "  File \"<frozen importlib._bootstrap>\", line 488, in _call_with_frames_removed\n",
    "  File \"/var/task/lambda_function.py\", line 3, in <module>\n    from crawl4ai import AsyncWebCrawler\n",
    "  File \"/var/lang/lib/python3.12/site-packages/crawl4ai/__init__.py\", line 3, in <module>\n    from .async_webcrawler import AsyncWebCrawler\n",
    "  File \"/var/lang/lib/python3.12/site-packages/crawl4ai/async_webcrawler.py\", line 8, in <module>\n    from .async_database import async_db_manager\n",
    "  File \"/var/lang/lib/python3.12/site-packages/crawl4ai/async_database.py\", line 8, in <module>\n    os.makedirs(DB_PATH, exist_ok=True)\n",
    "  File \"<frozen os>\", line 215, in makedirs\n",
    "  File \"<frozen os>\", line 225, in makedirs\n"
  ]
}
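
For context, the traceback suggests crawl4ai builds its database directory from the user's home directory at import time and tries to create it, which cannot work on Lambda's read-only filesystem (only /tmp is writable). A minimal sketch of the failing pattern, where the exact path construction is an assumption based on the traceback:

import os
from pathlib import Path

# Assumed to mirror what crawl4ai's async_database module does at import time.
DB_DIR = os.path.join(Path.home(), ".crawl4ai")   # /home/sbx_user1051/.crawl4ai on Lambda
os.makedirs(DB_DIR, exist_ok=True)                # OSError: [Errno 30] Read-only file system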
akamf commented 1 week ago

I had the same issue and bypassed it by setting the DB path to '/tmp/' (the only writable dir in AWS Lambda) before importing the crawl4ai package. My solution:

import os
from pathlib import Path

# Create a writable location under /tmp before crawl4ai is imported.
os.makedirs('/tmp/.crawl4ai', exist_ok=True)
DB_PATH = '/tmp/.crawl4ai/crawl4ai.db'
# Point Path.home() at /tmp so crawl4ai builds its paths there at import time.
Path.home = lambda: Path("/tmp")

from crawl4ai import AsyncWebCrawler

Hope this works for you as well.
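
For anyone wiring this into a function, here is a minimal handler sketch built on the same workaround, assuming the handler.main entry point used later in this thread; the event shape, the arun call, and the result.markdown access follow crawl4ai's documented usage but are untested here:

import asyncio
import os
from pathlib import Path

# Same workaround as above: give crawl4ai a writable home under /tmp before importing it.
os.makedirs('/tmp/.crawl4ai', exist_ok=True)
Path.home = lambda: Path('/tmp')

from crawl4ai import AsyncWebCrawler

async def crawl(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown

def main(event, context):
    # Hypothetical event shape: {"url": "https://example.com"}
    url = event.get("url", "https://example.com")
    return {"markdown": asyncio.run(crawl(url))}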

vikaskookna commented 1 week ago

OK, I will try this @akamf. Did you create a Lambda layer or a Docker image? When I tried with a layer it exceeded the 250 MB limit; how did you manage this?

vikaskookna commented 1 week ago

After doing what you mentioned I got this error:

Error processing https://chatclient.ai: BrowserType.launch: Executable doesn't exist at /home/sbx_user1051/.cache/ms-playwright/chromium-1134/chrome-linux/chrome
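
For what it's worth, that second error is Playwright not finding a browser binary: by default it looks in the home cache (~/.cache/ms-playwright), which on Lambda is read-only and was never populated. One way around it, assuming Chromium is baked into the image at a fixed location such as /ms-playwright-browsers (as in the Dockerfile in the next reply), is to point Playwright there, either with an ENV in the Dockerfile or in code before the crawler starts:

import os

# /ms-playwright-browsers is an assumption; it must match the path used when
# "playwright install chromium" ran during the image build.
os.environ.setdefault("PLAYWRIGHT_BROWSERS_PATH", "/ms-playwright-browsers")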

akamf commented 1 week ago

I created a Docker image where I installed Playwright and its dependencies, and then installed Chromium via Playwright. The Docker image is really big though (because of Playwright, I guess), so I'm currently working on optimizing it.

But our latest Dockerfile looks like this:

# Build stage: install Node.js plus the system libraries Chromium needs,
# then fetch Chromium via Playwright into a fixed path.
FROM amazonlinux:2 AS build

RUN curl -sL https://rpm.nodesource.com/setup_16.x | bash - && \
    yum install -y nodejs gcc-c++ make python3-devel \
    libX11 libXcomposite libXcursor libXdamage libXext libXi libXtst cups-libs \
    libXScrnSaver pango at-spi2-atk gtk3 iputils libdrm nss alsa-lib \
    libgbm fontconfig freetype freetype-devel ipa-gothic-fonts

RUN npm install -g playwright && \
    PLAYWRIGHT_BROWSERS_PATH=/ms-playwright-browsers playwright install chromium

# Runtime stage: the official Lambda Python base image.
FROM public.ecr.aws/lambda/python:3.11

WORKDIR ${LAMBDA_TASK_ROOT}

# Install Python dependencies into the Lambda task root.
COPY requirements.txt .
RUN pip3 install --upgrade pip && \
    pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}" --verbose

# Carry the browser and its shared libraries over from the build stage.
COPY --from=build /usr/lib /usr/lib
COPY --from=build /usr/local/lib /usr/local/lib
COPY --from=build /usr/bin /usr/bin
COPY --from=build /usr/local/bin /usr/local/bin
COPY --from=build /ms-playwright-browsers /ms-playwright-browsers

# Tell Playwright where the browsers were installed.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright-browsers

COPY handler.py .

CMD [ "handler.main" ]

I don't know if this is the best solution, but it works for us. Like I said, I'm working on some optimisation for it.
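
As a side note for local testing: the Lambda Python base image ships the Runtime Interface Emulator, so an image built from a Dockerfile like the one above can be invoked locally before deploying (image name and payload below are just placeholders):

docker build --platform linux/amd64 -t crawl4ai-lambda .
docker run --rm -p 9000:8080 crawl4ai-lambda
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"url": "https://example.com"}'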

vikaskookna commented 1 week ago

Thanks @akamf. I tried this but it gave me these errors. I'm using an M1 Mac and built the image using this command:

docker build --platform linux/amd64 -t fetchlinks .

/var/task/playwright/driver/node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /var/task/playwright/driver/node)
/var/task/playwright/driver/node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /var/task/playwright/driver/node)
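
A "GLIBC_x.y not found" error like this usually means a binary needs a newer glibc than the runtime image provides; here it is the Node driver bundled with the pip-installed playwright package running on the Amazon Linux 2 based python:3.11 Lambda image (glibc 2.26). One possible direction, offered only as an untested sketch, is to move the runtime stage to the AL2023-based Python 3.12 image, which ships a newer glibc; the system libraries copied from the amazonlinux:2 build stage would likely need to come from a matching newer base as well:

# Sketch only: swap the runtime base image; the copied build-stage libraries would need revisiting too.
FROM public.ecr.aws/lambda/python:3.12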

unclecode commented 1 week ago

Hi @vikaskookna @akamf

By next week, I will create the Dockerfile and also upload the Docker image to Docker Hub. I hope this helps you as well.