[Self-Host] Is it possible to resolve local html files?

mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

https://firecrawl.dev

GNU Affero General Public License v3.0

19.13k stars 1.48k forks source link

[Self-Host] Is it possible to resolve local html files? #896

Closed Taimin closed 1 week ago

Taimin commented 1 week ago

Describe the Issue I have already downloaded multiple html files. When I try to use a local firecrawl to transform these files into markdown, I got an empty dictionary in 'data'.

To Reproduce

app = FirecrawlApp(api_key="1", api_url="http://localhost:3002/")

crawl_status = app.crawl_url(
  local_html_file_path, 
  params={
    'scrapeOptions': {'formats': ['rawHtml', ]}
  }
)
print(crawl_status)

Expected Behavior The whole html file should be converted to a markdown file.

Screenshots The data is empty.

hemeda3 commented 1 week ago

UP! same here, but iam using the cloud api, I have the files locally, dont want to upload them publicly, want to send file from local to firecrawl server from my laptop

mogery commented 1 week ago

You can do this by hosting the file locally, with python for example: python3 -m http.server

After that, you can point your crawl at http://localhost:8000/<file path relative to the directory you started the server in>

mogery commented 1 week ago

UP! same here, but iam using the cloud api, I have the files locally, dont want to upload them publicly, want to send file from local to firecrawl server from my laptop

We do not support this.

hemeda3 commented 1 week ago

@mogery yes I am aware of this, just wanted to find a trick helping me avoid uplading my private files publicly, at the same time using your cloud API

You can do this by hosting the file locally, with python for example: python3 -m http.server

After that, you can point your crawl at http://localhost:8000/<file path relative to the directory you started the server in>

Thanks, this is super helpful, with NGROk could expose my file publicly but for limited time until it get parsed not very bad solution

Taimin commented 1 week ago

You can do this by hosting the file locally, with python for example: python3 -m http.server

After that, you can point your crawl at http://localhost:8000/<file path relative to the directory you started the server in>

Thanks. It works. Firecrawl needs proper http response in order to read in html files.