spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License
1.11k stars 96 forks

Retrieve crawled markdown via API #211

Closed: culda closed this issue 2 months ago

culda commented 2 months ago

I ran a crawl from the dashboard and it kept going until my credits ran out. Now I want to query the content that was scraped, but I'm not sure how. A record from the crawl looks like this:

```python
{'id': 'dbae218e-753a-489a-8893-66eb65b85fa3',
 'user_id': '5efa2ec1-4bc0-4047-bc4f-1901ef695ee6',
 'url': '5efa2ec1-4bc0-4047-bc4f-1901ef695ee6/www.gov.uk/www_uk/10002090080147424289.md',
 'domain': 'www.gov.uk',
 'created_at': '2024-09-01T11:25:34.315714+00:00',
 'updated_at': '2024-09-01T11:25:34.315714+00:00',
 'pathname': '/world/travelling-to-the-democratic-republic-of-the-congo',
 'fts': "'/world/travelling-to-the-democratic-republic-of-the-congo':1",
 'scheme': 'https:',
 'last_checked_at': '2024-09-01T11:25:34.213202+00:00',
 'screenshot': False,
 'status_code': 200}
```

I can see a path to the .md file in the `url` field, but how do I actually download it?
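To be concrete, here is what I imagined: joining the record's relative storage path onto some base URL to fetch the markdown. The base URL below is a guess, not a documented endpoint, and the helper name is made up.

```python
# Sketch: turn a crawl record's storage-style `url` field into a full
# download URL. STORAGE_BASE is hypothetical, not a documented endpoint.
STORAGE_BASE = "https://api.spider.cloud/data/storage"  # assumption

def markdown_url(record: dict) -> str:
    """Join the assumed storage base with the record's relative .md path."""
    return f"{STORAGE_BASE}/{record['url']}"

record = {
    "url": "5efa2ec1-4bc0-4047-bc4f-1901ef695ee6/www.gov.uk/www_uk/10002090080147424289.md",
}
print(markdown_url(record))
# → https://api.spider.cloud/data/storage/5efa2ec1-4bc0-4047-bc4f-1901ef695ee6/www.gov.uk/www_uk/10002090080147424289.md
```

Is something along these lines possible, or is the file only reachable another way?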

Or am I supposed to set up webhooks and store the data in my own database as the crawl happens?
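If webhooks are the intended route, this is roughly what I had in mind for the receiving side: upsert each crawled-page payload into my own store, keyed by its `id`. The payload shape is assumed from the record above; the function name is made up.

```python
# Sketch of a webhook handler body: keep each crawl record in a local
# store keyed by page id (assumed unique per page). In practice this
# would write to a real database instead of a dict.
from typing import Dict

store: Dict[str, dict] = {}

def handle_webhook(payload: dict) -> None:
    """Upsert the crawl record so re-deliveries overwrite cleanly."""
    store[payload["id"]] = payload

handle_webhook({
    "id": "dbae218e-753a-489a-8893-66eb65b85fa3",
    "pathname": "/world/travelling-to-the-democratic-republic-of-the-congo",
    "status_code": 200,
})
```

Is that the recommended pattern, or can the markdown be queried after the fact?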

Thanks