indrajithi / tiny-web-crawler

A simple and easy to use web crawler for Python
MIT License

Feature: Add option to return the crawled website body in the response #8

Closed indrajithi closed 1 week ago

indrajithi commented 2 weeks ago

Currently we do not return the HTML body of the crawled sites. We only return the links we find.

E.g.:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}

This feature adds an option to return the HTML body as well. The result should look like this:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>stuff</html>"
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>other stuff</html>"
    }
}
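
A minimal sketch of how a per-page entry with an optional body could be built. It is not the project's actual implementation: the names `LinkCollector`, `build_page_entry`, and the `include_body` flag are hypothetical and chosen for illustration, and it uses only the standard library rather than whatever parser the crawler relies on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def build_page_entry(page_url, html, include_body=False):
    """Return {"urls": [...]} and, only if requested, the raw HTML as "body".

    `include_body` is an assumed option name, not a documented one.
    """
    collector = LinkCollector(page_url)
    collector.feed(html)
    entry = {"urls": collector.urls}
    if include_body:
        entry["body"] = html
    return entry


# Usage with an in-memory page, so no network access is needed.
sample_html = (
    '<html><a href="/solutions/ci-cd">CI/CD</a>'
    '<a href="https://githubuniverse.com/">Universe</a></html>'
)
result = {
    "http://github.com": build_page_entry(
        "http://github.com", sample_html, include_body=True
    )
}
```

Keeping the body behind a flag that defaults to off would leave the existing output unchanged for current users and avoid holding every page's HTML in memory unless asked.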

devavinothm commented 2 weeks ago

I have solved this issue. Please check it out: https://github.com/indrajithi/tiny-web-crawler/pull/14#issue-2354558299

indrajithi commented 2 weeks ago

@devavinothm I think you meant this issue.

Mews commented 1 week ago

Can I be assigned this?