samclarke / robots-parser

NodeJS robots.txt parser with support for wildcard (*) matching.

The library should validate the document before processing it #34

Open sneko opened 10 months ago

sneko commented 10 months ago

Hi @samclarke ,

I have a script that watches multiple robots.txt files from websites, but in some cases a site has no robots.txt and still serves fallback content. The issue is that your library will report isAllowed() -> true even if HTML code is passed.

  // Jest test; the require and the URL fixtures below are assumed setup.
  const robotsParser = require('robots-parser');

  const robotsUrl = 'https://example.com/robots.txt';
  const rootUrl = 'https://example.com/';

  it('should not confirm it can be indexed', async () => {
    const body = `<html></html>`;

    const robots = robotsParser(robotsUrl, body);
    const canBeIndexed = robots.isAllowed(rootUrl);

    expect(canBeIndexed).toBeFalsy();
  });

(this test fails, whereas it should pass; or better, it should throw, since both isDisallowed and isAllowed exist)

Did I miss something to check the robots.txt format?

Does it make sense to throw an error instead of allowing/disallowing something based on nothing?

Thank you,

EDIT: a workaround could be to check for any HTML inside the file... hoping the website does not return another format (JSON, raw text...). But it's a bit hacky, no?
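A minimal sketch of that check, assuming the response body has already been fetched as a string (the looksLikeHtml helper is just illustrative, not part of the library):

// Rough workaround sketch, not part of robots-parser: reject bodies that
// look like an HTML fallback page before handing them to the parser.
function looksLikeHtml(body) {
  const trimmed = body.trim().toLowerCase();
  return trimmed.startsWith('<!doctype html') || trimmed.startsWith('<html');
}

const body = '<html></html>';
if (looksLikeHtml(body)) {
  throw new Error('Response does not look like a robots.txt file');
}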

EDIT2: a point of view https://stackoverflow.com/a/31598530/3608410

samclarke commented 10 months ago

Thanks for reporting!

It's a bit counter-intuitive but I believe the behaviour of isAllowed() -> true for invalid robots.txt files is correct.

A robots.txt file is part of the Robots Exclusion Protocol. The default behaviour is to assume URLs are allowed unless specifically excluded.

As an invalid robots.txt file doesn't exclude anything, and the default behaviour is to assume allow, then everything should be allowed.

You're right that an invalid robots.txt file is a sign that something is misconfigured, but I don't think this library can assume misconfigured means disallow. If the file is empty or returns a 404, then nothing is excluded, so an invalid file shouldn't be treated differently.
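To make that concrete, here's a quick sketch using the library's public API (the URLs are placeholders): an empty body and an HTML body behave identically, because neither contains a valid Disallow rule.

const robotsParser = require('robots-parser');

// Placeholder URLs for illustration.
const robotsUrl = 'https://example.com/robots.txt';
const pageUrl = 'https://example.com/some/page';

// Neither body contains a valid Disallow rule, so nothing is excluded
// and both calls return true.
robotsParser(robotsUrl, '').isAllowed(pageUrl);              // true
robotsParser(robotsUrl, '<html></html>').isAllowed(pageUrl); // true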

The draft specification says invalid characters should be ignored, but it says nothing about the case where the whole file is invalid. However, Google's implementation does specify that, if given HTML, it will ignore the invalid lines, the same as this library.

I have a script that watches multiple robots.txt files from websites

Are you using the library to validate the robots.txt files? If so, an isValid() and/or a getInvalidLines() method could be added. Every robots.txt parser I'm aware of will ignore invalid lines, but it could be useful for website owners to check that nothing is misconfigured.
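For illustration, here's a rough sketch of what a getInvalidLines() / isValid() pair could look like as a standalone helper. Neither method exists in the library today, and the directive list and line-parsing rules below are simplified assumptions, not the parser's actual rules.

// Hypothetical helpers, not part of robots-parser.
const KNOWN_DIRECTIVES = ['user-agent', 'allow', 'disallow', 'sitemap', 'crawl-delay', 'host'];

function getInvalidLines(contents) {
  const invalid = [];

  contents.split(/\r\n|\r|\n/).forEach((line, index) => {
    const trimmed = line.trim();

    // Blank lines and comments are always valid.
    if (trimmed === '' || trimmed.startsWith('#')) {
      return;
    }

    // Anything else must look like "directive: value" with a known directive.
    const directive = trimmed.split(':')[0].trim().toLowerCase();
    if (!trimmed.includes(':') || !KNOWN_DIRECTIVES.includes(directive)) {
      invalid.push({ line: index + 1, text: line });
    }
  });

  return invalid;
}

function isValid(contents) {
  return getInvalidLines(contents).length === 0;
}

With something like that, '<html></html>' would report line 1 as invalid, while a normal robots.txt would come back clean.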

jasonbarry commented 2 months ago

Chiming in that I would find an isValid() method useful. I'm caching robots.txt values in a KV store, but sometimes servers incorrectly return HTML at /robots.txt (likely an SPA fallthrough to index.html as a 200 rather than 4xx). If I could easily detect whether the response was a true robots.txt format, it would save me $ by preventing an unnecessary KV write op.

For now I'm checking whether the first character is <; if so, it's likely not an actual robots.txt file, but that feels hacky.
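A rough sketch of that kind of guard (kvStore and its put() method are placeholders for whatever store is in use; a directive-based check like the isValid() idea above would be less fragile):

// Sketch only: skip the cache write when the body doesn't look like robots.txt.
async function cacheRobotsTxt(kvStore, url, body) {
  if (body.trim().startsWith('<')) {
    // Probably an HTML fallback page rather than a real robots.txt.
    return;
  }

  await kvStore.put(url, body); // placeholder KV API
}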