Open · sneko opened this issue 10 months ago
Thanks for reporting!
It's a bit counter-intuitive, but I believe the behaviour of isAllowed() -> true for invalid robots.txt files is correct.
A robots.txt file is part of the Robots Exclusion Protocol. The default behaviour is to assume URLs are allowed unless specifically excluded.
As an invalid robots.txt file doesn't exclude anything, and the default behaviour is to assume allow, everything should be allowed.
You're right that an invalid robots.txt file is a sign that something is misconfigured, but I don't think this library can assume misconfigured means disallow. If the file is empty or returns a 404, nothing is excluded, so an invalid file shouldn't be treated differently.
The draft specification says invalid characters should be ignored, but it says nothing about what to do if the whole file is invalid. However, Google's implementation does specify that if given HTML it will ignore the invalid lines, the same as this library does.
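As a minimal sketch of that behaviour (assuming the robots-parser npm package and its documented isAllowed() method), an empty file and an HTML payload end up being treated the same way, because neither contains a directive that excludes anything:

```js
const robotsParser = require('robots-parser');

const robotsUrl = 'https://example.com/robots.txt';
const page = 'https://example.com/some/page';

// An empty robots.txt excludes nothing, so everything is allowed.
const emptyRobots = robotsParser(robotsUrl, '');

// An HTML fallback page contains no valid directives; once the invalid
// lines are ignored it is effectively the same as an empty file.
const htmlRobots = robotsParser(robotsUrl, '<!DOCTYPE html><html><body>Not found</body></html>');

console.log(emptyRobots.isAllowed(page, 'example-bot')); // true
console.log(htmlRobots.isAllowed(page, 'example-bot'));  // true
```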
I have a script to watch multiple robots.txt from websites
Are you using the library to validate the robots.txt files? If so, an isValid() and/or a getInvalidLines() method could be added. Every robots.txt parser I'm aware of will ignore invalid lines, but it could be useful for website owners to check that nothing is misconfigured.
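Neither method exists in the library today; purely as a hypothetical sketch of what a getInvalidLines()/isValid() pair could report (the directive list below is an assumption, not something the parser exposes):

```js
// Hypothetical helper: report lines that are not blank, not comments,
// and do not start with a directive a robots.txt parser would recognise.
const KNOWN_DIRECTIVES = /^(user-agent|allow|disallow|sitemap|crawl-delay|host)\s*:/i;

function getInvalidLines(content) {
  return content
    .split(/\r\n|\r|\n/)
    .map((text, index) => ({ text: text.trim(), line: index + 1 }))
    .filter(({ text }) => text !== '' && !text.startsWith('#') && !KNOWN_DIRECTIVES.test(text));
}

function isValid(content) {
  return getInvalidLines(content).length === 0;
}

console.log(isValid('User-agent: *\nDisallow: /private/')); // true
console.log(getInvalidLines('<!DOCTYPE html>\n<html></html>')); // both lines reported as invalid
```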
Chiming in that I would find an isValid() method useful. I'm caching robots.txt values in a KV store, but sometimes servers incorrectly return HTML at /robots.txt (likely an SPA falling through to index.html with a 200 rather than a 4xx). If I could easily detect whether the response was a true robots.txt file, it would save me $ by preventing an unnecessary KV write op. For now I'm checking whether the first character is <; if it is, the response is likely not an actual robots.txt file, but that feels hacky.
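A slightly less fragile version of that check, as a sketch only (the Content-Type test, the directive regex, and the helper names are assumptions, and the global fetch used here requires Node 18+ or a comparable runtime): skip the KV write unless the response avoids text/html and contains at least one recognisable directive.

```js
// Decide whether a fetched /robots.txt response is worth caching.
const DIRECTIVE = /^\s*(user-agent|allow|disallow|sitemap|crawl-delay|host)\s*:/im;

function looksLikeRobotsTxt(contentType, body) {
  // SPA fallbacks to index.html usually come back as text/html.
  if (contentType && contentType.toLowerCase().includes('text/html')) {
    return false;
  }
  // Require at least one line that looks like a real directive.
  return DIRECTIVE.test(body);
}

async function fetchRobotsForCache(origin) {
  const response = await fetch(new URL('/robots.txt', origin));
  const body = await response.text();
  if (!response.ok || !looksLikeRobotsTxt(response.headers.get('content-type'), body)) {
    return null; // skip the KV write instead of caching an HTML page
  }
  return body;
}
```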
Hi @samclarke,
I have a script to watch multiple robots.txt files from websites, but in some cases they have none and still display fallback content. The issue is that your library will return isAllowed() -> true even if HTML code is passed. (This test will fail, whereas it should pass, or better, it should throw, since there are both isDisallowed and isAllowed.)

Did I miss something to check the robots.txt format?
Does it make sense to throw an error instead of allowing/disallowing something based on nothing?
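For reference, a minimal reproduction of the behaviour being described, assuming the robots-parser package; the comments show what the library currently returns, while the issue asks for isDisallowed() to be true here, or for the parser to throw:

```js
const robotsParser = require('robots-parser');

// An SPA fallback page served with a 200 at /robots.txt instead of a real robots.txt file.
const htmlFallback = '<!DOCTYPE html><html><head><title>App</title></head><body></body></html>';

const robots = robotsParser('https://example.com/robots.txt', htmlFallback);

// Current behaviour: the invalid lines are ignored, nothing is excluded,
// and no error is thrown.
console.log(robots.isAllowed('https://example.com/admin'));    // true
console.log(robots.isDisallowed('https://example.com/admin')); // false
```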
Thank you,
EDIT: a workaround could be to check whether there is any HTML inside the file... hoping the website does not return another format (JSON, raw text...). But it's a bit hacky, no?
EDIT2: a point of view https://stackoverflow.com/a/31598530/3608410