muratgozel / robotstxt-util

RFC 9309 spec compliant robots.txt builder and parser. 🦾 No dependencies, fully typed.
MIT License

Buggy handling of additional information #2

Closed bart-turczynski closed 4 days ago

bart-turczynski commented 1 month ago

When a robots.txt file contains regular rules and sitemaps, everything works fine. However, issues arise when:

  1. The robots.txt file contains # comments. If a comment appears somewhere in the body, we get throw new Error('Each group or rule line must contain a colon.');. If there's a comment at the very top, we get throw new Error('Document must have at least one group starting with "user-agent" at the beginning.');. (I work around this by preprocessing files, but perhaps it would be pertinent to record the comments in the results object.)
  2. sitemap: works fine, and sitemaps get added to additional information. However, if the parser encounters host: or any other element (e.g., Clean-param: s /forum/index.php), the results aren't reliable. Whatever shows up last in the robots.txt file gets pushed to additional elements.
  3. crawl-delay: seems to be ignored.

I know these aren't part of regular robots.txt files, but (1) it's good to know the crawl-delay for specific bots (Google ignores it, but well-behaved bots will abide by it), (2) comments are allowed in general, often help explain what is going on in the file, and are ignored by any robot that doesn't recognize them, and (3) Yandex accepts clean-param values, so it would be good for the parser to report them (much like host or any other element). Technically, bots ignore elements they aren't familiar with, so I think it would be right to report on them: their presence doesn't break anything, and they are valuable to people validating their files. What do you think about this?
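
The behavior requested above could be sketched roughly as follows. This is a minimal illustration only, not robotstxt-util's actual API or implementation: the names parseRobotsTxt, ParseResult, additions, and comments are hypothetical, and the sample file contents are made up. It assumes comments start at # and run to the end of the line, that crawl-delay belongs to the current group, and that every unrecognized directive should be kept in order rather than overwritten.

```typescript
// Hypothetical sketch; none of these names come from robotstxt-util.
type RuleType = "allow" | "disallow";

interface Group {
  userAgents: string[];
  rules: { type: RuleType; path: string }[];
  crawlDelay?: number;
}

interface ParseResult {
  groups: Group[];
  sitemaps: string[];
  // Unrecognized directives (host, clean-param, ...), every occurrence kept.
  additions: { name: string; value: string }[];
  comments: { line: number; text: string }[];
}

function parseRobotsTxt(input: string): ParseResult {
  const result: ParseResult = { groups: [], sitemaps: [], additions: [], comments: [] };
  let current: Group | null = null;
  let lastWasUserAgent = false;
  const lines = input.split(/\r?\n/);

  for (let i = 0; i < lines.length; i++) {
    const raw = lines[i];
    // Record a full-line or trailing comment, then strip it from the line.
    const hash = raw.indexOf("#");
    if (hash !== -1) {
      result.comments.push({ line: i + 1, text: raw.slice(hash + 1).trim() });
    }
    const line = (hash === -1 ? raw : raw.slice(0, hash)).trim();
    if (line === "") continue;

    const colon = line.indexOf(":");
    if (colon === -1) continue; // tolerate malformed lines instead of throwing

    const name = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (name === "user-agent") {
      // Consecutive user-agent lines share one group; otherwise start a new one.
      if (current === null || !lastWasUserAgent) {
        current = { userAgents: [], rules: [] };
        result.groups.push(current);
      }
      current.userAgents.push(value);
      lastWasUserAgent = true;
      continue;
    }
    lastWasUserAgent = false;

    if (name === "allow" || name === "disallow") {
      const type: RuleType = name === "allow" ? "allow" : "disallow";
      if (current !== null) current.rules.push({ type, path: value });
    } else if (name === "crawl-delay") {
      const delay = Number(value);
      if (current !== null && value !== "" && Number.isFinite(delay)) {
        current.crawlDelay = delay;
      }
    } else if (name === "sitemap") {
      result.sitemaps.push(value);
    } else {
      // Append rather than overwrite, so earlier directives are not lost.
      result.additions.push({ name, value });
    }
  }
  return result;
}

// Made-up sample input exercising comments, crawl-delay, and extra directives.
const sample = [
  "# top-level comment",
  "User-agent: *",
  "Disallow: /private/ # keep bots out",
  "Crawl-delay: 10",
  "Host: example.com",
  "Clean-param: s /forum/index.php",
  "Sitemap: https://example.com/sitemap.xml",
].join("\n");

const parsed = parseRobotsTxt(sample);
console.log(JSON.stringify(parsed, null, 2));
```

Appending each unknown directive to a list, instead of storing a single value, is what keeps host: and Clean-param: from replacing one another. A real parser would also need to avoid cutting values that legitimately contain #, which this sketch does not attempt.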

muratgozel commented 4 days ago

:tada: This issue has been resolved in version 4.0.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

muratgozel commented 4 days ago

Hi @bart-turczynski, sorry for the delay. I just wanted to let you know that I have rewritten the library; it performs much better than the earlier version and includes fixes for the issues you mentioned. Thanks for bringing up all of those important issues. Hope it helps.

bart-turczynski commented 4 days ago

Thank you, @muratgozel! I love this library and I appreciate your fix and, more importantly, the fact that you coded the whole thing up to share with others. That's the spirit!