t1gor / Robots.txt-Parser-Class

PHP class for robots.txt parsing
MIT License

Byte limit #75

Open JanPetterMG opened 8 years ago

JanPetterMG commented 8 years ago

Feature request: Limit the maximum number of bytes to parse.

A maximum file size may be enforced per crawler. Content which is after the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).

Source: Google

When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything.

Source: Yandex
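
A minimal sketch of how such a cap could be applied before handing the content to the parser. The `RobotsTxtParser` class name follows this library's usual usage, but the `truncateRobotsTxt` helper and the 500 000-byte default are illustrative assumptions, not part of the library:

```php
<?php

// Hypothetical helper: cap robots.txt content at a byte limit before parsing.
// 500000 bytes mirrors Google's documented limit; Yandex uses 32 KB.
function truncateRobotsTxt(string $content, int $byteLimit = 500000): string
{
    if (strlen($content) <= $byteLimit) {
        return $content;
    }

    // Everything past the limit is ignored, as the Google documentation describes.
    return substr($content, 0, $byteLimit);
}

$content = file_get_contents('http://example.com/robots.txt');
$parser  = new RobotsTxtParser(truncateRobotsTxt($content));

var_dump($parser->isAllowed('/some/path'));
```

Note that a plain byte-level cut can split the last rule in half; a stricter variant would also drop the trailing partial line after truncating.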

JanPetterMG commented 8 years ago

At the moment, it's possible to serve arbitrarily large (fake or valid) robots.txt files with the aim of trapping the robots.txt crawler, slowing down the server, or even causing it to hang or crash.

It's also possible (depending on the setup) to trap the crawler in an infinite retry loop if the external code using this library doesn't handle repeated fatal errors correctly...

Related to #62
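
On the fetching side, one way to avoid being trapped by an oversized or never-ending response is to stop reading once a hard byte cap is reached. A sketch under that assumption; the function name, cap, and timeout are illustrative and not part of the library:

```php
<?php

// Sketch: fetch robots.txt but stop reading after a hard byte cap, so a
// malicious or misconfigured server can't stream unbounded data at the crawler.
// The 500000-byte cap and 10-second timeout are illustrative values.
function fetchRobotsTxtCapped(string $url, int $byteLimit = 500000): string
{
    $context = stream_context_create([
        'http' => ['timeout' => 10],
    ]);

    $handle = fopen($url, 'r', false, $context);
    if ($handle === false) {
        return ''; // Treat a failed fetch as an empty robots.txt.
    }

    // stream_get_contents stops after $byteLimit bytes,
    // even if the server keeps sending data.
    $content = stream_get_contents($handle, $byteLimit);
    fclose($handle);

    return $content === false ? '' : $content;
}
```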