temoto / robotstxt

The robots.txt exclusion protocol implementation for Go language
MIT License

Fix defaults for valid files #30

Closed: DoryGuy closed this pull request 2 years ago

DoryGuy commented 3 years ago

Hi. We use JSON to store the parsed version of the data for multiple sources that consult robots.txt, rather than re-parsing it every time.

Many authors of robots.txt files don't appear to read the standard. Our goal is to try to do what they intended; e.g. if there is no User-agent line, they intended the rule to apply to all user agents.

I ran go fmt on these changes.

temoto commented 3 years ago

Hi. Thank you for your interest and contribution.

JSON serialization looks useful in a debugging scenario. Could you share the big picture of how you use it?

The (best until official) standard http://www.robotstxt.org/orig.html says:

> The file consists of one or more records separated by one or more blank lines. The record starts with one or more User-agent lines, followed by one or more Disallow lines.

So "disallow before user-agent" is invalid robots.txt file, but I agree with logic here. Maybe extended compatibility mode is in order. Or you can just prepend user-agent: *\n string before every robotstxt content. Robotstxt content inside HTML is also invalid and should not be parsed.

Please run go fmt in a pre-commit hook.

temoto commented 2 years ago

Do you expect JSON parsing to be faster, or why the second storage format?

DoryGuy commented 2 years ago

> Do you expect JSON parsing to be faster, or why the second storage format?

Yes, in general it's much faster to parse JSON than a large robots.txt file.
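
To give a rough idea of the workflow being described, here is a sketch of caching parsed rules as JSON. The cachedRules type is hypothetical and is not something this library exposes; in practice the parsed data would come from the fork's changes.

import (
    "encoding/json"
    "os"
)

// cachedRules is a hypothetical record of already-parsed directives,
// stored as JSON so later runs can unmarshal it instead of re-parsing
// the raw robots.txt.
type cachedRules struct {
    Disallow map[string][]string `json:"disallow"` // user agent -> disallowed path prefixes
}

func saveCache(path string, rules cachedRules) error {
    b, err := json.Marshal(rules)
    if err != nil {
        return err
    }
    return os.WriteFile(path, b, 0o644)
}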

temoto commented 2 years ago

@DoryGuy would you please share links to a few large robots.txt files? I'd include them in the benchmark tests.

DoryGuy commented 2 years ago

> @DoryGuy would you please share links to a few large robots.txt files? I'd include them in the benchmark tests.

https://www.allflyingjobs.com/robots.txt is a good one. Also:
https://www.salonsdirect.com/robots.txt
https://qgear.es/robots.txt
https://www.officedepot.com/robots.txt

I added unit tests for each reason that files I thought should pass didn't, and then fixed the code so they would. I don't need a robots.txt checker; I can't fix the errors in other folks' files.

The speed advantage of saving the data as JSON vs. the raw file is that there is no need to run regexes (which are expensive).

temoto commented 2 years ago

The FromString("<!DOCTYPE html> ...") test must fail; there is no HTML parsing in this project.

temoto commented 2 years ago

> The speed advantage of saving the data as JSON vs. the raw file is that there is no need to run regexes (which are expensive).

But we only run regex comparisons against the URL path after robots.txt is parsed. We don't need regexes to parse robots.txt because of the custom scanner.
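
For reference, a minimal usage sketch of the parse-once, test-many pattern with this library (the agent name and paths are just examples):

import (
    "fmt"

    "github.com/temoto/robotstxt"
)

func checkPaths(robotsBody string) error {
    // Parse once; no regexes are involved at this stage.
    robots, err := robotstxt.FromString(robotsBody)
    if err != nil {
        return err
    }
    // Path matching happens here, against the already-parsed rules.
    for _, p := range []string{"/", "/admin", "/search?q=go"} {
        fmt.Println(p, robots.TestAgent(p, "MyBot"))
    }
    return nil
}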

DoryGuy commented 2 years ago

> The FromString("<!DOCTYPE html> ...") test must fail; there is no HTML parsing in this project.

It doesn't fail for us, because we toss ALL HTML tags with the code mods I added. Some robots.txt files have HTML junk in them, followed by valid robots.txt content. It's as if they used an HTML editor that embedded this stuff in the file, and the person saving it didn't realize, care, or know.

For us, it's more important to just toss anything that's obviously garbage and then try to parse whatever remains.

DoryGuy commented 2 years ago

> The speed advantage of saving the data as JSON vs. the raw file is that there is no need to run regexes (which are expensive).

> But we only run regex comparisons against the URL path after robots.txt is parsed. We don't need regexes to parse robots.txt because of the custom scanner.

You are confusing where I added the regexes. I use them to remove the garbage from the file, e.g. Allow : /. (does not pass your parser)

Will be regexed into Allow: / (which does pass).

And AFAIK, there's no reason not to do this.
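
For illustration, that kind of normalization could look roughly like the sketch below. The pattern and wrapper are examples, not the exact regexes from the pull request.

import "regexp"

// directiveColon collapses stray whitespace around the colon in directive
// lines, so "Allow : /" becomes "Allow: /".
var directiveColon = regexp.MustCompile(`(?mi)^\s*(user-agent|allow|disallow|sitemap|crawl-delay)\s*:\s*`)

func normalizeDirectives(body string) string {
    return directiveColon.ReplaceAllString(body, "${1}: ")
}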

DoryGuy commented 2 years ago

Oh yeah, if the two pull requests are an issue, just take the other one. It has all of these changes, plus the rest that make the whole project better.

temoto commented 2 years ago

Allow : /. enables access to paths starting with a dot, like /.well-known/. How does it not pass?

temoto commented 2 years ago

> You are confusing where I added the regexes. I use them to remove the garbage from the file.

You can skip garbage while parsing; that removes the runtime regexes, which solves the speed concern and removes the need for JSON. Right?

temoto commented 2 years ago

> Oh yeah, if the two pull requests are an issue, just take the other one. It has all of these changes, plus the rest that make the whole project better.

The issue is the unbounded scope of a single pull request. One PR must solve exactly one small issue/task.

I'm happy to work on any reasonable way to ease useful pre-conditioning of input, but it must be on an opt-in basis. Similar code could be posted in the readme as common practice:

var fixmeDontUseRegexToParseHTML = regexp.MustCompile(`...`)

func RobotsFromWild(b []byte) (*robotstxt.RobotsData, error) {
    // Strip HTML-ish garbage before parsing.
    b = fixmeDontUseRegexToParseHTML.ReplaceAll(b, []byte("$1"))
    // Treat rules before any User-agent line as applying to all agents.
    b = append([]byte("user-agent: *\n"), b...)
    return robotstxt.FromBytes(b)
}

Or maybe hooks:

robotstxt.AddContentHook(myStripHTML)
robotstxt.AddContentHook(myAddDefaultAgent)
r, err := robotstxt.FromBytes/String/Response(input)

Or maybe extended options:

r, err := robotstxt.From({
  Bytes/String/Response: input,
  Option: robotstxt.OptStripHTML | robotstxt.OptDefaultAllAgents,
})

DoryGuy commented 2 years ago

Ok, I'm withdrawing this pull request. We'll just fork and go with it. Thanks for the baseline.