Closed DoryGuy closed 2 years ago
Hi. Thank you for your interest and contribution.
JSON serialization looks like it could be useful in a debugging scenario. Could you share the big picture of how you use it?
The (best until official) standard http://www.robotstxt.org/orig.html says:
> The file consists of one or more records separated by one or more blank lines. The record starts with one or more User-agent lines, followed by one or more Disallow lines.
So "disallow before user-agent" is an invalid robots.txt file, but I agree with the logic here. Maybe an extended compatibility mode is in order. Or you can just prepend a `user-agent: *\n` string before every robots.txt content.
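A minimal sketch of that prepend idea, under the assumption that you only want to add the default group when no `User-agent` line exists at all (`withDefaultAgent` is a hypothetical helper name, not part of this library):

```go
package main

import (
	"fmt"
	"strings"
)

// withDefaultAgent prepends "user-agent: *" when the content contains no
// User-agent line, so group-less Disallow/Allow lines still form a valid
// record that applies to every agent.
func withDefaultAgent(content string) string {
	if strings.Contains(strings.ToLower(content), "user-agent:") {
		return content
	}
	return "user-agent: *\n" + content
}

func main() {
	fmt.Println(withDefaultAgent("Disallow: /private"))
}
```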
Robotstxt content inside HTML is also invalid and should not be parsed.
Please run `go fmt` in a pre-commit hook.
You expect JSON parsing to be faster or why second storage format?
> You expect JSON parsing to be faster or why second storage format?
Yes, in general it's way faster to parse JSON than a large robots.txt file.
@DoryGuy would you please share links to a few large robots.txt files? I'd include them in the benchmark tests.
> @DoryGuy would you please share links to a few large robots.txt files? I'd include them in the benchmark tests.
https://www.allflyingjobs.com/robots.txt is a good one. Also:
https://www.salonsdirect.com/robots.txt
https://qgear.es/robots.txt
https://www.officedepot.com/robots.txt
I added unit tests for all the cases where files I thought should pass didn't, and then fixed the code so they would. I don't need a robots.txt checker; I can't fix the errors in other folks' files.
The speed advantage of saving the data as JSON vs the raw file is that there's no need to run regexes (expensive).
> `FromString("<!DOCTYPE html>`
This test must fail; there is no HTML parsing in this project.
> The speed advantage of saving the data as JSON vs the raw file is that there's no need to run regexes (expensive).
But we only run regex comparisons against the URL path after robots.txt is parsed. We don't need regex to parse robots.txt because of the custom scanner.
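A sketch of what scanner-based, regex-free parsing can look like: read line by line, strip comments, and split each directive on the first colon. `parseDirective` is a hypothetical helper, not this library's actual scanner:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDirective splits one robots.txt line into key and value without any
// regex: drop the comment tail, trim whitespace, split on the first colon.
func parseDirective(line string) (key, value string, ok bool) {
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	line = strings.TrimSpace(line)
	i := strings.IndexByte(line, ':')
	if i < 0 {
		return "", "", false
	}
	key = strings.ToLower(strings.TrimSpace(line[:i]))
	value = strings.TrimSpace(line[i+1:])
	return key, value, true
}

func main() {
	input := "User-agent: *\nDisallow: /tmp # trailing comment\n"
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		if k, v, ok := parseDirective(sc.Text()); ok {
			fmt.Println(k, "=>", v)
		}
	}
}
```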
> `FromString("<!DOCTYPE html>`
> This test must fail; there is no HTML parsing in this project.
It doesn't fail for us, because we toss ALL HTML tags with the code mods I added. Some of the stupid robots.txt files have HTML crap in them, followed by valid robots.txt. It's like they used an HTML editor which embedded this stuff in the file, and the person saving it didn't realize, care, or know.
For us, it's more important to just toss stuff that's obviously garbage, and then try to parse the rest, if any.
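The salvage pass being described could look something like this. It's a deliberately crude sketch: regex cannot parse HTML in general, and `stripTags` (a hypothetical name) just deletes anything tag-shaped so the remaining directives can be parsed:

```go
package main

import (
	"fmt"
	"regexp"
)

// htmlTag matches anything that looks like an HTML tag. This is not an HTML
// parser; it is only a best-effort cleanup for robots.txt files that were
// polluted with editor markup.
var htmlTag = regexp.MustCompile(`<[^>]*>`)

func stripTags(s string) string {
	return htmlTag.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(stripTags("<!DOCTYPE html><p>Disallow: /secret</p>"))
}
```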
> The speed advantage of saving the data as JSON vs the raw file is that there's no need to run regexes (expensive).
> But we only run regex comparisons against the URL path after robots.txt is parsed. We don't need regex to parse robots.txt because of the custom scanner.
You are confusing where I added the regexes. I use them to remove the garbage from the file, i.e. `Allow : /.` (does not pass your parser) will be regexed into `Allow: /` (does pass). And AFAIK, there's no reason not to do this.
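The normalization being argued about could be a single line-level pre-pass, for example one regex that fixes stray whitespace around the directive colon. `normalizeLine` and the pattern are illustrative assumptions, not the PR's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// directive matches a known robots.txt directive with arbitrary whitespace
// around the colon, e.g. "Allow : /" or "Disallow:/tmp".
var directive = regexp.MustCompile(`(?i)^\s*(user-agent|allow|disallow)\s*:\s*`)

// normalizeLine rewrites the matched prefix into the canonical "Key: " form
// while keeping the original casing of the directive name.
func normalizeLine(line string) string {
	return directive.ReplaceAllString(line, "${1}: ")
}

func main() {
	fmt.Println(normalizeLine("Allow : /."))
}
```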
Oh yeah, if the two pull requests are an issue, just take the other one. It has all of these changes, plus the rest that make the whole project better.
> Allow : /.
enables access to paths starting with a dot, like `/.well-known/`. How does it not pass?
> You are confusing where I added the regexes. I use them to remove the garbage from the file.
You can skip garbage while parsing; that removes the runtime regex, which solves the speed concern and removes the need for JSON. Right?
> Oh yeah, if the two pull requests are an issue, just take the other one. It has all of these changes, plus the rest that make the whole project better.
The issue is the unbounded scope of a single pull request. One PR must solve exactly one small issue/task.
I'm happy to work on any reasonable way to ease useful pre-conditioning of input, but it must be on an opt-in basis. Similar code may be posted in the README as common practice:
```go
var fixmeDontUseRegexToParseHTML = regexp.MustCompile(`...`)

func RobotsFromWild(b []byte) (*robotstxt.RobotsData, error) {
	b = fixmeDontUseRegexToParseHTML.ReplaceAll(b, []byte("$1"))
	b = append([]byte("user-agent: *\n"), b...)
	return robotstxt.FromBytes(b)
}
```
or maybe hooks:
```go
robotstxt.AddContentHook(myStripHTML)
robotstxt.AddContentHook(myAddDefaultAgent)
r, err := robotstxt.FromBytes/String/Response(input)
```
or maybe extended options:
```go
r, err := robotstxt.From({
	Bytes/String/Response: input,
	Option: robotstxt.OptStripHTML | robotstxt.OptDefaultAllAgents,
})
```
Ok, I'm withdrawing this pull request. We'll just fork and go with it. Thanks for the baseline.
Hi. We use the JSON to store the parsed version of the data for multiple sources that look at robots.txt, rather than parsing every time.
Many users of robots.txt don't appear to read the standard. Our goal is to try to do what they intended; i.e., no User-agent line, therefore they intended the rule to apply to all user agents.
I ran `go fmt` on these changes.