Opened by ghost, 6 years ago
This would theoretically allow invalid URLs.
Your current approach isn't better. It's not like there would be a buffer overflow if there's an invalid URL. OK, I'll try to make something better.
Here, I've fixed it.
Do you have example URLs that currently don't pass the validation?
Why don't you want to merge it? `FILTER_VALIDATE_URL` checks URLs strictly against RFC 2396; I doubt you can achieve that level of validation with a regex. It's probably also much faster, and it's much easier to maintain and read than the regex you use. This will also fix https://github.com/milesj/decoda/issues/129 and https://github.com/milesj/decoda/pull/130 (with bracket support).
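For context, the core idea looks roughly like this; a minimal sketch, not the PR's exact code, and the sample URLs are only illustrative:

```php
<?php
// filter_var() with FILTER_VALIDATE_URL returns the URL on success and false on
// failure, so validation becomes a single built-in call per candidate string.
$candidates = [
    'http://example.com/path?foo=bar#frag', // valid
    'http://exa mple.com',                  // invalid (contains a space)
    'not a url',                            // invalid (no scheme or host)
];

foreach ($candidates as $candidate) {
    $isValid = filter_var($candidate, FILTER_VALIDATE_URL) !== false;
    printf("%-40s %s\n", $candidate, $isValid ? 'valid' : 'invalid');
}
```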
I have to agree with @Hyleus here, @milesj. This is a much, much better approach.
My only issue with the implementation is the massive O(N) loop at the beginning, which traverses the entire content char by char. This hurts performance on large posts and/or when there are multiple Decoda instances on the page.
I think a better approach would be to search for the `$split_char`s first, and then iterate based on the found indices. Something like Decoda's lexer: https://github.com/milesj/decoda/blob/master/src/Decoda/Decoda.php#L1408
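Roughly what that suggestion could look like; the split-character list and helper name here are made up for illustration and are not Decoda's actual code:

```php
<?php
// Hedged sketch: scan for each split character with mb_strpos() (one native scan
// per split character) instead of inspecting every character in user-land code.
function findSplitOffsets(string $content, array $splitChars = [' ', "\n", "\t"]): array
{
    $offsets = [];

    foreach ($splitChars as $char) {
        $pos = 0;
        while (($pos = mb_strpos($content, $char, $pos)) !== false) {
            $offsets[] = $pos;
            $pos++;
        }
    }

    sort($offsets);

    // The URL matcher could then slice the content between consecutive offsets
    // and run filter_var(..., FILTER_VALIDATE_URL) on each chunk.
    return $offsets;
}
```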
But isn't `mb_strpos` O(N) as well for 1-char strings?
Yes but native code > user-land code in most situations IMO.
Running `time phpunit`:

improve-url-matcher:

    real    0m0.191s
    user    0m0.159s
    sys     0m0.031s

master:

    real    0m0.192s
    user    0m0.151s
    sys     0m0.039s
So, as you can see, there's no big difference here; run it multiple times and it'll probably average out the same.
You're testing a best/perfect case, which is irrelevant to Big-O. Let me put it this way: your current loop would do a `strpos()` lookup for every single character in the content, so if there are 10,000 characters, that's 10,000 `strpos()` lookups. If you do the inverse, it's multitudes fewer lookups. The big hurdle is space detection, which you can probably solve with split + splice.
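The split + splice idea, roughly; the pattern and sample content are illustrative only and not the PR's code:

```php
<?php
// One preg_split() pass over the content, then one filter_var() check per token,
// instead of a strpos() lookup per character. Offsets are kept so matches could
// later be spliced back into the original content.
$content = 'See http://example.com/docs?page=1 and ftp://files.example.org for details.';

$tokens = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);

foreach ($tokens as [$token, $offset]) {
    if (filter_var($token, FILTER_VALIDATE_URL) !== false) {
        printf("URL %s found at byte offset %d\n", $token, $offset);
    }
}
```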
Yeah, but we need to search for multiple split chars, so if we determine the split chars first, it would end up with the same performance, because we would search for every split char in the string no matter what. No, the number of lookups doesn't change; we would still run it 5 times for the split chars, because there are 5 split chars.
True, what about `preg_match` + `PREG_OFFSET_CAPTURE`?
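Something like this, perhaps; the candidate pattern is a deliberately loose stand-in, not a proposal for the final regex:

```php
<?php
// Let PCRE find candidate spans (any non-whitespace run containing "://") together
// with their byte offsets, then validate each candidate with filter_var().
$content = 'Links: http://example.com/a/b and https://example.org/?q=1 here.';

preg_match_all('#\S+://\S+#', $content, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$candidate, $offset]) {
    if (filter_var($candidate, FILTER_VALIDATE_URL) !== false) {
        printf("candidate %s at byte offset %d is a valid URL\n", $candidate, $offset);
    }
}
```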
I think the regex would still work the same way as my loop internally. Or do you think there would be a difference?
My only issue is that we're looping over the entire content twice now, once for the initial lexer, and again for this hook. That's quite a change.
I agree that this is a better way of finding URLs, I'm just worried about the perf changes. Ideally this would hook into the lexer somehow but that's probably more work than necessary.
Could this be a feature flag/option to turn on if you want it? Then people could decide on their own if it is worth sacrificing a bit of runtime in favor of more correct parsing? Just an idea.
@dereuromark

> Could this be a feature flag/option to turn on if you want it?

Yes, that could be a solution.
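If that route is taken, usage could look roughly like this. To be clear, the `strictUrlValidation` option is hypothetical (it doesn't exist in Decoda); this only assumes the hook accepts a config array like other Decoda components:

```php
<?php
use Decoda\Decoda;
use Decoda\Hook\ClickableHook;

$code = new Decoda('Check http://example.com/path for details');
$code->defaults();

// Hypothetical flag: opt in to the filter_var()-based matcher only when strict
// parsing is worth the extra pass over the content.
$code->addHook(new ClickableHook(['strictUrlValidation' => true]));

echo $code->parse();
```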
Firstly, there are no test cases on this PR. The first step is to add a failing test, then make the test pass, then measure the performance impact: https://github.com/milesj/decoda/pull/136#issuecomment-353703208
Secondly, there may be a better and more elegant way to achieve the same thing, but the most important thing is missing here: a way to reproduce the issue. So it needs at least one day of focus on it.
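For reference, such a failing test might look roughly like this; the test class name and the expected anchor markup are assumptions (PHPUnit 8+ assertion) and should be aligned with the existing suite:

```php
<?php
use Decoda\Decoda;
use Decoda\Hook\ClickableHook;
use PHPUnit\Framework\TestCase;

class UrlMatcherTest extends TestCase
{
    public function testFullUrlWithPathAndQueryIsLinked(): void
    {
        $code = new Decoda('Read http://example.com/docs/page?id=1 now');
        $code->addHook(new ClickableHook());

        // Expected anchor format is assumed; adjust to ClickableHook's real output.
        $this->assertStringContainsString(
            '<a href="http://example.com/docs/page?id=1">',
            $code->parse()
        );
    }
}
```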
Both issues mentioned are already fixed.
This will also fix #129 and #130 (with bracket support).
This will detect full URLs; currently, it seems only the first part up to the slash gets matched by the current regex.
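To illustrate the symptom with a deliberately simplified pattern (this is not Decoda's actual regex, just a stand-in that truncates at the path):

```php
<?php
$content = 'http://example.com/docs/page?id=1';

// A naive pattern that stops at the first slash after the host:
preg_match('#https?://[\w.-]+#', $content, $naive);

// The filter_var() approach validates (and keeps) the whole URL:
$full = filter_var($content, FILTER_VALIDATE_URL);

var_dump($naive[0]); // string(18) "http://example.com"
var_dump($full);     // string(33) "http://example.com/docs/page?id=1"
```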