milesj / decoda

A lightweight lexical string parser for BBCode styled markup.
MIT License

Add better URL detection in ClickableHook #136

Open ghost opened 6 years ago

ghost commented 6 years ago

This will detect full URLs; currently, it seems only the first part up to the slash gets matched by the current regex.

milesj commented 6 years ago

This would theoretically allow invalid URLs.

ghost commented 6 years ago

Your current approach isn't better. It's not like there would be a buffer overflow if there's an invalid URL. OK, I'll try to make something better.

ghost commented 6 years ago

Here, I've fixed it.

milesj commented 6 years ago

Do you have example URLs that currently don't pass the validation?

ghost commented 6 years ago

Why don't you want to merge it? FILTER_VALIDATE_URL checks URLs strictly according to RFC 2396; I doubt you can achieve that level of validation with a regex. It's probably also much faster, and it's much easier to maintain and read than the regex you use. This will also fix https://github.com/milesj/decoda/issues/129 and https://github.com/milesj/decoda/pull/130 (with bracket support).
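For reference, a minimal sketch of the filter_var() approach under discussion (the isValidUrl() helper and sample inputs are hypothetical, not the PR's code):

```php
<?php
// Hypothetical helper illustrating filter_var() based validation,
// as opposed to validating candidate URLs with a hand-written regex.
function isValidUrl(string $candidate): bool
{
    // FILTER_VALIDATE_URL applies PHP's built-in (RFC 2396 based) URL check.
    return filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

var_dump(isValidUrl('https://example.com/path/to/page?query=1')); // bool(true)
var_dump(isValidUrl('not a url'));                                // bool(false)
```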

HDVinnie commented 6 years ago

I have to agree with @Hyleus here, @milesj. This is a much, much better approach.

milesj commented 6 years ago

My only issue with the implementation is the massive O(N) loop at the beginning, which traverses the entire content character by character. This hurts performance on large posts and/or when there are multiple Decoda instances on the page.

I think a better approach would be to search for $split_chars first, and then iterate based on the found indices. Something like Decoda's lexer: https://github.com/milesj/decoda/blob/master/src/Decoda/Decoda.php#L1408
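A rough sketch of that index-first idea, loosely modeled on the linked lexer approach (the $split_chars list and the sample content here are assumptions, not the actual hook code):

```php
<?php
// Sketch: instead of inspecting every character of $content, collect the
// offsets of each split character up front, then iterate only over those.
$split_chars = [' ', "\n", "\t", '(', ')'];
$content = "Visit http://example.com/page (see the docs) for details";

$offsets = [];

foreach ($split_chars as $char) {
    $offset = 0;

    // Each mb_strpos() call is still linear, but the user-land loop only
    // runs once per split character actually found in the content.
    while (($pos = mb_strpos($content, $char, $offset)) !== false) {
        $offsets[] = $pos;
        $offset = $pos + 1;
    }
}

sort($offsets);

// $offsets now marks the boundaries between which URL candidates would be
// extracted and validated.
print_r($offsets);
```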

ghost commented 6 years ago

But isn't mb_strpos O(N) as well for 1-char strings?

milesj commented 6 years ago

Yes but native code > user-land code in most situations IMO.

ghost commented 6 years ago

Running time phpunit:

```
improve-url-matcher:
real    0m0.191s
user    0m0.159s
sys     0m0.031s

master:
real    0m0.192s
user    0m0.151s
sys     0m0.039s
```

So you see there's no big difference here; run it multiple times and it'll probably average out the same.

milesj commented 6 years ago

You're testing in a best/perfect case, which is irrelevant to Big-O. Let me put it this way: your current loop would do a strpos() lookup for every single character in the content, so if there are 10,000 characters, that's 10,000 strpos() lookups. If you do the inverse, it's many times fewer lookups. The big hurdle is space detection, which you can probably solve with split + splice.
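One way to read the split + splice suggestion, purely as an illustrative sketch rather than the PR's implementation (HTML escaping is omitted for brevity):

```php
<?php
// Split the content on whitespace once, validate each token, and only
// rebuild the string where a URL is found.
$content = 'Check http://example.com/docs/page for the full documentation';

// PREG_SPLIT_DELIM_CAPTURE keeps the whitespace tokens, so implode()
// restores the original spacing exactly.
$tokens = preg_split('/(\s+)/', $content, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($tokens as $i => $token) {
    if (filter_var($token, FILTER_VALIDATE_URL) !== false) {
        $tokens[$i] = '<a href="' . $token . '">' . $token . '</a>';
    }
}

echo implode('', $tokens), "\n";
```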

ghost commented 6 years ago

Yeah, but we need to search for multiple split chars, so in the end, if we determine the split_chars positions first, it would have the same performance, because we would have to search the whole string for every split char anyway. No, the number of lookups doesn't change: we would still run it 5 times for the split_chars, because there are 5 split chars.

milesj commented 6 years ago

True, what about preg_match + PREG_OFFSET_CAPTURE?
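A hedged sketch of that idea, using preg_match_all() with PREG_OFFSET_CAPTURE so every candidate token and its byte offset comes back from a single native pass (the pattern and sample content are illustrative only):

```php
<?php
// Let PCRE find the candidate tokens in one native pass and return
// their offsets instead of scanning in user-land code.
$content = "First http://example.com/a then\nhttp://example.com/b?x=1";

preg_match_all('/\S+/', $content, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$candidate, $offset]) {
    if (filter_var($candidate, FILTER_VALIDATE_URL) !== false) {
        // Note: these are byte offsets, which matters for multi-byte content.
        echo "URL '{$candidate}' found at byte {$offset}\n";
    }
}
```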

ghost commented 6 years ago

I think the regex would still work the same way as my loop internally. Or do you think there would be a difference?

milesj commented 6 years ago

My only issue is that we're now looping over the entire content twice: once for the initial lexer, and again for this hook. That's quite a change.

I agree that this is a better way of finding URLs; I'm just worried about the perf changes. Ideally this would hook into the lexer somehow, but that's probably more work than necessary.

dereuromark commented 4 years ago

Could this be a feature flag/option to turn on if you want it? Then people could decide on their own whether it is worth sacrificing a bit of runtime in favor of more correct parsing. Just an idea.

alquerci commented 4 years ago

@dereuromark

Could this be a feature flag/option to turn on if you want it?

Yes, it can be a solution.


Firstly, there are no test cases on this PR. The first step is to add a failing test, then make the test pass, then test the performance impact: https://github.com/milesj/decoda/pull/136#issuecomment-353703208

Secondly, there may be a better and more elegant way to achieve the same thing, but the most important thing is missing here: a way to reproduce the issue. So it needs at least one day of focus.

This will also fix #129 and #130 (with bracket support).

Both issues mentioned are already fixed.
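As a starting point for the failing test suggested above, a hedged sketch: the setup mirrors typical Decoda usage with addHook(), the class and test names are hypothetical, and the expected markup would need to match ClickableHook's real anchor output.

```php
<?php
// Sketch of a failing test for the reported behaviour: a URL with a path
// should be linked in its entirety, not cut off at the first slash.
use Decoda\Decoda;
use Decoda\Hook\ClickableHook;
use PHPUnit\Framework\TestCase;

class ClickableHookUrlTest extends TestCase
{
    public function testFullUrlWithPathIsLinked(): void
    {
        $code = new Decoda('Read http://example.com/docs/page.html first');
        $code->addHook(new ClickableHook());

        $this->assertStringContainsString(
            'href="http://example.com/docs/page.html"',
            $code->parse()
        );
    }
}
```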