java.net.URLDecoder doesn't follow RFC 2396, should use java.net.URI instead

lz233 / Tarnhelm

The magic to clean sharing links up.

https://tarnhelm.project.ac.cn

GNU General Public License v3.0

426 stars 17 forks source link

java.net.URLDecoder doesn't follow RFC 2396, should use java.net.URI instead #46

Closed CoelacanthusHex closed 7 months ago

CoelacanthusHex commented 7 months ago

https://github.com/lz233/Tarnhelm/compare/20231120...20240111

lz233 commented 7 months ago

Please elaborate

CoelacanthusHex commented 7 months ago

Release Note and apk are not aligned with git tag. Hope you can release 1.6.2 to include these changes in latest tag. Without "Fix: Failed to parse Telegram message" (build from tag such as installed from f-droid), when copying the invitation link from Telegram, "+" will be replaced by "\n".

lz233 commented 7 months ago

I guess this issue is triggered by https://github.com/lz233/Tarnhelm/blob/78a44b1f0c05dbae9e7d1414aa204036f465575f/app/src/main/java/cn/ac/lz233/tarnhelm/util/ktx/String.kt#L98

which input https://t.me/+xxxxx and output https://t.me/ xxxxx.

But consider a more complex condition, some Chinese apps insert Chinese characters in URLs, like https://y.music.xxx.com/m/song?name=歌曲名称 and http://xxxlink.com/xxxxxx，复制本条信息，打开【xxx】App查看精彩内容！. Without .decodeURL(), may lead to unexpected behaviours.

Recognition of Chinese characters in URLs is also not a bad idea, but it's too targeted and doesn't make me feel very elegant.

CoelacanthusHex commented 7 months ago

I guess this issue is triggered by

https://github.com/lz233/Tarnhelm/blob/78a44b1f0c05dbae9e7d1414aa204036f465575f/app/src/main/java/cn/ac/lz233/tarnhelm/util/ktx/String.kt#L98

which input https://t.me/+xxxxx and output https://t.me/ xxxxx.

But consider a more complex condition, some Chinese apps insert Chinese characters in URLs, like https://y.music.xxx.com/m/song?name=歌曲名称 and http://xxxlink.com/xxxxxx，复制本条信息，打开【xxx】App查看精彩内容！. Without .decodeURL(), may lead to unexpected behaviours.

Recognition of Chinese characters in URLs is also not a bad idea, but it's too targeted and doesn't make me feel very elegant.

I have no idea how to avoid recognition of Chinese characters, but I think using some Unicode script or block match like \p{Han} instead of a batch of Unicode range may be more elegant.

https://github.com/lz233/Tarnhelm/blob/78a44b1f0c05dbae9e7d1414aa204036f465575f/app/src/main/java/cn/ac/lz233/tarnhelm/util/ktx/String.kt#L133

CoelacanthusHex commented 7 months ago

I guess this issue is triggered by

https://github.com/lz233/Tarnhelm/blob/78a44b1f0c05dbae9e7d1414aa204036f465575f/app/src/main/java/cn/ac/lz233/tarnhelm/util/ktx/String.kt#L98

which input https://t.me/+xxxxx and output https://t.me/ xxxxx.

I wonder why decodeURL will replace "+" with " ". I read the RFC 3986, which said reserved characters (include "+") won't be touched when encode/decode.

But consider a more complex condition, some Chinese apps insert Chinese characters in URLs, like https://y.music.xxx.com/m/song?name=歌曲名称 and http://xxxlink.com/xxxxxx，复制本条信息，打开【xxx】App查看精彩内容！. Without .decodeURL(), may lead to unexpected behaviours.

Recognition of Chinese characters in URLs is also not a bad idea, but it's too targeted and doesn't make me feel very elegant.

CoelacanthusHex commented 7 months ago

oh, java's decode use application/x-www-form-urlencoded. It will convert between space and +.

CoelacanthusHex commented 7 months ago

https://stackoverflow.com/a/16453677

As this answer said, since our input isn't application/x-www-form-urlencoded, so we should use java.net.URI replace URLDecoder to get right result.

CoelacanthusHex commented 7 months ago

https://docs.oracle.com/javase/8/docs/api/java/net/URI.html

Document of java.net.URI also said they follow the RFC. I'm not familiar with Java/Kotlin. You can find the method we need from this document.

CoelacanthusHex commented 7 months ago

https://docs.oracle.com/javase/8/docs/api/java/net/URL.html

Oh. java.net.URL document mentioned the difference between it and URLDecoder.

Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL(). The URLEncoder and URLDecoder classes can also be used, but only for HTML form encoding, which is not the same as the encoding scheme defined in RFC2396.

lz233 commented 7 months ago

Changed to

fun String.encodeURL(): String = Uri.encode(this)
fun String.decodeURL(): String = Uri.decode(this)

For regex, I have tested ((https|http)://)?\p{L}+\.\p{L}+(:\p{Nd})?(\p{Ll}|\p{Lu}|\p{Nd}|/|\?|\+|&|=|\.|-|_|#|%)*, please consider if there are boundary cases. I'm not sure.