ruby / uri

URI is a module providing classes to handle Uniform Resource Identifiers
https://ruby.github.io/uri/

Pattern matching does not work for Google search results #74

Open brandonbrown5 opened 1 year ago

brandonbrown5 commented 1 year ago

A Google search result URL raises an Invalid URI error. It appears the regular expression here does not recognize this as a valid URL; however, you can navigate to it in a browser.

URL: https://www.google.com/search?q=capt.%20jacks%20family%20buffet&rlz=1C2CHBF_enUS902US902&sxsrf=APwXEdehG3ObQHEcqZT0clDT-XUDJ2iaXg:1681756568453&source=hp&ei=jpE9ZIioNOvGkPIPzP2ayAE&iflsig=AOEireoAAAAAZD2fnm-EI4rFn06RvhHNRndJIcwCmIRY&oq=capt.+jack&gs_lcp=Cgdnd3Mtd2l6EAEYADIFCAAQgAQyCgguEIAEENQCEAoyBwgAEIAEEAoyCwguEIAEEMcBEK8BMgUIABCABDIFCAAQgAQyBQgAEIAEMgcIABCABBAKMgoILhCABBDUAhAKMggIABCKBRCGAzoHCCMQ6gIQJzoECCMQJzoICAAQigUQkQI6CAgAEIAEELEDOhEILhCABBCxAxCDARDHARDRAzoOCC4QgAQQsQMQxwEQ0QM6DgguEIoFEMcBENEDEJECOg4ILhCABBDJAxDHARCvAToFCC4QgAQ6DgguEIoFEMcBEK8BEJECOgsIABCKBRCxAxCRAjoOCC4QgAQQsQMQgwEQ1AI6CwguEK8BEMcBEIAEOg0ILhCABBDHARCvARAKOgcILhCABBAKOggILhCABBDUAlC3DliGK2DeN2gBcAB4AIABnwGIAaIJkgEDMi44mAEAoAEBsAEK&sclient=gws-wiz&tbs=lf:1,lf_ui:4&tbm=lcl&rflfq=1&num=10&rldimm=425615111808136386&lqi=ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA&ved=2ahUKEwjS1_G2x7H-AhUxtTEKHclvB_cQvS56BAgWEAE&sa=X&rlst=f#rlfi=hd:;si:425615111808136386,l,ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA;mv:[[30.1955067,-85.7794086],[30.161907099999993,-85.8386264]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4

URI.parse(url) raises the following error: lib/uri/rfc3986_parser.rb:66:in `split'. I believe this is caused by the regular expression not matching this URL.

duerst commented 1 year ago

The regular expression does not match the URI because the relevant grammar (in RFC 3986) does not allow '[' or ']' in the fragment part (the part after the '#'). See https://www.rfc-editor.org/rfc/rfc3986#appendix-A, and look for 'fragment' and 'gen-delims'. The '[' and ']' characters are in gen-delims, and the fragment rule admits only pchar, '/' and '?', none of which covers '[' or ']'. As the filename where the error message originates makes clear, it's a parser for RFC 3986 URIs, so it had better follow that spec. That means that we can close this issue, because the Regexp matches the spec.
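To make the grammar argument concrete, here is a minimal sketch that checks a fragment against the RFC 3986 rule `fragment = *( pchar / "/" / "?" )`. The helper name `illegal_fragment_chars` and the shortened sample fragment are hypothetical, chosen for illustration; the character class is transcribed by hand from Appendix A and is not the regular expression the library itself uses.

```ruby
# Characters RFC 3986 permits literally in a fragment:
# unreserved, sub-delims, ":" and "@" (together: pchar), plus "/" and "?".
ALLOWED_IN_FRAGMENT = %r{[A-Za-z0-9\-._~!$&'()*+,;=:@/?]}

# Hypothetical helper: returns the distinct characters in +fragment+
# that the RFC 3986 fragment rule does not allow.
def illegal_fragment_chars(fragment)
  fragment
    .gsub(/%\h\h/, "")                              # drop valid pct-encoded triplets
    .each_char
    .reject { |c| c.match?(ALLOWED_IN_FRAGMENT) }   # keep only disallowed characters
    .uniq
end

# Shortened stand-in for the tail of the reported fragment.
sample = "mv:[[30.1,-85.7],[30.2,-85.8]]"
p illegal_fragment_chars(sample)
```

For this sample only the square brackets are reported; the ':' and ',' characters pass because they are covered by pchar.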

The grammar in RFC 2396 (https://www.rfc-editor.org/rfc/rfc2396) is more lenient, and is available in lib/uri/rfc2396_parser.rb, so you may want to try it.
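A minimal sketch of that suggestion, assuming a shortened stand-in URL (the `example.com` host and the query are hypothetical) that keeps the '[' and ']' in the fragment. It contrasts `URI.parse` with an instance of `URI::RFC2396_Parser`, the class defined in lib/uri/rfc2396_parser.rb; behavior may differ across versions of the uri gem.

```ruby
require "uri"

# Hypothetical shortened URL; only the bracketed fragment matters here.
url = "https://example.com/search?q=buffet#mv:[[30.1,-85.7],[30.2,-85.8]]"

# Strict parsing: URI.parse follows the RFC 3986 grammar and is
# expected to reject the brackets in the fragment.
strict_error =
  begin
    URI.parse(url)
    nil
  rescue URI::InvalidURIError => e
    e
  end

# Lenient parsing: the RFC 2396 parser treats '[' and ']' as reserved
# characters, which its fragment rule accepts.
lenient = URI::RFC2396_Parser.new.parse(url)

puts strict_error.class
puts lenient.fragment
```

Note that a URI parsed this way is only known to satisfy the older grammar, so code that later hands it to an RFC 3986 consumer may still need to percent-encode the brackets.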

[In Thunderbird, where I saw your message first, the URI is colored up to just before the first ':' in the fragment, and when I click on it, only the part before that ':' is sent to the browser, but both RFC 3986 and RFC 2396 allow ':' in fragments, so this behavior is difficult to explain.]