raduprv / Eternal-Lands

http://www.eternal-lands.com
Other
157 stars 57 forks source link

Improved URL parsing #134

Closed gamil-zirak closed 3 years ago

gamil-zirak commented 3 years ago

When parsing URLs there must be a non-alpha character to delimit the start of a URL and what's included in the URL more closely follows rfc1738.

gvissers commented 3 years ago

I hop you don't mind if I ask some questions to help me clarify what you changed.

gvissers commented 3 years ago

Partial reply to self: according to rfc3986, a single quote is a sub-delimiter.

gamil-zirak commented 3 years ago

Your understanding is correct of all the changes.

I think we can allow for the mistake of adding a character before the URL if the protocol specified. If the protocol is specified, we keep the old behaviour. However, if we match "www." then we require a non-alpha before the first "w". This should take care of most of the parsing issues that I've seen.

Adding sftp:// as another protocol is easy to do.

I see that the get_position function in font.cpp includes the ascii and non-ascii characters. I should have looked at that more closely. If allowing non-ascii is something we want to do, I can test it some more but I think this might be non-standard browser-dependent behaviour. I prefer to write some code that checks whether the character is ascii printable rather than relying on font.cpp's get_position.

I can't say how often semicolons or single quotes are used as delimiters. I saw that they're allowed in the rfc so that's what I went with. It would be easy to add them back to the delimiter list if we think it would cause more trouble than prevent.

gvissers commented 3 years ago

Would be nice if things would work with non-ascii characters, but I understand if you don't want to get into that.

Not too fond of the explicit range check. How about adding that helper in font.cpp, or using isascii() && isprint()?

There's a min2i in misc.h, let's use that instead of adding yet another MIN macro.

Oh, and I think you got the length to check (MIN(len,4)) wrong: len is the total length of the string, you are only checking from ptr onwards.

gamil-zirak commented 3 years ago

Yes, using isascii and isprint is better. I'll use this until we want to take on allowing non-ascii characters in URLs. Using MIN and len was wrong. I realized that right after I commited it. I don't think we actually need to put a length check in the compare unless we have a search_for string that's less than 4 characters. I don't see that ever happening.

gvissers commented 3 years ago

You must hate me by now :)

The strncmp(ptr, "www.", 4) == 0 check is superfluous.

I don't see that ever happening.

Better safe than sorry. I think we're being too clever with pointers and indices in this piece of code anyway. How about this (UNTESTED)?

    static const char* search_for[] = {"http://", "https://", "ftp://", "www."};
    static const int nr_search_for = sizeof(search_for) / sizeof(*search_for);
    int next_start = 0;

    while (next_start < len)
    {
        int url_start = len; /* set to max */
        int i;

        /* find the first of the url start strings */
        for (i = 0; i < nr_search_for; i++)
        {
            const char* ptr = safe_strcasestr(source_string+next_start, len-next_start, search_for[i], strlen(search_for[i]));
            if (ptr)
            {
                int start_idx = ptr - source_string;
                if (start_idx >= url_start)
                {
                    int ptr_len = len - start_idx;
                    if (strncmp(ptr, "www.", ptr_len) != 0 || ptr == source_string || !isalpha(*(ptr-1)))
                        url_start = start_idx;
                }
            }
        }

        /* if url found, store (if new) it then continue the search straight after the end */
        if (url_start < len)
        { ...

and then the rest changed as in your patch.

gamil-zirak commented 3 years ago

Ah yes, it is superfluous. I suppose I could say that you must hate me too by now but I'm just happy to have you point out these little things to make the code as good as we can. I'll copy the code you posted and run some tests on it. Thank you!

gvissers commented 3 years ago

I see you caught the fact I didn't check if ptr_len was >=4, thanks for that.

Merged now, thank you!