laluka / bypass-url-parser

https://linktr.ee/TheLaluka
GNU Affero General Public License v3.0

[Bug] Greedy regex matches garbage instead of full title on edge-cases #15

Closed ElSicarius closed 1 year ago

ElSicarius commented 1 year ago

Hello,

By default, BUP tries to find the title of the web page using the following regex (line 1236 on my version):

REGEX_TITLE = re.compile(r"<title>(.*)</title>", re.IGNORECASE)

The problem is the greedy (.*) part, which can span several HTML "title" tags and pollute the output logs (I often ran into this case, but can't share the targets here).

To fix this, you could simply add the ? operator to make the quantifier "lazy":

REGEX_TITLE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE)

Thus, the match will stop at the first "</title>" tag. Of course, you'll have to make sure that the "search" (line 1532 on my version) doesn't return a list if there are multiple "title" tags in the page.

Example of problematic data:

<html><title>sample_page</title><body><title>sample_page_indeed</title></body></html>
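
For reference, here is a minimal sketch of the difference on the sample above. The two regexes are the ones quoted in this report; the names REGEX_TITLE_GREEDY, REGEX_TITLE_LAZY and extract_title are made up for the illustration and are not taken from the BUP code.

```python
import re

# Sample page from this report: two <title> elements in one document.
HTML = (
    "<html><title>sample_page</title>"
    "<body><title>sample_page_indeed</title></body></html>"
)

# Current pattern (greedy) and proposed pattern (lazy).
REGEX_TITLE_GREEDY = re.compile(r"<title>(.*)</title>", re.IGNORECASE)
REGEX_TITLE_LAZY = re.compile(r"<title>(.*?)</title>", re.IGNORECASE)


def extract_title(pattern, html):
    """Hypothetical helper: return the first captured title, or None."""
    # Pattern.search returns a single Match object (or None), not a list.
    match = pattern.search(html)
    return match.group(1) if match else None


greedy_title = extract_title(REGEX_TITLE_GREEDY, HTML)
lazy_title = extract_title(REGEX_TITLE_LAZY, HTML)

print(greedy_title)  # sample_page</title><body><title>sample_page_indeed
print(lazy_title)    # sample_page
```

As far as I can tell, re.search only ever yields one match, so the list concern would apply only if the code used findall or finditer instead.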

laluka commented 1 year ago

Fixed, thanks mate! :clap: https://github.com/laluka/bypass-url-parser/commit/69b541346f33654e6c7b1639a47957f47651e93d