[Bug] Greedy regex matches garbage instead of full title on edge-cases

Hello,

By default, BUP tries to find the title of the web page using the following regex (line 1236 on my version):

REGEX_TITLE = re.compile(r"<title>(.*)</title>", re.IGNORECASE)

The problem is the greedy (.*) part: when a page contains several "title" tags, it matches everything from the first opening tag to the last closing tag, which pollutes the output logs (I often ran into this case, but can't share the targets here).

To fix this, you could simply add a "?" after the "*" quantifier to make the regex "lazy":

REGEX_TITLE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE)

Thus, the match will stop at the first "</title>" tag. Of course, you'll have to make sure that the "search" (line 1532 on my version) doesn't return a list if there are multiple "title" tags in the page.
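
To illustrate the difference, here is a minimal sketch comparing both patterns. The HTML string and the variable names are made up for the demonstration and are not taken from BUP's code:

```python
import re

# Hypothetical page with two title tags (not taken from a real target).
html = "<html><title>first</title><body><title>second</title></body></html>"

greedy = re.compile(r"<title>(.*)</title>", re.IGNORECASE)
lazy = re.compile(r"<title>(.*?)</title>", re.IGNORECASE)

# Greedy: (.*) runs to the end of the line, then backtracks to the LAST </title>.
greedy_title = greedy.search(html).group(1)
# Lazy: (.*?) expands one character at a time and stops at the FIRST </title>.
lazy_title = lazy.search(html).group(1)

print(greedy_title)  # first</title><body><title>second
print(lazy_title)    # first
```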

Example of problematic data:

<html><title>sample_page</title><body><title>sample_page_indeed</title></body></html>
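
As a rough check on this sample, the sketch below runs the lazy pattern through a small helper. The name extract_title is hypothetical and only stands in for whatever BUP does around its "search" call. With Python's re module, search() returns a single match object (or None), while findall() is the call that returns a list:

```python
import re

REGEX_TITLE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE)

def extract_title(html):
    """Return the first title found, or None (hypothetical helper, not BUP's code)."""
    match = REGEX_TITLE.search(html)
    return match.group(1) if match else None

sample = (
    "<html><title>sample_page</title>"
    "<body><title>sample_page_indeed</title></body></html>"
)

first_title = extract_title(sample)       # only the first title tag
all_titles = REGEX_TITLE.findall(sample)  # every title tag, as a list

print(first_title)  # sample_page
print(all_titles)   # ['sample_page', 'sample_page_indeed']
```

So as long as the code relies on search() rather than findall(), only the first title ends up in the logs.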
