Closed vezinaca closed 4 years ago
import re

str_html_working = "<td>142. The Wild Tchoupitoulas’ <em>The Wild Tchoupitoulas</em><br/>by Bryan Wagner"
str_html_not_working = "<td>85. Portishead’s <i>Dummy</i><br/>by RJ Wheaton<br/>Buy from Bloomsbury: <a href="

# Capture whatever sits between "<td>" and a literal ". " (dot followed by a space)
my_id_85 = re.search(r'<td>(.*)\. ', str_html_not_working)
if my_id_85 is not None:
    print("id:_" + my_id_85.group(1))
else:
    print("no regex found")

# Same pattern on the line that works
my_id_142 = re.search(r'<td>(.*)\. ', str_html_working)
if my_id_142 is not None:
    print("id:_" + my_id_142.group(1))
else:
    print("no regex found")
I see a little white dot between '142.' and 'The' that I don't see between '85.' and 'Portishead'. Could that be it?
The dot means “space”. Maybe there’s a tab on the line with 85 and a space on the other line.
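One way to check what the character after the number really is: print its repr and Unicode name. This is only a sketch. The helper name `describe_char_after` and the two sample lines are made up for illustration, and the no-break space in the second one is an assumption, not something confirmed from the actual page.

```python
import unicodedata

def describe_char_after(text, marker):
    """Return (repr, code point, Unicode name) of the character right after marker."""
    idx = text.index(marker) + len(marker)
    ch = text[idx]
    # Control characters such as tab have no Unicode name, hence the default
    return repr(ch), hex(ord(ch)), unicodedata.name(ch, "UNKNOWN")

# Hypothetical sample lines: the second one uses a no-break space (U+00A0)
line_space = "<td>142. The Wild Tchoupitoulas"
line_nbsp = "<td>85.\u00a0Portishead"

print(describe_char_after(line_space, "142."))
print(describe_char_after(line_nbsp, "85."))
```

If the two lines report different characters, a pattern that requires a literal space after the dot will only match one of them.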
But that’s not the problem. The problem is that .* is too greedy a pattern: you need a pattern that matches digits only.
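A small sketch of the difference, using Python's `re` module. The `sample` string is hypothetical (it has a second ". " later in the line so the greedy behaviour is visible), and `\d+` is just one possible digits-only pattern.

```python
import re

# Hypothetical line with a second ". " further along
sample = "<td>85. Portishead. Dummy"

# Greedy: .* grabs as much as it can, then backs up to the LAST ". "
greedy = re.search(r"<td>(.*)\. ", sample)

# Digits only: \d+ stops at the first non-digit, and nothing is required after the dot
digits = re.search(r"<td>(\d+)\.", sample)

print(greedy.group(1))  # 85. Portishead
print(digits.group(1))  # 85
```

Because the digits-only version does not require a particular character after the dot, it should not matter whether a space, a tab, or something else follows the number.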
Start by matching
There are many sites like this, but you can try https://www.regexpal.com/ to see how your regex behaves in real time.
There are also a million cheatsheets, like this one: https://www.rexegg.com/regex-quickstart.html
Not to repeat myself, but definitely change the .* to something that matches digits :-)
Cheers & let me know if I can help.