Sp3EdeR closed 10 months ago
Thanks for making the PR! I feel that this approach is a bit too hacky.
- Ideally, the HTTP library (`requests`) should be doing the decoding. We shouldn't have to do it manually. I'll try to look into why this isn't working out of the box.
- Parsing complex HTML using regexes is a recipe for disaster. I'd prefer to use something like BeautifulSoup or the lxml package for this.
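For illustration, here is a rough sketch of what parser-based extraction could look like. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup or lxml so that it runs without extra dependencies. The class name `ParagraphText`, the helper `extract_paragraphs`, and the sample markup are all hypothetical and not taken from this project.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of every <p> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so character references such as
        # &eacute; arrive in handle_data() already decoded.
        super().__init__()
        self.paragraphs = []
        self._buf = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def _flush(self):
        # Store the paragraph collected so far, if any, and reset the state.
        if self._in_p:
            self.paragraphs.append("".join(self._buf).strip())
        self._buf = []
        self._in_p = False


def extract_paragraphs(markup):
    parser = ParagraphText()
    parser.feed(markup)
    parser.close()   # forces any buffered trailing text through handle_data()
    parser._flush()  # picks up a final <p> that was never closed
    return parser.paragraphs


if __name__ == "__main__":
    sample = '<p class="x">Caf&eacute; <b>bar</b></p><p>&#1052;&#1080;&#1088;'
    print(extract_paragraphs(sample))
```

With BeautifulSoup or lxml the same idea would be shorter, at the cost of one more dependency.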
HTTP knows nothing about HTML, since HTTP can transmit any kind of document. It is of course possible to add a full-on HTML parser library. I did write the regex to be standard-compliant though, so I would expect it to be much more robust than the existing `<p>` parser at this point. Since the script only needs to deal with one specific program's output, this might still be the lowest-impact change.
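To make the layering argument concrete, here is a small sketch that separates the two decoding steps. It assumes the text in question contains HTML character references, which this thread does not confirm. The function names `decode_body` and `decode_markup_text` are hypothetical, and only the standard library is used.

```python
import html


def decode_body(raw: bytes, charset: str = "utf-8") -> str:
    """Transport-level step: turn response bytes into text.

    This is the part an HTTP client can do, based on the charset
    announced in the Content-Type header (assumed to be UTF-8 here).
    """
    return raw.decode(charset)


def decode_markup_text(text: str) -> str:
    """Document-level step: resolve HTML character references.

    An HTTP client has no reason to do this, because it does not
    know that the payload happens to be HTML.
    """
    return html.unescape(text)


if __name__ == "__main__":
    raw = b"Caf&eacute; &#1052;&#1080;&#1088;"  # made-up payload
    as_text = decode_body(raw)
    print(as_text)                      # references are still escaped here
    print(decode_markup_text(as_text))  # references resolved
```

A full HTML parser performs the second step as part of parsing, which is one argument for using one over a regex.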
@iamkroot, I was thinking that a good strategy might be to merge this fix first, so that non-English users get a fix within a short timeframe, and then follow up with an HTML parser library for a more robust solution. That way the maintenance overhead can be resolved as soon as possible, while users get a better experience right away.
I'd prefer to keep this open for now, and only merge a proper solution. Considering that you are the first person to bring up this issue in over 4 years, I think it should be okay to not rush things :)
I've updated the PR according to the review comments.
Closes #205