simonw / strip-tags

CLI tool for stripping tags from HTML
Apache License 2.0

Issue with encoding while attempting to strip tags #7

Closed bowlingb closed 1 year ago

bowlingb commented 1 year ago

While attempting to strip tags from the NY Times front page I received the following error:

Traceback (most recent call last):
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\Scripts\strip-tags.exe\__main__.py", line 7, in <module>
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\strip_tags\cli.py", line 119, in cli
    click.echo(final)
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\utils.py", line 299, in echo
    file.write(out)  # type: ignore
  File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 9634: character maps to <undefined>

I don't know if this is an issue with strip-tags or a more core issue with encodings in general.
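For context, the last frames of the traceback point at the output side: `click.echo` is writing to a stream that uses the `cp1252` codec, and U+2192 (a rightwards arrow) has no slot in that code page. A minimal sketch that reproduces just that step, independent of strip-tags (the helper name `can_encode` is made up for illustration):

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


arrow = "\u2192"  # RIGHTWARDS ARROW, the character named in the traceback

print(can_encode(arrow, "cp1252"))  # False: cp1252 has no mapping for U+2192
print(can_encode(arrow, "utf-8"))   # True: UTF-8 can encode any code point
print(arrow.encode("cp1252", errors="replace"))  # b'?'
```

This suggests the failure depends on the encoding Python picked for the output stream rather than on the HTML parsing itself, although that is an inference from the traceback and not something confirmed in this thread.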

simonw commented 1 year ago

Could you grab a copy of an example of HTML that this broke on? I'd like to investigate further.

kmad commented 1 year ago

I got a similar issue with HTML from this document (https://www.sec.gov/Archives/edgar/data/1046179/000114554905001363/u99743a1fv3za.htm)

Error for me is:

Traceback (most recent call last):
  File "/Users/localuser/.local/bin/strip-tags", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/strip_tags/cli.py", line 91, in cli
    soup = BeautifulSoup(input, "html5lib")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/bs4/__init__.py", line 314, in __init__
    markup = markup.read()
             ^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2997: invalid start byte

Looks like a pretty classic encoding error when parsing a file, will look into this.
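A note on the byte in that error: 0x92 is a UTF-8 continuation byte, so it can never start a character, but in Windows-1252 it is the right single quotation mark (U+2019). One common workaround is to try UTF-8 first and fall back to Windows-1252. The sketch below assumes the input is available as raw bytes and that the document really is Windows-1252, neither of which is confirmed here; `decode_html` is a hypothetical helper, not part of strip-tags.

```python
def decode_html(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Windows-1252 if that fails.

    Assumption: anything that is not valid UTF-8 is treated as cp1252.
    Bytes that cp1252 leaves undefined become U+FFFD instead of raising.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


sample = b"<p>the company\x92s filing</p>"  # 0x92 as in the error above
print(decode_html(sample))
```

The fallback is a guess about the source encoding, so it can silently mis-decode files that use some other legacy code page.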

soobrosa commented 11 months ago

@simonw not really elegant but works for me :) https://github.com/simonw/strip-tags/pull/29

5shekel commented 1 week ago

still an issue

> strip-tags.exe --version
strip-tags, version 0.5.1
> curl -s https://arxiv.org/html/2409.18124v4 | strip-tags.exe 

powershell on windows

5shekel commented 1 week ago

@soobrosa, this one-liner helped, thanks:

soup = BeautifulSoup(input, "html5lib", from_encoding='utf-8', multi_valued_attributes=False)

5shekel commented 1 week ago

ok... it works great as is. but piping it to anything else brings this error up. pwsh/pipx/myEnv issue?

PS C:\dev\llmdev> curl -s https://arxiv.org/html/2409.18124v4 | strip-tags.exe | tee 2409.txt
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\users\user\.local\bin\strip-tags.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\strip_tags\cli.py", line 42, in cli
    click.echo(final)
  File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\utils.py", line 318, in echo
    file.write(out)  # type: ignore
    ^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2062' in position 620: character maps to <undefined>
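This last traceback again ends in `cp1252.py`, which would be consistent with Python choosing the locale code page for stdout once the output is piped instead of going to the console. One way to sidestep that, sketched below as an assumption and not as what strip-tags does, is to write through a text wrapper that always encodes as UTF-8. The helper name `utf8_writer` is made up; in a real CLI the binary stream would be `sys.stdout.buffer`, while the example uses an in-memory buffer so the result can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is always UTF-8 encoded."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        errors="replace",  # never raise on output
        newline="\n",      # do not translate line endings
    )


buffer = io.BytesIO()  # stand-in for sys.stdout.buffer
out = utf8_writer(buffer)
out.write("x\u2062y")  # U+2062 is the character from the traceback
out.flush()
print(buffer.getvalue())  # b'x\xe2\x81\xa2y'
```

Without changing any code, setting the `PYTHONUTF8=1` or `PYTHONIOENCODING=utf-8` environment variable before running the command is another documented way to make Python use UTF-8 for its standard streams. Whether PowerShell then decodes the piped bytes correctly depends on its own output encoding settings, which this sketch does not address.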