Closed: bowlingb closed this issue 1 year ago
Could you grab a copy of an example of HTML that this broke on? I'd like to investigate further.
I got a similar issue with HTML from this document (https://www.sec.gov/Archives/edgar/data/1046179/000114554905001363/u99743a1fv3za.htm)
Error for me is:
Traceback (most recent call last):
File "/Users/localuser/.local/bin/strip-tags", line 8, in <module>
sys.exit(cli())
^^^^^
File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/strip_tags/cli.py", line 91, in cli
soup = BeautifulSoup(input, "html5lib")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/localuser/.local/pipx/venvs/strip-tags/lib/python3.11/site-packages/bs4/__init__.py", line 314, in __init__
markup = markup.read()
^^^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2997: invalid start byte
Looks like a pretty classic encoding error when parsing a file, will look into this.
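For context on the traceback above: byte 0x92 is the right single quotation mark in Windows-1252, so that SEC filing is most likely cp1252 text being read as UTF-8. Below is a minimal sketch of a decode-with-fallback step. The helper name `decode_html` and the choice of cp1252 as the fallback are assumptions for illustration, not what strip-tags actually does.

```python
def decode_html(raw: bytes) -> str:
    """Decode HTML bytes as UTF-8, falling back to Windows-1252.

    Hypothetical helper: the fallback codec is a guess that fits
    the 0x92 byte reported in the traceback.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that cp1252 leaves undefined become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


# 0x92 is not a valid UTF-8 start byte, so this takes the fallback path.
print(decode_html(b"It\x92s a filing"))
```

Another option might be to read stdin as bytes and hand the bytes to BeautifulSoup, which does its own encoding detection when given bytes rather than text.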
@simonw not really elegant but works for me :) https://github.com/simonw/strip-tags/pull/29
This is still an issue for me:
> strip-tags.exe --version
strip-tags, version 0.5.1
> curl -s https://arxiv.org/html/2409.18124v4 | strip-tags.exe
This is in PowerShell on Windows.
@soobrosa, this one-liner helped, thanks:
soup = BeautifulSoup(input, "html5lib", from_encoding='utf-8', multi_valued_attributes=False)
OK, it works great as is, but piping the output to anything else raises this error. Could this be a pwsh/pipx/local environment issue?
PS C:\dev\llmdev> curl -s https://arxiv.org/html/2409.18124v4 | strip-tags.exe | tee 2409.txt
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "c:\users\user\.local\bin\strip-tags.exe\__main__.py", line 7, in <module>
File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\strip_tags\cli.py", line 42, in cli
click.echo(final)
File "C:\Users\user\AppData\Local\pipx\pipx\venvs\strip-tags\Lib\site-packages\click\utils.py", line 318, in echo
file.write(out) # type: ignore
^^^^^^^^^^^^^^^
File "C:\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2062' in position 620: character maps to <undefined>
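The traceback above suggests that, once output is piped, Python picks cp1252 for stdout, and U+2062 has no cp1252 mapping. A possible workaround, sketched below under that assumption, is to switch the output stream to UTF-8 before writing. `force_utf8` and `demo` are hypothetical names, and the in-memory stream only stands in for a piped stdout.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8", errors="replace")
    return stream


def demo() -> bytes:
    # Stand-in for a piped stdout that defaulted to cp1252.
    raw = io.BytesIO()
    piped = io.TextIOWrapper(raw, encoding="cp1252")
    force_utf8(piped)
    # Without the reconfigure step this write raises UnicodeEncodeError.
    piped.write("x\u2062y")
    piped.flush()
    return raw.getvalue()


# In a CLI entry point the equivalent call would be:
#   import sys; force_utf8(sys.stdout)
print(demo())
```

Without touching the code, setting the environment variable `PYTHONUTF8=1` or `PYTHONIOENCODING=utf-8` before running the command (in PowerShell, `$env:PYTHONUTF8 = "1"`) should have a similar effect.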
While attempting to strip tags from the NY Times front page I received the following error:
Traceback (most recent call last):
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\Scripts\strip-tags.exe\__main__.py", line 7, in <module>
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\strip_tags\cli.py", line 119, in cli
click.echo(final)
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\site-packages\click\utils.py", line 299, in echo
file.write(out) # type: ignore
File "C:\Users\Bowlingb\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 9634: character maps to <undefined>
I don't know if this is an issue with strip-tags or a more core issue with encodings in general.
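On that question, the last frames of the traceback are in Python's cp1252 codec, which points at the console/pipe encoding rather than the tag stripping itself. A small sketch for checking which characters in some output a given codec cannot represent follows. `unencodable` is a hypothetical helper written for this thread, not part of strip-tags.

```python
def unencodable(text: str, encoding: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs that `encoding` cannot represent."""
    bad = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, ch))
    return bad


sample = "Next \u2192 page"
# The rightwards arrow U+2192 has no cp1252 mapping but is fine in UTF-8.
print(unencodable(sample, "cp1252"))
print(unencodable(sample, "utf-8"))
```

If the first list is non-empty and the second is empty, the text is fine and only the output encoding needs to change.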