matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License

An invalid h tag can freeze the machine due to excessive memory use #143

Open vokiput opened 1 month ago

vokiput commented 1 month ago

markdownify in ./.venv/lib/python3.8/site-packages (0.13.1)

The issue was found with the atheris fuzzing library.

The code to reproduce the issue:

from markdownify import markdownify
markdownify("<html><body><h5555555555>My First Heading</h5555555555><p>My first paragraph.</p></body></html>")

My machine froze (Ubuntu 20.04, 16 GB RAM). Memory usage went to 100% in 2-3 seconds, and the only way to recover was to power the machine off and on again.

The only valid heading tags are h1 through h6, so anything else should not be treated as a heading. This may look like an edge case, but a string like the one in the example could be fed to a server that converts untrusted HTML and cause resource exhaustion.
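A minimal sketch of the kind of check suggested above, assuming the heading level is parsed from the tag name. `heading_level` and `heading_prefix` are hypothetical helpers written for illustration; they are not part of markdownify's API and not necessarily how the library should fix this.

```python
import re

# Hypothetical helper, not markdownify's API: accept only the tag names h1..h6.
_HEADING_RE = re.compile(r"h([1-6])", re.IGNORECASE)


def heading_level(tag_name):
    """Return 1..6 for a valid HTML heading tag name, otherwise None."""
    match = _HEADING_RE.fullmatch(tag_name)
    return int(match.group(1)) if match else None


def heading_prefix(tag_name):
    """Return the Markdown hashes for a valid heading, or '' for anything else."""
    level = heading_level(tag_name)
    if level is None:
        # Invalid tags such as h5555555555 never reach the '#' * n step.
        return ""
    return "#" * level
```

With a guard like this, the repetition count is bounded by 6 regardless of what digits follow the `h` in the input.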

Related cases (these will be fixed once the original issue is fixed):

import sys
markdownify(f"<h5{sys.maxsize // 10}>")
Traceback (most recent call last):
  File "/home/redacted/code/other/atheris/pythonProject/test_unit_009-1.py", line 22, in <module>
    markdownify(f"<h5{sys.maxsize // 10}>")
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 433, in markdownify
    return MarkdownConverter(**options).convert(html)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 105, in convert
    return self.convert_soup(soup)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 108, in convert_soup
    return self.process_tag(soup, convert_as_inline=False, children_only=True)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 151, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 156, in process_tag
    text = convert_fn(node, text, convert_as_inline)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 188, in convert_tag
    return self.convert_hn(n, el, text, convert_as_inline)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 283, in convert_hn
    hashes = '#' * n
MemoryError

and

import sys
markdownify(f"<h{sys.maxsize + 1}>")
Traceback (most recent call last):
  File "/home/redacted/code/other/atheris/pythonProject/test_unit_009-1.py", line 15, in <module>
    markdownify(f"<h{sys.maxsize + 1}>")
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 433, in markdownify
    return MarkdownConverter(**options).convert(html)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 105, in convert
    return self.convert_soup(soup)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 108, in convert_soup
    return self.process_tag(soup, convert_as_inline=False, children_only=True)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 151, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 156, in process_tag
    text = convert_fn(node, text, convert_as_inline)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 188, in convert_tag
    return self.convert_hn(n, el, text, convert_as_inline)
  File "/home/redacted/code/other/atheris/pythonProject/.venv/lib/python3.8/site-packages/markdownify/__init__.py", line 283, in convert_hn
    hashes = '#' * n
OverflowError: cannot fit 'int' into an index-sized integer
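The three inputs fail differently only because of the size of `n` in `'#' * n`. The sketch below classifies a given `n` without building the huge string. `repeat_outcome` is a hypothetical helper, and the size estimate assumes CPython, where each ASCII character in a `str` takes one byte.

```python
import sys


def repeat_outcome(n):
    """Describe what '#' * n would attempt, without allocating the string."""
    if n > sys.maxsize:
        # The count does not fit in an index-sized integer, so Python
        # raises OverflowError before trying to allocate anything.
        return "OverflowError"
    # Assumption: CPython stores ASCII text at one byte per character.
    return f"tries to allocate about {n / 2**30:.1f} GiB"


# The OverflowError case is safe to trigger directly because it fails early.
try:
    "#" * (sys.maxsize + 1)
except OverflowError as exc:
    print(type(exc).__name__, exc)
```

For `n = 5555555555` the estimate is several GiB for the `hashes` string alone, which is small enough for the allocation to be attempted on a 16 GB machine. For `sys.maxsize // 10` the request is far beyond available memory, so it fails with `MemoryError` instead.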