html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License

HTMLTokenizer.stream.chunkOffset not updating on string with no html elements #571

Open ehsmeng opened 1 year ago

ehsmeng commented 1 year ago
from html5lib._tokenizer import HTMLTokenizer
from io import StringIO

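class T:
    def __init__(self, data):
        print("Object from string: " + data)
        # The tokenizer is created on an empty stream; the data is written afterwards.
        self.src = StringIO()
        self.tokenizer = HTMLTokenizer(self.src)

        # Write the data, then rewind so the tokenizer reads it from the start.
        pos = self.src.tell()
        self.src.write(data)
        self.src.seek(pos)
        self.handle_tokens()
        self.src.close()

    def handle_tokens(self):
        # Print the stream's chunk offset after each token is emitted.
        for token in self.tokenizer:
            print(self.tokenizer.stream.chunkOffset)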

T("klas katt")
T("klas katt<br>")

Output:

Object from string: klas katt
0
Object from string: klas katt<br>
9
13

I expected the first number printed to be 9 (the length of "klas katt"), not 0.

Apologies if chunkOffset is an internal variable I should not be using. I'm trying to deduce tag offsets (start, stop) in the HTML document.
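
For comparison, here is a minimal sketch of how (start, stop) offsets for start tags could be collected with the standard library's html.parser instead of html5lib's internal tokenizer. This is a swapped-in technique, not html5lib's API: html.parser does not follow the HTML5 parsing algorithm, so results may differ on malformed markup. The names TagOffsetParser and tag_offsets are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TagOffsetParser(HTMLParser):
    """Hypothetical helper: records (tag, start, stop) for each start tag."""

    def __init__(self, text):
        super().__init__(convert_charrefs=True)
        # Absolute index of the first character of every line. HTMLParser
        # counts lines by "\n" only, so the same rule is used here.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
        self.offsets = []
        self.feed(text)
        self.close()

    def handle_starttag(self, tag, attrs):
        # getpos() returns a 1-based line number and a 0-based column.
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        # get_starttag_text() is the raw source text of the tag just parsed.
        stop = start + len(self.get_starttag_text())
        self.offsets.append((tag, start, stop))


def tag_offsets(text):
    """Return a list of (tag, start, stop) character offsets in text."""
    return TagOffsetParser(text).offsets


print(tag_offsets("klas katt"))
print(tag_offsets("klas katt<br>"))
```

With the two strings from the report, the first call finds no tags and the second reports the br tag spanning offsets 9 to 13, which are the numbers I was hoping to get from the tokenizer.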