jbowens / jBBCode

A lightweight but extensible BBCode parser
http://jbbcode.com
MIT License

[RFC] Tokenizer string deduplication #49

Closed · DaSourcerer closed this 9 years ago

DaSourcerer commented 9 years ago

Building on #48, I experimented a bit with storing only token positions instead of entire tokens, leading to a slight decrease in memory usage. Results are so-so, and throughput improvements so far are within the margin of error. Still, I think this should be discussed.
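For context, here is a minimal sketch of the approach (illustrative names, not the actual jBBCode API): the tokenizer records only (offset, length) pairs and slices the text out of the preserved input on demand.

```php
<?php
// Minimal sketch of position-based token storage (illustrative names,
// not jBBCode's actual API). Instead of copying every token's text out
// of the input, we keep only (offset, length) pairs and slice lazily.
class PositionTokenizer
{
    private $input;
    private $tokens = array(); // each entry: array($offset, $length)
    private $i = 0;

    public function __construct($input)
    {
        $this->input = $input;
        // Split on brackets, capturing offsets instead of copying strings.
        $pieces = preg_split('/([\[\]])/', $input, -1,
            PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE);
        foreach ($pieces as $piece) {
            // $piece[0] is the matched text, $piece[1] its byte offset.
            $this->tokens[] = array($piece[1], strlen($piece[0]));
        }
    }

    public function next()
    {
        if ($this->i >= count($this->tokens)) {
            return null;
        }
        list($offset, $length) = $this->tokens[$this->i++];
        // The only string copy happens here, on demand.
        return substr($this->input, $offset, $length);
    }

    public function restart()
    {
        $this->i = 0;
    }
}

$t = new PositionTokenizer('[b]hello[/b]');
while (($tok = $t->next()) !== null) {
    echo $tok, "\n"; // [, b, ], hello, [, /b, ]
}
```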

DaSourcerer commented 9 years ago

I find it a bit interesting that this PR isn't attracting any comments. @Art4? @shmax? @Kubo2?

Art4 commented 9 years ago

I don't see much benefit in storing only the positions. The slight decrease in memory usage leads to an increase in CPU usage because of substr(). Using $tokenizer->restart() one or multiple times leads to even more CPU usage, whereas if we store the entire tokens, repeated restarts won't increase the memory usage.
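To make the trade-off concrete, here is a rough micro-benchmark sketch (illustrative only; these are not measurements from this PR). Position-only storage pays for substr() on every pass after a restart(), while full-string storage pays for the copies once up front:

```php
<?php
// Rough sketch of the trade-off (illustrative only): position-only
// tokens re-pay substr() on every pass, materialized tokens do not.
$input  = str_repeat('[b]bold[/b] plain text ', 5000);

// Build position-only tokens, accumulating offsets while splitting
// on brackets. Cheap on memory: two integers per token.
$tokens = array();
$offset = 0;
$pieces = preg_split('/([\[\]])/', $input, -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
foreach ($pieces as $piece) {
    $tokens[] = array($offset, strlen($piece));
    $offset  += strlen($piece);
}

$start = microtime(true);
for ($pass = 0; $pass < 10; $pass++) {       // simulates repeated restart()s
    foreach ($tokens as $t) {
        $s = substr($input, $t[0], $t[1]);   // a fresh copy on every pass
    }
}
printf("position-only: %.4fs\n", microtime(true) - $start);

// Materialize the token strings once, as the current tokenizer does.
$strings = array();
foreach ($tokens as $t) {
    $strings[] = substr($input, $t[0], $t[1]);
}

$start = microtime(true);
for ($pass = 0; $pass < 10; $pass++) {
    foreach ($strings as $s) {
        $n = strlen($s);                     // cheap read, no copy
    }
}
printf("full-string:   %.4fs\n", microtime(true) - $start);
```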

shmax commented 9 years ago

My first thought is that if you're interested in improving the tokenizer, a more valuable exercise might be to create an actual tokenizer. This would produce output similar to token_get_all (http://php.net/manual/en/function.token-get-all.php) and would not fall victim to some of the bracket traps that have been discussed here before (such as brackets in URLs). Because it is common for tokenizers to create objects that encapsulate both a type and a string snippet, that rather moots your approach here, so I would probably not pursue it.
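For the record, a sketch of what such typed tokens could look like (hypothetical names; jBBCode's current tokenizer emits plain strings). Like the output of token_get_all, each token carries its type alongside the text:

```php
<?php
// Sketch of typed tokens (hypothetical names, not an existing API).
class Token
{
    const T_TEXT      = 0;
    const T_TAG_OPEN  = 1;
    const T_TAG_CLOSE = 2;

    public $type;
    public $text;
    public $position;

    public function __construct($type, $text, $position)
    {
        $this->type = $type;
        $this->text = $text;
        $this->position = $position;
    }
}

// Naive scan: well-formed tags, runs of text, or a stray '['. A full
// tokenizer would track context (e.g. quoted option values) to dodge
// the bracket traps mentioned above; this sketch only shows the shape.
function tokenize($input)
{
    $tokens  = array();
    $matches = array();
    preg_match_all('/\[\/?[a-zA-Z]\w*(?:=[^\]]*)?\]|[^\[]+|\[/',
        $input, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as $m) {
        list($text, $pos) = $m;
        if ($text[0] !== '[' || substr($text, -1) !== ']') {
            $type = Token::T_TEXT;          // plain text or a stray '['
        } elseif ($text[1] === '/') {
            $type = Token::T_TAG_CLOSE;     // e.g. [/b]
        } else {
            $type = Token::T_TAG_OPEN;      // e.g. [b] or [url=...]
        }
        $tokens[] = new Token($type, $text, $pos);
    }
    return $tokens;
}
```

Downstream code could then switch on a token's type instead of re-inspecting raw strings.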

I'd be greatly interested in collaborating on such a project.

Kubo2 commented 9 years ago

I don't see any benefit in this change (maybe there is some for very large input, but I am not very sure about that). It would more likely pay off to create a completely new tokenizer, one that also recognizes other types of tokens (e.g. option, option value, and ideally also specific (data) types of option values), not only the two it currently does (@shmax drew a similar conclusion). That would be a really great change and also a first step on the path to rewriting the parser. :-)
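To illustrate, a hypothetical set of such richer token kinds could look like this (all names made up for illustration):

```php
<?php
// Hypothetical token kinds for a richer BBCode tokenizer (illustrative
// only). '[url=http://example.com]link[/url]' could then tokenize to:
//   TAG_OPEN('url'), OPTION_VALUE('http://example.com'),
//   TEXT('link'), TAG_CLOSE('url')
abstract class TokenKind
{
    const TEXT         = 'text';          // plain character data
    const TAG_OPEN     = 'tag_open';      // [b], [url=...]
    const TAG_CLOSE    = 'tag_close';     // [/b]
    const OPTION_NAME  = 'option_name';   // width in [img width=100]
    const OPTION_VALUE = 'option_value';  // 100 in [img width=100]
}

// Splitting a tag body into option tokens; quoted values keep their
// spaces (and brackets) intact, which plain string splitting cannot do.
function tokenizeOptions($body)
{
    $tokens = array();
    preg_match_all('/(\w+)\s*=\s*("[^"]*"|[^\s\]]+)/', $body, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $p) {
        $tokens[] = array(TokenKind::OPTION_NAME,  $p[1]);
        $tokens[] = array(TokenKind::OPTION_VALUE, trim($p[2], '"'));
    }
    return $tokens;
}

var_dump(tokenizeOptions('img width=100 alt="a picture"'));
```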

DaSourcerer commented 9 years ago

Let me thank you all for your responses first; they're greatly appreciated. Perhaps I should explain what I tried to do here. It struck me one evening that we are holding input strings twice in memory: once as the preserved input and once as the tokenized strings. Being a big fan of deduplication, I hoped that storing only the token offsets would lead to a notable decrease in memory consumption and, en passant, open the way to both reduced overhead and a simplified tokenizer. Unfortunately, I severely overestimated the possible gain: it is now clear that there will only be a notable difference once the processed input reaches the range of several kB. Although I do have input that large in my project, it remains a rare exception, which makes this kind of optimization not worthwhile.

Now, the idea of transferring more logic from the parser to the tokenizer runs very much counter to that, but it is also intriguing. I'll try to sketch out a few ideas over the weekend and report back. In the meantime: all input is welcome :8ball: