micropython / micropython-lib

Core Python libraries ported to MicroPython

XML tokenizer #37

Closed pfalcon closed 8 years ago

pfalcon commented 9 years ago

Lately, I was looking into parsing XML. Doing my homework, I looked into a bunch of small XML parsers, but they were all still too bloated, insisted on well-formedness checking, and had other quirks.

I was looking for a really minimal way to parse/extract the useful parts of XML, and that's just working with a stream of tokens. So, I wrote a simple tokenizer in Python. There's no XML tokenizer module in the CPython standard library, but I guess it would be a nice addition to micropython-lib, as other XML APIs are unlikely to come soon.

Thoughts?

I propose "xmltok" for the module name.
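The stream-of-tokens idea could look roughly like this. This is a hypothetical sketch, not the actual xmltok API: it does no well-formedness checking and ignores attributes, comments and CDATA entirely.

```python
import io

def tokens(stream):
    # Illustrative sketch only: yield ("START_TAG"/"END_TAG"/"TEXT", value)
    # tuples while reading the stream one character at a time, so the whole
    # document never needs to sit in RAM.
    text = []
    while True:
        c = stream.read(1)
        if not c:
            break
        if c == "<":
            if text:
                yield ("TEXT", "".join(text))
                text = []
            tag = []
            c = stream.read(1)
            while c and c != ">":
                tag.append(c)
                c = stream.read(1)
            name = "".join(tag)
            if name.startswith("/"):
                yield ("END_TAG", name[1:])
            else:
                # drop any attributes, keep just the tag name
                yield ("START_TAG", name.split()[0])
        else:
            text.append(c)

for tok in tokens(io.StringIO("<a><b>hi</b></a>")):
    print(tok)
```

A parser for a specific application would then just iterate over the token stream and pick out the parts it cares about, skipping everything else.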

dpgeorge commented 9 years ago

What are the existing (generic) tokeniser options in CPython? A generic tokeniser might be more useful (but also more difficult to design, so copying an existing one might be the way to go).

pfalcon commented 9 years ago

In the CPython stdlib? re. In general? PLY (http://www.dabeaz.com/ply/) appears to be a popular choice, but I've never used it.

And yes, as usual, implementing a tokenizer for X via a generic tokenization framework and writing an ad-hoc, optimized tokenizer for X are quite different tasks (with quite different results). (But I ended up reusing the generic hand-coded LL(1) parser approach here, which is my favorite.)
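The hand-coded LL(1) approach mentioned here boils down to keeping one character of lookahead and deciding what to do by peeking at it. A minimal sketch of that pattern (names are illustrative, not from any actual module):

```python
import io

class CharSource:
    # Keep exactly one pending character of lookahead (hence LL(1)):
    # peek() inspects it, next() consumes it and refills.
    def __init__(self, stream):
        self.stream = stream
        self.c = stream.read(1)

    def peek(self):
        return self.c

    def next(self):
        c = self.c
        self.c = self.stream.read(1)
        return c

    def match(self, expected):
        # Consume the expected character or fail loudly.
        assert self.c == expected, "expected %r, got %r" % (expected, self.c)
        return self.next()

# Tokenize a single tag by dispatching on the lookahead character.
src = CharSource(io.StringIO("<a>"))
src.match("<")
name = ""
while src.peek() not in (">", ""):
    name += src.next()
src.match(">")
print(name)  # → a
```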

dpgeorge commented 9 years ago

PLY looks neat. I tried it with uPy but it won't run because it uses sys._getframe to munge locals/globals. A shame. But anyway, it's overkill for implementing a simple XML parser.

For parsing XML in CPython there are many (too many I'd say) ways of doing it; eg lxml, ElementTree, minidom. But my guess is that these all parse the entire file into RAM, which isn't going to work well in uPy for large files, especially when you only need a few bits of the XML data (or can operate on the data in a sequential way).

So I'd say having an XML tokeniser with which you can write your own simple XML parser for a given application is a good idea.

In CPython the xml stuff lives in xml/. So the tokeniser could be xml/tok.py. But I think xmltok.py is simpler and probably a better option so we don't clash with CPython in the future. (Or we could have uxml/ package...).

Is it worth thinking about an XML writer (eg xmlprint), so as to have a compatible counterpart to xmltok?

danicampora commented 9 years ago

I have an XML tokenizer that I wrote in C from scratch, because everything else I could find seemed bloated and/or overkill. I wrote it ages ago, so it can probably be cleaned up and optimized a lot, but it works pretty well and has been deployed (on the LaserTag vests) without any issues. It reads the XML file in 512-byte chunks and calls the user-registered handler for each token. It supports attributes as well.
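The chunked-read plus callback pattern described here keeps memory use bounded regardless of file size. The original is in C; this Python version is purely illustrative (the function and handler names are invented for the sketch):

```python
import io

def feed_chunks(stream, handler, chunk_size=512):
    # Read the file in fixed-size chunks and hand each one to the
    # user-registered handler, so the whole file never sits in RAM.
    # A real tokenizer would carry partial-token state across chunks.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        handler(chunk)

parts = []
feed_chunks(io.BytesIO(b"a" * 1030), parts.append, 512)
print([len(p) for p in parts])  # → [512, 512, 6]
```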

I'll prepare it for public consumption and share it over here; maybe you'll like it, maybe not.


pfalcon commented 9 years ago

For parsing XML in CPython there are many (too many I'd say) ways of doing it

So, the "minimal" way to deal with XML in CPython is the pyexpat module, which now has the "external API" name xml.parsers.expat: https://docs.python.org/3/library/pyexpat.html . As a reminder, there are 2 approaches (APIs) to XML parsing: on the lower level, SAX, which is essentially a glorified tokenizer (with well-formedness checking, etc.) that returns a stream of tokens, but is usually structured as a callback API; and DOM, which builds the complete tree in memory.
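For comparison, the SAX-style callback shape of the CPython expat API looks like this: the handlers fire as the parser streams through the input, and no tree is ever built.

```python
import xml.parsers.expat

# Collect the callback events so we can see the token-like stream.
events = []
p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
p.EndElementHandler = lambda name: events.append(("end", name))
p.CharacterDataHandler = lambda data: events.append(("text", data))

# isfinal=True: this single string is the whole document.
p.Parse("<root><item id='1'>hi</item></root>", True)
print(events)
```

Note that expat, unlike a bare tokenizer, will raise xml.parsers.expat.ExpatError on malformed input, which is exactly the well-formedness machinery being called bloated above.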

expat is the reference SAX parser, but both it and the pyexpat module are too bloated for uPy to mimic. But it gives a good idea: even though official Python XML handling lives in the xml.* package, individual support modules don't have to, so yes, having just an "xmltok" module should be ok.

pfalcon commented 9 years ago

Is it worth thinking about an XML writer (eg xmlprint), so as to have a compatible counterpart to xmltok?

I don't know. It's much easier to generate structured information than to parse it - you can just pretend it's flat ;-). It would be a different matter if whole Python structures mapped (almost) directly to XML (as is the case for JSON), but they don't, so there can't be a solution that is both universal and easy, so I'd skip that for now.

pfalcon commented 9 years ago

I have an XML tokenizer that I wrote in C from scratch, because everything else I could find seemed bloated and/or overkill. ... I'll prepare it for public consumption and share it over here; maybe you'll like it, maybe not.

Yes, feel free to, but for this case I've already settled on a Python parser; maybe it'll be useful on the next iteration. (If someone had asked about XML for uPy, I'd myself have answered: skip it! But lately I decided to play with UPnP, and here it goes ;-). )

pfalcon commented 9 years ago

When deciding on the API for the XML tokenizer (and that's the critical-path TODO to finish this ticket), it's worth trying to follow the stdlib tokenize module API wherever it makes sense: https://docs.python.org/3/library/tokenize.html
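The stdlib tokenize API being suggested as the model: tokenize() is a generator over a readline callable, yielding named tuples with a .type, the token .string, start/end positions, and the source line.

```python
import io
import tokenize

# tokenize() wants a readline callable over bytes; the first token it
# yields is always ENCODING, and the last is ENDMARKER.
toks = list(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))
for tok in toks:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

An xmltok following this model would presumably also be a generator taking a stream and yielding token tuples, which suits memory-constrained uPy targets well.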

dpgeorge commented 9 years ago

Yes, that makes sense. It's really just the one function: tokenize(). It also leaves room to implement untokenize() in the future for an XML pretty printer.
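A sketch of how such an untokenize() counterpart might look, assuming hypothetical (type, value) token tuples; the actual xmltok API had not been decided at this point in the thread.

```python
def untokenize(toks):
    # Inverse of the hypothetical tokenizer: serialize a token stream
    # back into markup (no attributes, no escaping, no pretty printing).
    out = []
    for typ, val in toks:
        if typ == "START_TAG":
            out.append("<%s>" % val)
        elif typ == "END_TAG":
            out.append("</%s>" % val)
        else:  # TEXT
            out.append(val)
    return "".join(out)

print(untokenize([("START_TAG", "a"), ("TEXT", "hi"), ("END_TAG", "a")]))
# → <a>hi</a>
```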