ndmitchell / hexml

A bad XML parser
Other
19 stars 4 forks source link

Improve the C-ness of the code #7

Open ndmitchell opened 7 years ago

ndmitchell commented 7 years ago

@thoughtpolice in https://www.reddit.com/r/haskell/comments/5i2mg1/new_xml_parser_hexml/db5os2h/ says:

Also, IMO, the C library could be improved a bit, too e.g. it should probably be namespaced so everything is under hexml_, and I'm not sure about the usage of document_parse taking an int vs a size_t which is what strlen returns (int and size_t do not have the same sign, making such loose conversions extremely dangerous).

The only reason document_parse handles int is so you can pass -1 so it will call strlen itself, but frankly, I'd just suggest making a totally separate function to do this. Or just not have this 'feature' at all -- users should always be expected to reasonably know the size of the fragment they pass to document_parse, right? It also assumes that the input to the string is actually null terminated, but if it isn't, strlen is going to do something very bad (by returning some massive integer). Assuming the input is hostile means I'm always going to assume someone, somehow, snuck in a non-null terminated string...

ndmitchell commented 7 years ago

Adding a hexml_ prefix is pure win.

There were good reasons for the int thing originally - you could pass a string which was not null terminated, and use the length. Or you could pass a string that was null terminated, and it wouldn't ever look for the trailing zero in advance, it emerged as a consequence of the operation, so saved you a string walk. As it stands, there's little reason for the -1 nastiness, and thus size_t makes more sense.

ndmitchell commented 7 years ago

I have prefixed all exported functions with hexml_ and added static annotations to the rest. Should I also be mangling the types? e.g. hexml_str? Should it really be hexml_str_t by convention?

Regarding signed operations - I am using int32_t in the str data type since I want to use less memory, and the document is essentially just str data types. It seems like that should really be uint32_t. For length of the string, size_t is plausible, but for most other contexts should I be using size_t? unsigned? uint32_t? What should I be doing by default, and what should I be converting to/from at the edges?