mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant with the CommonMark specification.
MIT License

Some proposed code improvements #1

Closed tin-pot closed 7 years ago

tin-pot commented 7 years ago

You might want to consider (mostly) the commits 3202553 and 2af7f0a, and safely ignore what is going on in the msvc and VC9 directories ...

These changes concern for the most part "signed vs unsigned" conflicts, and some minor oddities. See the comments in the latest commit 2af7f0a.

I still had to introduce some (unsigned) casts: the explicitly signed index type used for mark->next et al. and the unsigned counter types used for ctx->n_marks etc. don't mix well.

The build with MD4C_WIN_USE_UNICODE seems broken right now: in md2html.c, pointers to char and MD_CHAR get mixed up, among other similar issues. I don't care about Windows' UTF-16 (or is it UCS-2?) flavor of Unicode, so this is fine with me for the time being.

Thank you, and keep up the good work!

coveralls commented 7 years ago


Coverage decreased (-0.4%) to 91.326% when pulling 2af7f0a9e5c9a9e31eb2198faae21d367e142b71 on tin-pot:master into c5fa9a70947e7742876622bb6d7ce0b1252b8d06 on mity:master.

coveralls commented 7 years ago


Coverage decreased (-0.4%) to 91.326% when pulling e8c95ee4009e579455aeca7534c3ab4fa1d211cf on tin-pot:master into c5fa9a70947e7742876622bb6d7ce0b1252b8d06 on mity:master.

mity commented 7 years ago

I fixed the signed/unsigned issues a bit differently, and mostly switched to int where we can. This is IMHO better because for a lot of internal stuff, we need to use -1 as an invalid index (e.g. for marks).

As for MD4C_WIN_USE_UNICODE: I fixed the few issues inside md4c.c. But please note md2html does not support it and I do not plan to.

MD4C_WIN_USE_UNICODE is meant more for an interactive Unicode GUI application on Windows. Consider e.g. some chat GUI app. There are not many documents in UTF-16LE out there, so I don't see much need to support it in a command-line tool.

And yes, it is UTF-16LE, not UCS-2. Windows introduced full support for UTF-16 (i.e. added support for surrogates) in Win2k or maybe XP, if I recall correctly.

But it's true this might deserve some clarification in README.md or something.

mity commented 7 years ago

I have manually incorporated the mentioned stuff.

But note it would be so much more comfortable for me if you could create individual pull requests for independent changes (e.g. from individual per-pull-request branches), and only for things meant to go upstream. This kind of pull request does not allow independent code review and merging without manual steps, and the relevant changes get lost among the many changes not meant to go upstream.

coveralls commented 7 years ago


Coverage decreased (-0.4%) to 91.326% when pulling 0920244026c6df52d4ec8b86104bf1a47910fd12 on tin-pot:master into bc52610ebc21cd175c77e05d13480de8fd683b82 on mity:master.

tin-pot commented 7 years ago

And yes, it is UTF-16LE, not UCS-2. Windows introduced full support for UTF-16 (i.e. added support for surrogates) in Win2k or maybe XP, if I recall correctly.

But it's true this might deserve some clarification in README.md or something.

Ah, thanks for the hint. I was not sure if "Unicode" in Windows still meant UCS-2 (that is, the BMP), but it seems full UTF-16 is supported.

But note that the designation "UTF-16LE" refers to the ordering of octets within a 16-bit "code unit": as it is natural for the x86 architecture, the «less significant octet precedes the more significant octet (also known as little-endian order)» [ISO/IEC 10646:2014, clause 10.4], which is what Windows uses.

That the "high-surrogate code unit" (in the range D800..DBFF) is the leading code unit in UTF-16, and is followed by a "low-surrogate code unit" (in the range DC00..DFFF) as the trailing code unit, is required by ISO/IEC 10646, and even Microsoft seems to conform:

For UTF-16, a "surrogate pair" is required to represent a single supplementary character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF.

(Don't let these scalar ranges confuse you!)

That's why I found your comment in md4c.c (around l. 688) a bit puzzling (emphasis mine):

/* The encoding known called on Windows simply as "Unicode" is actually little-endian UTF-16, i.e. the low surrogate precedes the high surrogate. */

mity commented 7 years ago

That the "high-surrogate code unit" (in the range D800..DBFF) is the leading code unit in UTF-16, and is followed by a "low-surrogate code unit" (in the range DC00..DFFF) as the trailing code unit, is required by ISO/IEC 10646, and even Microsoft seems to conform

Yeah, great catch. Maybe I was over-thinking it.

That's why I found your comment in md4c.c (around l. 688) a bit puzzling (emphasis mine):

/* The encoding known called on Windows simply as "Unicode" is actually little-endian UTF-16, i.e. the low surrogate precedes the high surrogate. */

If it is the comment that brought your attention to this issue, then the comment worked perfectly ;-)

tin-pot commented 7 years ago

If it is the comment that brought your attention to this issue, then the comment worked perfectly ;-)

Hehe - it kind of did, yes.

And this brings up the question of whether MD4C

  1. should really depend on the <windows.h> header (I think <tchar.h> would be enough); and
  2. should also support the 32-bit wchar_t type as used by gcc for UTF-32 (at least in some implementations, as far as I know); and
  3. should support/use/emulate the new-fangled <uchar.h> Unicode support types and functions that C11 (ISO/IEC 9899:2011) brought us, that is char16_t for UTF-16 and char32_t for UTF-32 etc.

But that seems to be a topic for a dedicated thread (maybe on your project's Wiki?)

mity commented 7 years ago

should really depend on the <windows.h> header (I think <tchar.h> would be enough)

<tchar.h> is more to introduce the ANSI/Unicode dualism depending on -DUNICODE, to make building an ANSI or Unicode Windows binary from a single code base possible. It's not for introducing the type WCHAR or wchar_t itself. Maybe something like <winnt.h> or <ntdef.h> would be more appropriate but I am not sure if apps are supposed to include these directly. Anyway, even console-type apps often include <windows.h> on Windows.

should also support the 32-bit wchar_t type as used by gcc for UTF-32 (at least in some implementations, as far as I know)

It's not so much a matter of gcc or the compiler, but rather of the standard C library implementation and the contract provided by the given platform. BTW, this should be true wherever the macro __STDC_ISO_10646__ is predefined.

I would keep the door open but implement it only if there is real demand. I don't see much sense in developing something that would not be tested/maintained.

Also note that on such platforms wchar_t is actually rarely used as the (really widely used) string representation. E.g. on Linux (where this holds), most libs like Qt/KDE/GTK+ and so on work with UTF-8 directly.

I think Windows is quite specific in its use of wide strings: resources embedded in an .exe or .dll (e.g. string tables, dialog templates etc.), as well as any keyboard-handling API and the Win32API as a whole, predominantly use WCHAR* strings. (Well, ANSI versions of the same functions are provided too, but they mostly just hide the conversions; Windows libs internally work with the wide strings.)

should support/use/emulate the new-fangled <uchar.h> Unicode support types and functions that C11 (ISO/IEC 9899:2011) brought us, that is char16_t for UTF-16 and char32_t for UTF-32 etc.

I don't want to make C11 a requirement. Too many platforms do not support it, so we cannot rely on it at the interface level (md4c.h), and introducing some emulation of it just for internal purposes, i.e. for a single source file, does not make much sense to me.

tin-pot commented 7 years ago

I largely agree with your positions: my remarks weren't meant as urgent requests, but mostly to map out possible development directions.

But still I see the difference between using <tchar.h> vs <windows.h> as the difference between compiler/runtime-library vs platform. Since MD4C does not depend on any platform-specific features, but only on features of the compiler/runtime-library aka "C implementation", I'd prefer <tchar.h>.

Note that <tchar.h> restrains itself to names reserved for the implementation, while <windows.h> of course introduces a lot of names into the program's namespace. I consider this an important point.

The <tchar.h> header does in fact (if _UNICODE is defined) also draw in <wchar.h>, providing wchar_t, wint_t, wctype_t, the _TCHAR alias for wchar_t, the _T() and _TEXT() macros for wide string literals, the whole crowd of aliases like _tcslen() for wcslen(), etc. I still deem it a more appropriate and parsimonious choice than <windows.h>.

The <winnt.h> header (drawn in by <windows.h>) defines TCHAR (not _TCHAR) and TEXT() (not _T() nor _TEXT()!) when UNICODE (not _UNICODE!) is defined. The LPTSTR etc names provided there (but not in <tchar.h>!) are not used in MD4C anyway.

Actually, with md4c.h using

#ifdef _MSC_VER
#include <tchar.h> /* Define _TCHAR and _T() according to defined(_UNICODE) */
typedef _TCHAR MD_CHAR;
#else
typedef char MD_CHAR;
#endif

and in md4c.c accordingly

#ifndef _T /* Should only happen ifndef _MSC_VER ... */
#define _T(x) x
#endif

the md4c module builds fine (with or without _UNICODE). But md2html.c has of course the same problems as before with the use of char * in membuf_append() and friends.

If including <windows.h> (or <winnt.h>) is still preferred, the symbols UNICODE, TCHAR, and TEXT() should be used instead of _UNICODE, _TCHAR and _T() (resp. _TEXT()), which all come from <tchar.h>.

I think Windows is quite specific in its use of wide strings:

It is in fact, and that's a good reason to support _UNICODE builds (or UNICODE builds, for that matter).

For "the other" C implementations, UTF-8 represented as char arrays is probably enough, since UTF-32 using wchar_t is rarely used in APIs or applications, as you correctly point out.

This leaves (as the third and last case) supporting UTF-8 on Windows, which MD4C already does, if I understand correctly: when MD4C_USE_UNICODE is defined, but not MD4C_WIN_USE_UNICODE resp. the equivalent _UNICODE (or UNICODE).

Gosh, cross-platform Unicode support is really a mess!

mity commented 7 years ago

Then we should likely #include <wchar.h> on Windows. We are not supposed to depend on whether _UNICODE is defined or not.

tin-pot commented 7 years ago

Hmm, I thought on Windows the MD4C_WIN_USE_UNICODE feature macro had for MD4C the same role and purpose as the _UNICODE macro for <tchar.h>: request MD_CHAR to denote UTF-16 code units, just like _UNICODE does for _TCHAR. And maybe switch some string-related functions.

If MD4C ignores _UNICODE (or UNICODE right now, but in this case, why access <windows.h> at all?), it would just replicate the <tchar.h> functionality anyway: either define MD_CHAR to be char and _T() to vanish; or define MD_CHAR to be wchar_t and _T() to prefix the wide string literal indicator L. Just depending on MD4C_WIN_USE_UNICODE.

One can certainly - and easily - do so, and achieve independence from the _UNICODE or UNICODE defines, by including <wchar.h> for typedefs conditionally on MD4C_WIN_USE_UNICODE (but then a different name than _T() for the wide string macro would be more fitting, I'd say ...).

Or heed the _UNICODE define (indirectly, through use of <tchar.h>) and compile accordingly (only on Windows resp. MSVC, of course).

But I see no problem either way.

mity commented 7 years ago

Hmm, I thought on Windows the MD4C_WIN_USE_UNICODE feature macro had for MD4C the same role and purpose as the _UNICODE macro for <tchar.h>: request MD_CHAR to denote UTF-16 code units, just like _UNICODE does for _TCHAR. And maybe switch some string-related functions.

Exactly. But MD4C does it in an independent way. If it would (always) depend on UNICODE, then using <tchar.h> would be the right thing. But I feel some Unicode apps may want MD4C to use UTF-8 (if they want to play with Markdown documents on disk) and others UTF-16 (e.g. if they render whatever the user types in their GUI), and that's why MD4C has an independent knob to enable/disable the UTF-16 logic.

(but then a different name than _T() for the wide string macro would be more fitting, I'd say ...).

Actually, the very early versions depended by default on UNICODE (on Windows), later there was a preprocessor knob to disable it and enforce use of char even if UNICODE is defined. And I eventually found the logic unnecessarily complicated and changed it to the current simple logic based just on MD4C_WIN_USE_UNICODE.

But the (mis)usage of _T in md4c.c (with the redefinition) survived all the changes. So it is a historical relic and maybe it should be renamed to avoid the confusion.

The only advantage _T has is that I am already sort of used to typing it when writing literals, and any other spelling would slow me down :-) But most main features are already done, so the change would not be that painful for me anymore.

tin-pot commented 7 years ago

Exactly. But MD4C does it in an independent way. If it would (always) depend on UNICODE, then using <tchar.h> would be the right thing. But [...]

Ok, I see your point. In this case, including just <wchar.h> and doing the defines as described above should be enough.

The only advantage _T has is that I am already sort of used to typing it when writing literals, and any other spelling would slow me down :-)

Oh, I don't want to risk that! ;-) (Just make sure there is no conflict with the Windows definitions of _T, CHAR, etc.)

Any chance to name the other Unicode feature macro something like MD4C_USE_UTF8 instead of MD4C_USE_UNICODE? Would look a bit more explicit to me.

mity commented 7 years ago

Just make sure there is no conflict with the Windows definitions of _T, CHAR, etc.

_T is guaranteed by the #undef after all the #includes, and as we use it only in that single file, it cannot collide anywhere.

CHAR is a preprocessor macro shadowing (within md4c.c) a typedef from the Win32API if we are on Windows.

I know, it may look a little bit hacky, but these are used so frequently within md4c.c that I could not find any other reasonable (and short) spelling.

Any chance to name the other Unicode feature macro something like MD4C_USE_UTF8 instead of MD4C_USE_UNICODE? Would look a bit more explicit to me.

Yeah. This is a good idea. (That is another historical relic, as originally the logic worked differently.)