Open lazex opened 3 years ago
◯ and ✕ are full-width characters, just like あ.
I don't think that's true. At least, on my terminal the first two take up one space and the third two spaces. And the same is true as it displays above in the code block.
In fact this works just fine!
+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯       | a       |
+---------+---------+---------+
| row3    | ✕       | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+
Check it on try pandoc.
Edit: There is something a bit odd here. In the code block above (as in yours), the pipes on the last line aren't fully lined up. However, they do appear exactly lined up in my text editor. I don't know how to explain that, but what we're aiming for is proper alignment in a text editor.
Let's see what happens if we add an extra space in that last line:
+---------+---------+---------+
| row4    | あ       | a       |
+---------+---------+---------+
That's definitely not lined up. So the slight misalignment in the code block as rendered in the browser seems to be a browser rendering bug of some kind. The browser definitely isn't treating the character as single-wide, but it's not giving it full double width either.
Upshot: not a bug, as far as I can see.
In my environment, they are displayed as full-width characters in text editors such as Windows Notepad and VSCode. Maybe it depends on the locale or the font. My environment is Japanese. The attached screenshot shows the view in Notepad.
OK. That explains it. In https://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt we see
25EF;A # So LARGE CIRCLE
The "A" means "ambiguous." "Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default." So in your locale it is wide.
doclayout is the library we use to compute "real widths" for layout. It currently just treats all ambiguous characters as narrow. I'll move this issue to doclayout as a suggestion for further improvement. (It would require some way to make doclayout's functions locale-sensitive, not a small change.)
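To make the "ambiguous" case concrete, here is a minimal sketch of a width function that takes the context as a parameter. All names here (`WidthContext`, `classify`, `charWidthIn`, `realLengthIn`) are hypothetical and are not doclayout's API, and the classifier only covers the two examples from this thread (U+25EF as ambiguous, the Hiragana letters as wide); everything else is assumed narrow purely for illustration.

```haskell
-- Hypothetical sketch: none of these names are doclayout's actual API.

-- | Rendering context that decides how ambiguous characters are measured.
data WidthContext = NarrowContext | WideContext
  deriving (Eq, Show)

-- | A much-reduced East Asian Width classification.
data WidthClass = Narrow | Wide | Ambiguous
  deriving (Eq, Show)

-- | Toy classifier covering only the characters discussed here; a real
-- table would be generated from EastAsianWidth.txt.
classify :: Char -> WidthClass
classify c
  | c == '\x25EF'                  = Ambiguous  -- LARGE CIRCLE, listed as "A"
  | c >= '\x3041' && c <= '\x3096' = Wide       -- Hiragana letters, e.g. U+3042
  | otherwise                      = Narrow     -- assumption: all else is narrow

-- | Width of one character in terminal columns, given the context.
charWidthIn :: WidthContext -> Char -> Int
charWidthIn ctx c = case classify c of
  Narrow    -> 1
  Wide      -> 2
  Ambiguous -> if ctx == WideContext then 2 else 1

-- | Context-aware counterpart of a "real length" function.
realLengthIn :: WidthContext -> String -> Int
realLengthIn ctx = sum . map (charWidthIn ctx)
```

With this sketch, the large circle measures one column in `NarrowContext` and two in `WideContext`, while あ measures two columns in both.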
@Xitian9 - I believe you mentioned the possibility that this issue would arise!
Ha! That was fast.
I guess the question is how do we accurately and reliably determine the width. If there is surrounding context then it should be straightforward: we can add a context specifier to the `MatchState`. However, in this situation it looks like there would be no surrounding context, just a bare character put into a table. Can we try to guess based on other characters in the column? In the row? Some other way? I sense dangerous creatures this way.
One approach would be to add a function that allows you to locally set the context, such as
withWideContext (literal "◯")
Pandoc could then put the whole document in `withWideContext` if the locale is a wide-character locale. This global setting could be overridden in parts of the document that were marked up as different languages using `withNarrowContext`. Or we could have `withLocale locale`. Just some ideas.
Good idea. Next problem: there are a lot of ambiguous characters in the Unicode spec. There are 198 separate entries (which include ranges) in `EastAsianWidth.txt`.
It is error-prone and tedious to define these ourselves. Maybe we should teach `doclayout` how to read `EastAsianWidth.txt` and generate it itself. This could be done similarly to how the emoji are handled in `emojis`. Thoughts?
It is error-prone and tedious to define these ourselves. Maybe we should teach doclayout how to read EastAsianWidth.txt and generate it itself. This could be done similarly to how the emoji are handled in emojis.
Makes sense to me. (We should use the approach in emojis, where the parsing code isn't part of the library and thus doesn't add dependencies.)
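As a rough illustration of what such a generator script might do, here is a minimal sketch that parses lines in the format quoted above (`25EF;A # So LARGE CIRCLE`, plus ranges written as `XXXX..YYYY`) and collects the ambiguous ranges. The names `Entry`, `parseLine` and `ambiguousRanges` are made up for this example; this is not the code used in `emojis` or `doclayout`.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import Numeric (readHex)

-- | One entry: an inclusive code point range plus its width property
-- ("A", "W", "N", ...). Hypothetical type, for illustration only.
data Entry = Entry { rangeStart :: Int, rangeEnd :: Int, widthProp :: String }
  deriving (Eq, Show)

-- | Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Parse a complete hexadecimal number, rejecting trailing junk.
parseHex :: String -> Maybe Int
parseHex s = case readHex s of
  [(n, "")] -> Just n
  _         -> Nothing

-- | Parse a line such as "25EF;A # So LARGE CIRCLE" or "3041..3096;W # ...".
-- Returns Nothing for blank lines, comment lines, and malformed input.
parseLine :: String -> Maybe Entry
parseLine raw =
  case break (== ';') (trim (takeWhile (/= '#') raw)) of
    (codes, ';' : prop) | not (null (trim codes)) ->
      let (lo, rest) = break (== '.') (trim codes)
          hi = case rest of
                 ""            -> lo   -- single code point
                 '.' : '.' : h -> h    -- range "lo..hi"
                 _             -> ""   -- malformed; parseHex will fail
      in Entry <$> parseHex lo <*> parseHex (trim hi) <*> pure (trim prop)
    _ -> Nothing

-- | All ranges whose property is "A" (ambiguous) in the file contents.
ambiguousRanges :: String -> [(Int, Int)]
ambiguousRanges contents =
  [ (rangeStart e, rangeEnd e)
  | Just e <- map parseLine (lines contents)
  , widthProp e == "A" ]
```

A generator built along these lines could print the resulting ranges as a Haskell source file, so the library itself would not need the parsing code or any extra dependencies.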
@Xitian9 has now provided a context-aware `realLength` function.
Now it remains to figure out how to modify the rest of the library so that it can be used. It's not as easy as I'd originally thought. For example, we have a `literal :: HasChars a => a -> Doc a` which calls `realLength`. How is this going to know which context to use?
One approach would be to change the `Doc a` type so that it's something like `Reader Context (DocT a)`. `literal` could then use `ask` to retrieve the right context. `local` could be used for local changes in the context (wide or narrow) depending on e.g. `lang` attributes. This would probably slow things down somewhat, but I don't currently have other ideas.
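A minimal sketch of the Reader idea, using a plain function from the context as a hand-rolled reader so that it needs nothing beyond base. `CDoc`, `literalC`, `withContext`, `totalWidth` and `toyWidth` are hypothetical names, and the real `Doc` type is far richer than a list of measured strings; this only shows how the context would be asked for at measuring time and overridden locally.

```haskell
-- Hypothetical sketch; not doclayout's types or functions.
data WidthContext = NarrowContext | WideContext
  deriving (Eq, Show)

-- | A document that is only measured once a context is supplied:
-- a function from the context to a list of (width, text) pieces.
newtype CDoc = CDoc { runCDoc :: WidthContext -> [(Int, String)] }

instance Semigroup CDoc where
  CDoc f <> CDoc g = CDoc (\ctx -> f ctx ++ g ctx)

instance Monoid CDoc where
  mempty = CDoc (const [])

-- | Toy width: only U+25EF is treated as ambiguous; everything else is
-- assumed narrow for the purposes of this example.
toyWidth :: WidthContext -> Char -> Int
toyWidth WideContext '\x25EF' = 2
toyWidth _           _        = 1

-- | Plays the role of 'literal': reads the context (like 'ask') to measure.
literalC :: String -> CDoc
literalC s = CDoc (\ctx -> [(sum (map (toyWidth ctx) s), s)])

-- | Plays the role of 'local': fixes the context for a subdocument,
-- ignoring whatever context the enclosing document is rendered in.
withContext :: WidthContext -> CDoc -> CDoc
withContext ctx (CDoc f) = CDoc (const (f ctx))

-- | Total measured width of all pieces in the given outer context.
totalWidth :: WidthContext -> CDoc -> Int
totalWidth ctx d = sum (map fst (runCDoc d ctx))
```

For instance, `literalC "◯" <> withContext NarrowContext (literalC "◯")` measures three columns when rendered in `WideContext`: two for the first circle and one for the locally narrow one.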
The Reader approach would require a lot of changes. Maybe we could do something simpler, e.g. just adding `literalWide`. This would require that the calling program keep track of the context and use `literal` or `literalWide` accordingly.
EDIT: The problem with this approach is that we sometimes use `realLength` again after re-rendering, e.g. in `minOffset` or when stuffing text into a block. Actually, that's a feature of the code I don't like. If there were a way to handle these things without re-rendering, things would go more smoothly (and performance would be better).
To be clearer, the central problem is this: we have
data Doc a = Text Int a     -- ^ Text with specified width.
           | Block Int [a]  -- ^ A block with a width and lines.
           | VFill Int a    -- ^ A vertically expandable block;
                            -- when concatenated with a block, expands to height
and the constructors for `Block` and `VFill` take an `a` rather than a `Doc a` as stuffing. In fact, when we construct a block we render its contents and just store the rendered lines. When we merge two blocks, we can then create a superblock that combines their lines.
The problem is, even if we introduced something like `literalWide`, this contextual information would be lost once things got inside a block, because of things like
-- | Like 'lblock' but aligned to the right.
rblock :: HasChars a => Int -> Doc a -> Doc a
rblock w = block (\s -> replicateChar (w - realLength s) ' ' <> s) w
which makes the block left-padded with spaces depending on the real lengths of the rendered lines.
So we'd need some kind of large-scale design change in order to introduce a way of changing the context from "wide" to "narrow" for part of the rendered document. Probably the most straightforward approach is to change the type of `Block` and `VFill` so they take `Doc a` instead of `a` as stuffing, as well as an explicit horizontal alignment. But that entails a lot of other changes.
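To illustrate why an explicit alignment helps, here is a heavily simplified sketch in which a block keeps its unpadded lines together with an alignment, and the padding is only computed at render time, when the context is known. `PBlock`, `Alignment`, `renderBlock` and `toyWidth` are hypothetical names, plain `String` lines stand in for the nested `Doc a`, and only U+25EF is treated as ambiguous.

```haskell
-- Hypothetical sketch; not doclayout's Block constructor.
data WidthContext = NarrowContext | WideContext
  deriving (Eq, Show)

data Alignment = AlignLeft | AlignRight
  deriving (Eq, Show)

-- | A block that stores its content unpadded, plus how to align it.
data PBlock = PBlock
  { blockWidth :: Int        -- ^ target width in columns
  , blockAlign :: Alignment  -- ^ where the padding goes
  , blockLines :: [String]   -- ^ unpadded lines (stand-in for nested docs)
  }

-- | Toy width: only U+25EF is treated as ambiguous; everything else is
-- assumed narrow for the purposes of this example.
toyWidth :: WidthContext -> Char -> Int
toyWidth WideContext '\x25EF' = 2
toyWidth _           _        = 1

-- | Pad each line to the block width using the width rules of the
-- context supplied at render time, not at construction time.
renderBlock :: WidthContext -> PBlock -> [String]
renderBlock ctx (PBlock w align ls) = map pad ls
  where
    pad l =
      let fill = replicate (max 0 (w - sum (map (toyWidth ctx) l))) ' '
      in case align of
           AlignLeft  -> l ++ fill
           AlignRight -> fill ++ l
```

In this sketch a right-aligned block of width 4 containing a single ◯ gets three spaces of padding in `NarrowContext` and two in `WideContext`, which is the information that is lost when padding is baked in at construction time.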
This is my source markdown.
I got the following result:
There is a problem on the next line.
and
These results include the `|` character.
I can modify the source markdown to get the expected result as follows.
However, it is not beautiful.
I think full-width characters are being misjudged as half-width.
◯ and ✕ are full-width characters, just like あ.
Command line
Version