kakwa / libemf2svg

Microsoft (MS) EMF to SVG conversion library
GNU General Public License v2.0

Wrong font #11

Closed ofirdev closed 7 years ago

ofirdev commented 8 years ago

I have an EMF file that displays properly in Office but renders with the wrong font in libemf2svg. Tested with libemf2svg master on Ubuntu 16.04.

When I export to PDF from Office and run "pdffonts emf.pdf" I'm getting:

name            type          encoding    emb sub uni object ID
ArialMT         CID TrueType  Identity-H  yes no  yes  7 0
ABCDEE+Calibri  TrueType      WinAnsi     yes yes no  14 0

Is it possible that the font substitution is wrong because it doesn't know the ABCDEE+Calibri or ArialMT fonts? How does libemf2svg know how to substitute fonts? Can I extend it to test the above fonts? What tools can I use to generate EMF files and try to reproduce this issue?

ofirdev commented 8 years ago

I can reproduce by simply saving a slide with Hebrew characters in PowerPoint as Enhanced Windows Metafile.

EMF file. The Second line should show "אבגד" in Hebrew: https://drive.google.com/open?id=0B0odzw1WMqkGRVc4V3pvSjNWeVU

PDF exported from PowerPoint: https://drive.google.com/open?id=0B0odzw1WMqkGc2x2Y0V5VHFIQjQ

kakwa commented 8 years ago

Thank you for the report.

There are many issues with text handling right now.

The issue with the Hebrew text in your example is more likely an encoding issue (the UTF-16LE strings in EMF files are converted to UTF-8 inside the SVG, but something is probably going wrong).

There are also some issues on text positioning:

ofirdev commented 8 years ago

I've added a debug message to uemf_utf.c before this line: https://github.com/kakwa/libemf2svg/blob/master/deps/libUEMF-0.1.17/uemf_utf.c#L519

// add #include <inttypes.h>

printf("0x%04" PRIx16 "\n", *src);
printf("%s\n", dst2);

An EMF file with just the letter א gives me:

0x02a0
ʠ

The U_Utf16leToUtf8 conversion from UTF-16 to UTF-8 is correct, but its input is wrong. Why does it receive 0x02a0 instead of 0x05d0, the actual UTF-16 code unit for the Hebrew letter aleph?
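To make the two values concrete, here is a minimal sketch of the UTF-8 encoding step for a single UTF-16 code unit. The function name is hypothetical and this is not the libUEMF implementation; it only handles code units in the Basic Multilingual Plane and ignores surrogate pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libUEMF): encode one UTF-16 code unit
 * from the Basic Multilingual Plane as UTF-8. Surrogates are not handled.
 * Writes up to 3 bytes to out and returns the number of bytes written. */
static size_t bmp_unit_to_utf8(uint16_t unit, unsigned char out[3])
{
    if (unit < 0x80) {                 /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)unit;
        return 1;
    }
    if (unit < 0x800) {                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (unit >> 6));
        out[1] = (unsigned char)(0x80 | (unit & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (unit >> 12));
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));
    return 3;
}
```

With this sketch, 0x05d0 encodes to the bytes D7 90 (א) and 0x02a0 encodes to CA A0 (ʠ), which is consistent with the conversion being correct for the value it was given.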

EMF file with just the letter א: https://drive.google.com/open?id=0B0odzw1WMqkGMTFYMnkzRGw1VUk

ofirdev commented 8 years ago

The problematic string has the Hebrew charset, but the charset is currently ignored: https://github.com/kakwa/libemf2svg/blob/master/src/lib/emf2svg_rec_object_creation.c#L207 I've added a debug message above line 207:

printf("charset: %d\n", logfont.lfCharSet);

The charset is 177 (0xB1), which is HEBREW_CHARSET and so probably Windows-1255 (or ISO 8859-8?): https://github.com/kakwa/libemf2svg/blob/df60aef823834037493579bc1988f3e34bd6727f/deps/libUEMF-0.1.17/uemf.h#L1358 https://msdn.microsoft.com/en-us/library/cc250412.aspx

How can we encode according to the charset?
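One possible starting point for that question is a lookup from lfCharSet to a Windows code page, which a converter such as iconv could then use. The sketch below is an assumption, not libemf2svg code: the function name is hypothetical, and the table lists only a few charset values with their usual Windows code pages. Later comments in this thread find that the encoding was not the cause of this particular problem.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libemf2svg or libUEMF: map a LOGFONT
 * lfCharSet value to the Windows code page usually associated with it.
 * Only a handful of charsets are listed; 0 means "not in this table". */
static int charset_to_codepage(uint8_t lfCharSet)
{
    switch (lfCharSet) {
    case 0:   return 1252; /* ANSI_CHARSET */
    case 161: return 1253; /* GREEK_CHARSET */
    case 162: return 1254; /* TURKISH_CHARSET */
    case 177: return 1255; /* HEBREW_CHARSET */
    case 178: return 1256; /* ARABIC_CHARSET */
    case 204: return 1251; /* RUSSIAN_CHARSET */
    default:  return 0;    /* unknown: the caller decides on a fallback */
    }
}
```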

kakwa commented 8 years ago

Nice catch

I will try to work on it this week.

The general idea would be to:

The first two are done in https://github.com/kakwa/libemf2svg/commit/14e31fc251802fdc539a59ea0fa03e5036046ec4 (it's the simple part ^^).

Could I include the emf files you've provided in the pool of test emf files?

ofirdev commented 8 years ago

Thank you for the update.

Text direction should probably be handled too. Without it, we'll get text in reverse order.

Yes, you can include the emf files.

ofirdev commented 7 years ago

Did you have a chance to work on this?

kakwa commented 7 years ago

No, I haven't found the time to work on it, sorry.

kakwa commented 7 years ago

Hello,

I finally had some time to investigate the issue.

I spent a few days playing around with the encoding (even trying every conversion possible in libiconv, including Windows-1255), but that wasn't it. I also tried altering the charset in EXTCREATEFONTINDIRECTW, but it didn't have any effect.

Today I finally realized what was happening exactly:

In the U_EMR_EXTTEXTOUTW record containing the Hebrew text, the options field is set to fOptions: 0x00000010. From the EMF specification, section 2.1.11, the constant definition is:

ETO_GLYPH_INDEX = 0x00000010,

Description:

ETO_GLYPH_INDEX: This bit indicates that the codes for characters in an output text string
are actually indexes of the character glyphs in a TrueType font. Glyph indexes are font-
specific, so to display the correct characters on playback, the font that is used MUST be
identical to the font used to generate the indexes.
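A minimal sketch of testing that bit, assuming the fOptions field has already been read into a 32-bit integer. The macro value is the one quoted from the specification above; the function name is hypothetical and not taken from libemf2svg or libUEMF.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value quoted from the EMF specification excerpt above. */
#define ETO_GLYPH_INDEX 0x00000010u

/* Hypothetical helper: returns true when the text of an EXTTEXTOUTW record
 * holds font-specific glyph indexes rather than UTF-16 character codes. */
static bool text_is_glyph_indexes(uint32_t fOptions)
{
    return (fOptions & ETO_GLYPH_INDEX) != 0;
}
```

When this returns true, the 16-bit values in the string cannot be passed to a UTF-16 to UTF-8 conversion as-is; they first have to be mapped back to characters using the font that produced them.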

And indeed, the first character, the alef, which is coded as 0x02a0 (672 in decimal), is at index 672 in Arial.ttf (or at least in the arial.ttf version I have). Keep in mind that it's a really really really crappy encoding scheme, as inserting a new glyph in the middle of the font can break text display.

I'm not sure yet how I will solve this issue. The basic idea is to use the cmap table of the TrueType font to decode the string (this table gives the mapping between glyph indexes and their corresponding codes in a known encoding).

However, there are some small issues:

What I will probably do is provide some static default mappings for well-known fonts in emf2svg, and maybe add an option to dynamically load the cmap tables of TTF fonts.
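The static mapping idea could look roughly like the sketch below. It is an assumption about the general shape, not the code that ended up in libemf2svg: the struct, table and function names are hypothetical, and the table holds only the one pair reported in this thread (glyph 672 maps to U+05D0 in one version of arial.ttf). A real table would be generated from the font's cmap table.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical static reverse mapping: glyph index -> Unicode. */
struct glyph_map_entry {
    uint16_t glyph;      /* glyph index as stored in the EMF text string */
    uint16_t codepoint;  /* corresponding Unicode code point (BMP only) */
};

/* Sample data: the single pair observed in this thread for one arial.ttf. */
static const struct glyph_map_entry arial_sample[] = {
    { 672, 0x05D0 },  /* alef */
};

/* Look up a glyph index with a linear search; returns U+FFFD (replacement
 * character) when the glyph is not in the table. */
static uint16_t glyph_to_codepoint(const struct glyph_map_entry *map,
                                   size_t count, uint16_t glyph)
{
    for (size_t i = 0; i < count; i++) {
        if (map[i].glyph == glyph)
            return map[i].codepoint;
    }
    return 0xFFFD;
}
```

For example, glyph_to_codepoint(arial_sample, 1, 0x02a0) yields 0x05D0, which can then go through the normal UTF-16 to UTF-8 conversion. Because glyph indexes are font-specific, a table built from a different version of the font may give different results.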

ofirdev commented 7 years ago

Thank you for looking into it.

MS released some of the Windows fonts. They can be installed with the ttf-mscorefonts-installer package under Ubuntu. The fonts-liberation project provides fonts that are metrically equivalent to Times, Arial and Courier. Carlito is a replacement for Calibri. Caladea is a replacement for Cambria.

Can you add mappings for other popular Windows fonts that are not part of mscorefonts? For example, the David font.

LibreOffice devs are working on improved EMF support. Maybe they already have mappings for popular fonts?

kakwa commented 7 years ago

Hello,

There is still a lot of cleanup to do, and some fonts to add, but I think I basically got things working.

However, as expected, there is an ordering issue (right to left, left to right), and it might be a little annoying to solve, as the EMF format is a bit of a mess in that regard (right to left can be set in 3 different places...), and the options regarding ordering don't seem consistent between SVG rendering implementations.

ofirdev commented 7 years ago

Impressive work. Tested the original EMF file and I'm getting Hebrew characters (in reverse order as you said).

Is it possible to add mappings for additional fonts like David?

kakwa commented 7 years ago

Hello,

It's a bit "brute force", but I've added reverse mappings for a bunch of other fonts: https://raw.githubusercontent.com/kakwa/libemf2svg/master/inc/font_mapping.c

This includes the David font.

I still have the glyph order to fix, however.

ofirdev commented 7 years ago

A simple EMF file with David font works great.

When using Hebrew Nikud, the dots in the third line are under the wrong letter. It might be because of the glyph order: https://drive.google.com/open?id=0B0odzw1WMqkGQzZpamhrRU5QSHc

kakwa commented 7 years ago

Hello, ordering seems OK now.

There are probably some mistakes in the overall logic, however (I need to build some test files to understand how EMF behaves with regard to RTL flags and font charsets).

But at least for now, for the real life EMF files I've encountered, it seems good enough.

ofirdev commented 7 years ago

Tested with real files and got very good results.

I've found two edge cases: the order of parentheses, and a specific missing glyph. I'll open separate issues with test files.

Thanks