macmillanpublishers / htmlmaker_js

Repo for testing and finalizing htmlmaker javascript implementation

[6] Htmlmaker_js is not preserving the © (copyright) symbol in some manuscripts #43

Closed mattretzer closed 6 years ago

mattretzer commented 7 years ago

In the xsl html conversion, this symbol appears in the layout html as: ©. In the htmlmaker_js conversion it is missing entirely. This happened in all of the Swerve tests and most of the Flatiron ones. Example file: https://www.dropbox.com/s/bofkpieghfybijf/9781250109774_MNU_NEW.docx?dl=0

NOTE: the copyright symbol in the tor.com manuscripts WAS included in the htmlmaker_js conversions. The copyright character in those manuscripts looks slightly different, so it is probably stored as a different character code, though it ends up in the html with the same © encoding.

mattretzer commented 7 years ago

Just a progress note on this-- here's what one of these omitted copyright symbols looks like in xml:

Copyright </w:t></w:r><w:r w:rsidRPr="00A84764"><w:sym w:font="Symbol" w:char="F0D3"/></w:r><w:r w:rsidRPr="00A84764"><w:t xml:space="preserve"> 2016 by Christina Saunders

The key bit is the <w:sym w:font="Symbol" w:char="F0D3"/>

As referenced here, this is some sort of Word symbol encoding. I've tried outputting everything from both htmlmaker_js and the mammoth version of the same, and this char seems to be bypassed entirely, so I can't yet get at it the same way I did the soft line break <w:br> encoding.

mattretzer commented 7 years ago

More info on this: as per this mammoth issue, w:sym items are symbols, and are ignored by mammoth when parsing. So there's no straightforward way to do anything with them within htmlmaker_js, and after htmlmaker_js, they are gone from the html.
We could look at parsing the xml in htmlpreprocessing and just inserting a proper UTF-8 copyright symbol (or some sort of placeholder for this symbol, so that when it gets removed during htmlmaker we know where it was). Waiting to discuss next step with Nellie.
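To make the "insert a proper copyright symbol" idea concrete, here is a minimal sketch that rewrites the raw document.xml string before conversion. The function name `replaceCopyrightSym` is hypothetical (it is not part of htmlmaker_js or mammoth), and it assumes that swapping the `w:sym` element for a `w:t` text element inside the same run is acceptable to the downstream converter.

```javascript
// Hypothetical preprocessing helper, not an existing htmlmaker_js function.
// Replaces Symbol-font copyright glyphs (w:char="F0D3") with a text
// element holding the Unicode copyright sign (U+00A9).
function replaceCopyrightSym(documentXml) {
  // Match every self-closing <w:sym .../> element and capture its attributes.
  const symTag = /<w:sym\b([^>]*?)\/>/g;
  return documentXml.replace(symTag, (match, attrs) => {
    const font = /w:font="([^"]*)"/.exec(attrs);
    const char = /w:char="([^"]*)"/.exec(attrs);
    const isCopyright =
      font !== null && font[1] === "Symbol" &&
      char !== null && char[1].toUpperCase() === "F0D3";
    // Leave any other symbol untouched.
    return isCopyright ? "<w:t>\u00A9</w:t>" : match;
  });
}

// Usage with the run from the example manuscript:
const sampleRun = '<w:r w:rsidRPr="00A84764"><w:sym w:font="Symbol" w:char="F0D3"/></w:r>';
console.log(replaceCopyrightSym(sampleRun));
```

The attributes are read individually rather than matched as one fixed string, so the order of `w:font` and `w:char` does not matter.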

ericawarren commented 7 years ago

@mattretzer possible VBA fix + more info...

Quick fix: I put together this VBA thing that fixes those symbols; we could add it to the Character Styles macro (or even add it to every macro) that we roll out with the new section-start styles. Also, there are actually TWO copyright symbols in Symbol (serif and sans serif), but I could fix both.

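  Sub FixSymbolCopyright()
    ' Replace Symbol-font copyright glyphs (w:char F0D3 = decimal 61651)
    ' with the standard copyright sign (character number 169).
    Dim strCurrentFont As String
    Selection.HomeKey Unit:=wdStory
    With Selection.Find
      .ClearFormatting
      .Text = "^u61651"
      .Forward = True
      .Wrap = wdFindStop   ' stop at the end of the document instead of wrapping

      Do While .Execute
        ' Name of the paragraph style at the match; passed as Font below
        strCurrentFont = Selection.Range.ParagraphStyle
        ' Ask the Insert Symbol dialog which font the found character uses
        If Dialogs(wdDialogInsertSymbol).Font = "Symbol" Then
          Selection.InsertSymbol CharacterNumber:=169, Font:=strCurrentFont, Unicode:=True
        Else
          Selection.Collapse wdCollapseEnd
        End If
      Loop
    End With
  End Sub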

Longer fix: The Symbol font in Word uses the Unicode Private Use Area, and there are actually a bunch of characters in this font that have equivalent HTML entities. If we wanted to be more thorough, we could search the whole range and replace known symbols with their correct Unicode characters (we'd need to create something that maps the Symbol code to the correct Unicode character), and for anything else found in that range we could generate an error for the user telling them that their symbol isn't supported in Bookmaker, or to contact us, or whatever.
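As a rough illustration of that mapping idea, here is a sketch in JavaScript rather than VBA. The table `SYMBOL_TO_UNICODE` and the function `convertSymbolChars` are hypothetical names, the table is deliberately partial, and its entries reflect my reading of the Symbol font encoding, so they should be checked against real manuscripts before anyone relies on them. The scanned range U+F020 to U+F0FF is also an assumption about where Word places Symbol-font characters.

```javascript
// Assumed, partial mapping from Symbol-font codes (as seen in w:char) to Unicode.
const SYMBOL_TO_UNICODE = {
  F0D2: "\u00AE", // registered sign, serif
  F0D3: "\u00A9", // copyright sign, serif
  F0D4: "\u2122", // trademark sign, serif
  F0E2: "\u00AE", // registered sign, sans serif
  F0E3: "\u00A9", // copyright sign, sans serif
  F0E4: "\u2122", // trademark sign, sans serif
};

// Replace known Private Use Area characters and collect the unknown ones
// so the caller can report them to the user.
function convertSymbolChars(text) {
  const unsupported = [];
  const converted = text.replace(/[\uF020-\uF0FF]/g, (ch) => {
    const code = ch.charCodeAt(0).toString(16).toUpperCase();
    if (Object.prototype.hasOwnProperty.call(SYMBOL_TO_UNICODE, code)) {
      return SYMBOL_TO_UNICODE[code];
    }
    unsupported.push(code); // left in place, flagged for an error message
    return ch;
  });
  return { converted, unsupported };
}

// Usage: one known symbol and one that is not in the table.
const symbolResult = convertSymbolChars("Copyright \uF0D3 2016 \uF041");
console.log(symbolResult.converted, symbolResult.unsupported);
```

Returning the unsupported codes instead of throwing lets the caller decide whether to warn, fail the conversion, or just log.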

Might be overkill if this is something that we can include in the future validator/fixer/reporter, though. @nelliemckesson can weigh in when she's back.

mattretzer commented 7 years ago

Thanks @ericawarren , looks great!

Reviewing options to fix this issue for future discussion with Nellie:
1) Reading the xml and capturing a search-string around the w:sym. This was my first approach. Now that I have it, it may be too flawed: the matching is either too general or, if I make it more precise, inefficient and still potentially inexact.
2) Macros are an alternate way to approach this, and probably the most straightforward for now.
3) Opening the .docx file pre-conversion, inserting placeholder tags (or the correct symbol!) into the document.xml to be updated later, and re-zipping the .docx.
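For option 3, the placeholder round trip might look something like the sketch below. It only covers the string handling; unzipping and re-zipping the .docx is left out. The names `markSymbolChars` and `restoreSymbolChars` and the `[[SYM_xxxx]]` token format are all hypothetical, and the sketch assumes the converter passes plain text inside `w:t` through to the html unchanged.

```javascript
// Step 1 (pre-conversion): swap each Symbol-font <w:sym/> for a text
// placeholder that records the original w:char code.
function markSymbolChars(documentXml) {
  return documentXml.replace(/<w:sym\b([^>]*?)\/>/g, (match, attrs) => {
    const font = /w:font="([^"]*)"/.exec(attrs);
    const char = /w:char="([0-9A-Fa-f]{4})"/.exec(attrs);
    if (font === null || char === null || font[1] !== "Symbol") {
      return match; // not a Symbol-font character, leave it alone
    }
    return `<w:t>[[SYM_${char[1].toUpperCase()}]]</w:t>`;
  });
}

// Step 2 (post-conversion): turn placeholders in the html back into real
// characters, using a caller-supplied table such as { F0D3: "\u00A9" }.
function restoreSymbolChars(html, table) {
  return html.replace(/\[\[SYM_([0-9A-F]{4})\]\]/g, (match, code) =>
    Object.prototype.hasOwnProperty.call(table, code) ? table[code] : match
  );
}

// Usage: mark the xml, then restore in whatever html the conversion emits.
const markedXml = markSymbolChars('<w:r><w:sym w:font="Symbol" w:char="F0D3"/></w:r>');
console.log(markedXml);
console.log(restoreSymbolChars("<p>Copyright [[SYM_F0D3]] 2016</p>", { F0D3: "\u00A9" }));
```

Placeholders with no table entry stay visible in the html, which makes unsupported symbols easy to spot in review.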

MacmillanWorkflows commented 7 years ago

➤ Matthew Retzer commented:

Just putting this here... https://github.com/macmillanpublishers/bookmaker_validator/issues/82 So we can see if we have the same / similar issue with this odd character from that issue.

MacmillanWorkflows commented 7 years ago

➤ Matthew Retzer commented:

4-6 hrs

MacmillanWorkflows commented 6 years ago

➤ Matthew Retzer commented:

Two PRs for this for Luigi to review: https://github.com/macmillanpublishers/bookmaker_addons/pull/182

and this little one: https://github.com/macmillanpublishers/sectionstart_converter/pull/10

MacmillanWorkflows commented 6 years ago

➤ luigi.squillante commented:

both approved

MacmillanWorkflows commented 6 years ago

➤ Matthew Retzer commented:

merged