mattretzer closed this issue 6 years ago
Just a progress note on this: here's what one of these omitted copyright symbols looks like in the XML:
Copyright </w:t></w:r><w:r w:rsidRPr="00A84764"><w:sym w:font="Symbol" w:char="F0D3"/></w:r><w:r w:rsidRPr="00A84764"><w:t xml:space="preserve"> 2016 by Christina Saunders
The key bit is the <w:sym w:font="Symbol" w:char="F0D3"/> element.
As referenced here, this is some sort of Word symbol encoding. I've tried outputting everything from both htmlmaker(js) and the mammoth version of the same, and this character seems to be bypassed entirely, so thus far I can't get at it the same way I did with the soft line break (<w:br>) encoding.
More info on this: as per this mammoth issue, w:sym items are symbols, and are ignored by mammoth when parsing. So there's no straightforward way to do anything with them within htmlmaker_js, and after htmlmaker_js, they are gone from the html.
We could look at parsing the XML in htmlpreprocessing and inserting a proper UTF-8 copyright symbol (or some sort of placeholder for it, so that when the symbol gets removed during htmlmaker we know where it was).
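A minimal sketch of that XML-level replacement, assuming the document.xml content is already available as a string. The function name `replace_symbol_copyright` and the choice to swap in a `<w:t>` element are illustrative assumptions, not what any of the existing tools do.

```python
import re

# Matches a self-closing <w:sym .../> element whose font is Symbol and whose
# char is F0D3 or F0E3 (the serif and sans serif copyright glyphs), with the
# two attributes in either order.
SYM_COPYRIGHT = re.compile(
    r'<w:sym\b(?=[^>]*\bw:font="Symbol")(?=[^>]*\bw:char="F0[DE]3")[^>]*/>',
    re.IGNORECASE,
)


def replace_symbol_copyright(document_xml: str) -> str:
    """Swap Symbol-font copyright glyphs for a text element holding U+00A9."""
    return SYM_COPYRIGHT.sub("<w:t>\u00a9</w:t>", document_xml)
```

Because the `<w:sym>` sits inside its own run (`<w:r>`), replacing it with a `<w:t>` element keeps the run well-formed. A regex is used here only for brevity; a real XML parser would be more robust.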
Waiting to discuss next step with Nellie.
@mattretzer possible VBA fix + more info...
Quick fix: I put together this VBA snippet that fixes those symbols; we could add it to the Character Styles macro (or even add it to every macro) that we roll out with the new section-start styles. Also, there are actually TWO copyright symbols in Symbol (serif and sans serif), but I could fix both.
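Sub ReplaceSymbolCopyright()
    Dim strCurrentFont As String
    ' Start the search from the top of the document
    Selection.HomeKey Unit:=wdStory
    With Selection.Find
        .ClearFormatting
        ' ^u61651 is decimal for U+F0D3, the serif copyright sign in Symbol
        .Text = "^u61651"
        Do While .Execute
            ' Holds the paragraph style name at the match; passed as Font below
            strCurrentFont = Selection.Range.ParagraphStyle
            ' Only replace the match if it is actually set in the Symbol font
            If Dialogs(wdDialogInsertSymbol).Font = "Symbol" Then
                ' 169 is the standard Unicode copyright sign (U+00A9)
                Selection.InsertSymbol CharacterNumber:=169, Font:=strCurrentFont, Unicode:=True
            Else
                Selection.Collapse wdCollapseEnd
            End If
        Loop
    End With
End Sub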
Longer fix: The Symbol font in Word uses the Unicode Private Use Area, and there are actually a bunch of characters in this font that have equivalent HTML entities. If we wanted to be more thorough, we could search the whole range and replace known symbols with their correct Unicode characters (we'd need to create something that maps the Symbol code to the correct Unicode character), and for anything else found in that range we could generate an error for the user telling them that their symbol isn't supported in Bookmaker, or to contact us, or whatever.
Might be overkill if this is something that we can include in the future validator/fixer/reporter, though. @nelliemckesson can weigh in when she's back.
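As a rough illustration of the mapping idea, here is a sketch that assumes the `w:char` values fall in the F020 to F0FF Private Use range, with the low byte being the Symbol-font code. The map is deliberately partial, and the names `SYMBOL_TO_UNICODE`, `UnsupportedSymbolError` and `translate_symbol_char` are hypothetical.

```python
# Partial, illustrative map from Symbol-font codes (the low byte of the
# F0xx Private Use Area value) to standard Unicode characters.
SYMBOL_TO_UNICODE = {
    0xD2: "\u00ae",  # registered sign (serif)
    0xD3: "\u00a9",  # copyright sign (serif)
    0xD4: "\u2122",  # trademark sign (serif)
    0xE2: "\u00ae",  # registered sign (sans serif)
    0xE3: "\u00a9",  # copyright sign (sans serif)
    0xE4: "\u2122",  # trademark sign (sans serif)
}


class UnsupportedSymbolError(ValueError):
    """Raised for a Symbol-font character with no known Unicode equivalent."""


def translate_symbol_char(w_char: str) -> str:
    """Translate a w:char hex string such as 'F0D3' to a Unicode character."""
    code = int(w_char, 16)
    if not 0xF020 <= code <= 0xF0FF:
        raise ValueError(f"{w_char} is outside the assumed Symbol range")
    try:
        return SYMBOL_TO_UNICODE[code & 0xFF]
    except KeyError:
        # This is where a user-facing "symbol not supported" report would go.
        raise UnsupportedSymbolError(
            f"Symbol character {w_char} is not supported"
        ) from None
```

For example, `translate_symbol_char("F0D3")` and `translate_symbol_char("F0E3")` both return the standard copyright sign, while an unmapped code raises `UnsupportedSymbolError`.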
Thanks @ericawarren, looks great!
Reviewing options to fix this issue, for future discussion with Nellie:
1) Reading the XML and capturing a search string around the w:sym. This was my first approach. Now that I have it, it may be too flawed: the matching is either too general or, if I make it more precise, inefficient and still potentially inexact.
2) Macros are an alternative approach, and probably the most straightforward for now.
3) Opening the .docx file pre-conversion, inserting placeholder tags (or the correct symbol!) into the document.xml to be updated later, and re-zipping the .docx.
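Option 3 could look roughly like the sketch below. It relies on a .docx being a zip archive with the main body in word/document.xml; the `[[COPYRIGHT_SYMBOL]]` placeholder text and the function name `patch_docx` are made up for illustration, and the exact-string match covers only the serif glyph seen in this issue.

```python
import zipfile

# The exact element seen in the problem manuscripts (serif copyright glyph).
SYM = '<w:sym w:font="Symbol" w:char="F0D3"/>'
# Hypothetical placeholder to be swapped for the real symbol later on.
PLACEHOLDER = "<w:t>[[COPYRIGHT_SYMBOL]]</w:t>"


def patch_docx(src_path: str, dst_path: str) -> None:
    """Copy a .docx, replacing the Symbol copyright element in document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                data = xml.replace(SYM, PLACEHOLDER).encode("utf-8")
            # Every other archive member is copied through unchanged.
            dst.writestr(item, data)
```

Writing to a separate output path leaves the original manuscript untouched, which seems safer for a pre-conversion step.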
➤ Matthew Retzer commented:
Just putting this here... https://github.com/macmillanpublishers/bookmaker_validator/issues/82 So we can see if we have the same / similar issue with this odd character from that issue.
➤ Matthew Retzer commented:
4-6 hrs
➤ Matthew Retzer commented:
Two PRs for this for Luigi to review: https://github.com/macmillanpublishers/bookmaker_addons/pull/182
and this little one: https://github.com/macmillanpublishers/sectionstart_converter/pull/10
➤ luigi.squillante commented:
both approved
➤ Matthew Retzer commented:
merged
In the xsl html conversion, this symbol is translated in the layout html as ©. In the htmlmaker_js conversion it's just missing. This happened in all of the Swerve tests and most of the Flatiron ones. Example file: https://www.dropbox.com/s/bofkpieghfybijf/9781250109774_MNU_NEW.docx?dl=0
NOTE: the copyright symbol in the tor.com manuscripts WAS included in the htmlmaker_js conversions; the copyright character in those manuscripts is visibly slightly different, so it is probably a different character, though it ends up in the html with the same © encoding.