Add HtmlToXliff, XliffToHtml, and HtmlToXliffTests projects

StephenMcConnel commented 7 years ago

Despite the name, the test project covers both programs.

This change is

JohnThomson commented 7 years ago

This is so huge it's very hard to review. A few comments but most are just ideas to consider in future projects, not worth changing this. One high-level thought: I'm reconsidering the idea of making this just a part of L10NSharp. You've created a major piece of work implementing both directions of the recommendations of the Oasis XLIFF 1.2 Representation Guide for HTML, and I would think it very useful to many people quite independent of L10NSharp. I think it deserves its own repo, and it's probably worth an effort to inform the committee responsible for the document; they might even want to refer to it as a possible or even reference implementation.

Reviewed 16 of 16 files at r1. Review status: all files reviewed at latest revision, 10 unresolved discussions.

src/HtmlToXliff/HtmlToXliffConverter.cs, line 35 at r1 (raw file):

      /// solution.  But I'd rather use a NuGet package than clutter up our repository with borrowed code.
      /// </remarks>
      public static void FixHtmlParserBug()

Could this just be static HtmlToXLiffConverter()? in which case it is called before anything is done with the class. But maybe not guaranteed to happen before anything is done with HtmlDocuments?

src/HtmlToXliff/HtmlToXliffConverter.cs, line 85 at r1 (raw file):

              return _xliffDoc;
          }
          var group1 = _xliffDoc.CreateElement("group");

Maybe 'groupForHtml'? and then groupForHead and groupForBody, etc?

src/HtmlToXliff/HtmlToXliffConverter.cs, line 107 at r1 (raw file):

      }

      private void ProcessHtmlElement(XmlElement parentXliffElement, HtmlNode htmlElement)

It's really a parentHtmlElement, too, isn't it? That is, it's the html that corresponds to parentXliffElement, and now we're going to generate children for parentXliffElement from the children of htmlElement?

src/HtmlToXliff/Program.cs, line 20 at r1 (raw file):

          var outfile = Path.ChangeExtension(infile, "xlf");

          HtmlToXliffConverter.FixHtmlParserBug();    // call before loading any HtmlDocument!

If it should always be called, can a static constructor do it?

src/HtmlToXliffTests/TestHtmlToXliff.cs, line 47 at r1 (raw file):

  <p>All rights reserved (c) Gandalf Inc.</p>
 </body>
</html>");

I think it would aid readability if you could insert a comment that actually gives the expected output. I suppose it is more maintainable to do all these asserts rather than just specifying the total output file contents as a string? Of course there could be lots of variants (e.g., whitespace, order of attributes and maybe elements) that would not matter...

src/HtmlToXliffTests/TestHtmlToXliff.cs, line 63 at r1 (raw file):

          Assert.IsNotNull(bodyX);
          Assert.AreEqual(0, bodyX.Attributes.Count);
          Assert.LessOrEqual(1, bodyX.ChildNodes.Count);

Surely we expect exactly one? If the body has no children what happened to our strings that need translation?

src/HtmlToXliffTests/TestHtmlToXliff.cs, line 90 at r1 (raw file):

                          Assert.AreEqual("genid-1", n2.Attributes["id"].Value);
                          Assert.AreEqual("x-html-h1", n2.Attributes["restype"].Value);
                          Assert.AreEqual("title", n2.Attributes["class", HtmlToXliffConverter.kHtmlNamespace].Value);

I'd be tempted to create a method like AssertNameAndAttrs(XmlNode node, string name, params string[] attValPairs) that would let you just AssertNameAndAttrs(n2, "trans-unit", "id", "genid-1", "restype", "x-html-h1", "class", "title").

src/HtmlToXliffTests/TestHtmlToXliff.cs, line 114 at r1 (raw file):

                              Assert.AreEqual("row", n3.Attributes["restype"].Value);
                              int count4 = 0;
                              switch (count3)

Why do we want these increment/case things instead of (outside the foreach) n3_1 = n2.ChildNodes[0]...n3_2 = n2.ChildNodes[1]...?

src/XliffToHtml/XliffToHtmlConverter.cs, line 77 at r1 (raw file):

              else if (node.NodeType == XmlNodeType.Element)
              {
                  Debug.WriteLine("DEBUG: ProcessGroupElement() found something other than group or trans-unit: " + node.Name);

If we will use release builds in our eventual build scripts that use this, seems we should not limit warnings like this.

src/XliffToHtml/XliffToHtmlConverter.cs, line 109 at r1 (raw file):

                  htmlElement.AppendChild(tNode);
              }
              else if (node.Name == "g" || node.Name == "x" || node.Name == "ph")

It's not obvious why we have these three types, since as far as I can see the same output is produced for g and x (and sometimes ph? Can you ever get an element name from ph? The inverse code made it seem ph was always used for translatable attributes...). Maybe it's somewhere in the oasis document; I didn't read everything. But a brief comment outlining (or giving an example) of the main kinds of thing that can occur in a target and how they convert might be helpful.

Comments from Reviewable

hatton commented 7 years ago

I'm not able to review at this time but I'm concerned at what the approach is, based on comments written here: https://docs.google.com/document/d/1UsjVD2gnwPbRhw-j5qahjLW_aKywdU5cKE4LDF1Ks7o/edit?disco=AAAABFJyzbc&ts=59838778

My expectation was that we extract a normal, simple dictionary. Comments there suggest an attempt to instead encode documents in xliff.

StephenMcConnel commented 7 years ago

This appears to not be what we need for Bloom. So I'm closing the PR.

sillsdev / l10nsharp

Add HtmlToXliff, XliffToHtml, and HtmlToXliffTests projects #21