Put support for structured annotations back in.

Klortho commented 12 years ago

In going from contextmodel → datadictionary, we lost support for structured comment annotations inside the DTD. This feature needs to be put back in.

Klortho commented 12 years ago

See contextmodel/src/gov/ncbi/pmc/dtdanalyzer/ElementModelManager.java, lines 355ff, for how they used to be handled; and .../dtdanalyzer/DTDEventHandler.java, line 286 (the comment() method) for where the new code should go.

Use test/split-example.dtd for testing.

Klortho commented 12 years ago

I got most of the way done with this. The output of the split-example looks very close to the split-mockup.daz.xml that I added last week. Here are some loose ends, as well as a few changes I would like to make to the output format:

Need to add Markdown processing (see #8). This is a big loose end, but I played with pandoc some today, and am hoping it will be pretty easy to integrate. It will mean calling out to the system, and is a pretty hairy dependency, so it might be nice to make it optional.
Figure out how to do the autolinking, so that, for example, "<split>" is turned into a link to the documentation page for that element. This would be done either by pre-processing before sending to pandoc, or by writing a pandoc extension. If we change the syntax a little, to add some special delimiters, it would make this easier, and would also solve the problem of how we distinguish between "<em>" when it should be a link to this DTD's <em> element, and when it should be interpreted as HTML.
Need to handle exception cases -- syntax errors in the comments. Right now I have a bunch of FIXMEs. I think the default behavior should be to just drop that one annotation comment, but it would be nice to add a command-line option "--strict" that causes the analyzer to exit abruptly.
The "!dtd" comment, which annotates the top-level, is really just a special-case "!module" comment. We now know which module is the master DTD. So I am thinking that the annotation for this should be moved from the top-level of the output, where it is now, to under the <module> element corresponding to the DTD. Then, that <module> element will get a new attribute dtd="true".
[This one is covered by #16] Instead of using "!dtd" and "!module" in the module comments, let the user put anything at all that doesn't match <elem>, @attr, %pent;, or &gent;. Then, they could write the comment like this:
```
  <!--~~ split-example.dtd
    ....
```
The "split-example.dtd" will be ignored by the comment parser, but is more human-friendly than "!dtd".
If we can guarantee that the names of the modules will be unique, then we can take the systemId and publicId out of the location information on all of the items, and just use attributes module and lineNumber. For example, change
```
  <declaredIn systemId="file:///home...Analyzer/test/split-example/split-example.dtd" 
      publicId="-//NLM//external dtd dummy public id//EN" 
      lineNumber="42"/>
```
to:
```
  <declaredIn module='split-example.dtd' lineNumber="42"/>
```
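A minimal sketch of the relaxed rule from the #16 item above: the first token of a structured comment is treated as a module label unless it looks like <elem>, @attr, %pent;, or &gent;. The class name, the enum, and the simplified name pattern are assumptions made for illustration, not code from the analyzer.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: decides what the first token of a structured comment refers to. */
final class AnnotationTarget {
    enum Kind { ELEMENT, ATTRIBUTE, PARAMETER_ENTITY, GENERAL_ENTITY, MODULE }

    // Simplified name pattern (an assumption); real XML names allow more characters.
    private static final String NAME = "[A-Za-z_:][A-Za-z0-9._:-]*";
    private static final Pattern ELEMENT = Pattern.compile("<" + NAME + ">");
    private static final Pattern ATTRIBUTE = Pattern.compile("@" + NAME);
    private static final Pattern PARAMETER_ENTITY = Pattern.compile("%" + NAME + ";");
    private static final Pattern GENERAL_ENTITY = Pattern.compile("&" + NAME + ";");

    /** Anything that matches none of the four special forms is a module-level label. */
    static Kind classify(String token) {
        if (ELEMENT.matcher(token).matches()) return Kind.ELEMENT;
        if (ATTRIBUTE.matcher(token).matches()) return Kind.ATTRIBUTE;
        if (PARAMETER_ENTITY.matcher(token).matches()) return Kind.PARAMETER_ENTITY;
        if (GENERAL_ENTITY.matcher(token).matches()) return Kind.GENERAL_ENTITY;
        return Kind.MODULE;
    }

    public static void main(String[] args) {
        System.out.println(classify("<split>"));
        System.out.println(classify("split-example.dtd"));
    }
}
```

Under this sketch, both "!dtd" and "split-example.dtd" fall through to the module case, so existing comments would keep working.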

Klortho commented 12 years ago

As far as error checking / handling go, these are some things to check for:

The same creature having two different annotation blocks associated with it. We won't allow this.
Any malformedness in the structure of the structured comments.
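
The first check could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and it assumes each annotation block is identified by a target string such as "<split>" plus the line where it starts.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry that rejects a second annotation block for the same target. */
final class AnnotationRegistry {
    // Maps each annotated target (e.g. "<split>") to the line of its first annotation.
    private final Map<String, Integer> firstSeenAt = new HashMap<>();

    /** Records an annotation block; throws if the target already has one. */
    void register(String target, int lineNumber) {
        Integer previous = firstSeenAt.putIfAbsent(target, lineNumber);
        if (previous != null) {
            throw new IllegalStateException("Duplicate annotation for " + target
                    + " at line " + lineNumber + " (first seen at line " + previous + ")");
        }
    }

    public static void main(String[] args) {
        AnnotationRegistry registry = new AnnotationRegistry();
        registry.register("<split>", 10);
        try {
            registry.register("<split>", 30);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```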

Klortho commented 12 years ago

Update to structured comment processing:

I implemented two new command-line options to control this. Since pandoc is being used as an external executable, and maybe not everybody will want to install and use it, Markdown processing is now opt-in (off by default). To get it, use the option "-m" (for markdown). You could also use "--docproc 'my-processor'" to use any other processor you want. You could also use it without any processing, in which case you could write the annotations in well-formed XHTML, and they will be copied to the output.
Autolinking is done as follows (these are illustrated in the current version of split-example.dtd):
- Element tags must be preceded with a backtick, like this: `<split>. That allows the processor to distinguish between elements you want auto-linked, and HTML.
- Attribute: @instrument (note that GitHub markdown screws this one up.)
- Parameter entity: %banana.ent;
- General entity: &fleegle-pic;
To disable any of these, just precede them with a backslash. E.g. \`<split>, \@instrument, \%banana.ent;, or \&fleegle-pic;
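
As a rough illustration of the four autolink forms and the backslash escape, here is a regex-based sketch. It is not the analyzer's actual implementation: the class name, the anchor format (`#elem-split` and so on), the simplified name pattern, and the choice to emit an escaped marker literally with its backslash removed are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical autolinker for `<elem>, @attr, %pent; and &gent; references. */
final class AutoLinker {
    // Simplified name pattern (an assumption); real XML names allow more characters.
    private static final String NAME = "[A-Za-z_:][A-Za-z0-9._:-]*";

    private static final Pattern TOKEN = Pattern.compile(
            "\\\\(`<|[@%&])"          // group 1: a backslash-escaped marker
            + "|`<(" + NAME + ")>"    // group 2: element name
            + "|@(" + NAME + ")"      // group 3: attribute name
            + "|%(" + NAME + ");"     // group 4: parameter entity name
            + "|&(" + NAME + ");");   // group 5: general entity name

    /** Replaces each unescaped reference with a link; escaped markers lose their backslash. */
    static String link(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                replacement = m.group(1);
            } else if (m.group(2) != null) {
                replacement = anchor("elem", m.group(2), "&lt;" + m.group(2) + "&gt;");
            } else if (m.group(3) != null) {
                replacement = anchor("attr", m.group(3), "@" + m.group(3));
            } else if (m.group(4) != null) {
                replacement = anchor("pent", m.group(4), "%" + m.group(4) + ";");
            } else {
                replacement = anchor("gent", m.group(5), "&amp;" + m.group(5) + ";");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // The href scheme here is invented for the example.
    private static String anchor(String kind, String name, String label) {
        return "<a href=\"#" + kind + "-" + name + "\">" + label + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(link("See `<split> and @instrument, but not <em> or \\@home."));
    }
}
```

Because an element reference needs the leading backtick, a plain "<em>" passes through untouched and is left for the HTML or Markdown step.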

Still to do

[✓ done] Document the new command-line options on the README.
[→ #17] Document all the syntax for the structured comments, including:
- What used to be "!dtd" and "!module"; specifying DTD title there;
- The pandoc-flavored markdown syntax; and
- the special syntax for linking to elements, attributes, and entities (as described above).
Make sure all exception handling is correct; look for FIXMEs.
Implement strict mode. When strict mode is on, any errors while processing the comments will cause the program to abort. When off, then any errors will just cause that comment section to be dropped.
It would be nice to be able to check for well-formedness problems coming out of Markdown, before the XSLT transformation step, so that we could provide better error reporting. For example, if you put a bare tag like "<a>" in the structured comment, without a closing tag, it will be copied verbatim into the output, which will result in non-well-formed XHTML.

ahamelers commented 12 years ago

Wouldn't it be easier to just require the annotations be in HTML, rather than Markdown? Then it wouldn't require the use of additional processors, etc.

Klortho commented 12 years ago

That's the default: "You could also use it without any processing, in which case you could write the annotations in well-formed XHTML, and they will be copied to the output."

So nobody is required to write annotations in Markdown. Markdown is a lot more readable than XHTML, though, so that's why I thought it would be a nice option. Jeff was adding a lot of wiki-like syntax to the current annotations, and rather than go down that route, I think it would be a lot better to use something that's a de-facto standard.

Klortho commented 12 years ago

This is done. "Strict mode", which I described above, is now the only mode. If your comments are not well-formed, the tool will die. It now also checks each comment for well-formedness independently, using a SAXParser, and if a comment is not well-formed, it will choke early and report the exact file and line number of the offending comment.
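
For readers who want the general shape of a per-comment check, here is a sketch using the JDK's `SAXParser`. It is not the code in the tool: the class name, the dummy `<annotation>` wrapper element, and the way the comment's starting line is combined with the parser's line number are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical well-formedness check for one processed annotation comment. */
final class CommentChecker {
    /**
     * Returns null if the fragment is well-formed; otherwise returns a message
     * of the form "file:line: reason".
     */
    static String check(String fragment, String file, int commentLine) {
        // Wrap in a dummy root so bare text and several sibling elements are accepted.
        String wrapped = "<annotation>" + fragment + "</annotation>";
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(wrapped)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // The parser counts lines from 1 within the fragment; offset by the comment's start.
            int line = commentLine + e.getLineNumber() - 1;
            return file + ":" + line + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return file + ":" + commentLine + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<p>fine</p>", "split-example.dtd", 42));
        System.out.println(check("a bare <a> tag", "split-example.dtd", 42));
    }
}
```

Running the check on each comment separately, rather than on the assembled output, is what makes it possible to point at the specific comment that failed.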

ncbi / DtdAnalyzer

Put support for structured annotations back in. #3