tdf / odftoolkit

Java ODF toolkit project
https://odftoolkit.org/
Apache License 2.0

Semantic DOM (SDOM) API & Search API #87

Open svanteschubert opened 3 years ago

svanteschubert commented 3 years ago

There has to be a high-level API that abstracts from the implementation details of the XML DOM, as the Simple API and the org.odftoolkit.odfdom.doc API do, but that is in addition compatible with the new change (collaboration) approach https://tdf.github.io/odftoolkit/odfdom/operations/operations.html. Every operation has to be mappable to one or more API calls, ideally in a generic way (of naming and finding these methods).
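One way to read "mappable in a generic way" is to look up the API method by the operation's name. The following is only a minimal sketch under that assumption: `SketchDocument`, its methods `addParagraph` and `addText`, and the argument keys `index` and `text` are hypothetical names invented for illustration, not the ODFDOM API or its operation format.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OperationDispatcher {

    /** Hypothetical high-level document API; method names are illustrative only. */
    public static class SketchDocument {
        private final List<String> paragraphs = new ArrayList<>();

        /** Inserts an empty paragraph at the given index. */
        public void addParagraph(Map<String, Object> args) {
            int index = (Integer) args.get("index");
            paragraphs.add(index, "");
        }

        /** Appends text to the paragraph at the given index. */
        public void addText(Map<String, Object> args) {
            int index = (Integer) args.get("index");
            String text = (String) args.get("text");
            paragraphs.set(index, paragraphs.get(index) + text);
        }

        public List<String> paragraphs() {
            return paragraphs;
        }
    }

    /** Generic mapping: the operation name is assumed to equal the API method name. */
    public static void apply(Object api, String operationName, Map<String, Object> args)
            throws ReflectiveOperationException {
        Method method = api.getClass().getMethod(operationName, Map.class);
        method.invoke(api, args);
    }

    public static void main(String[] argv) throws Exception {
        SketchDocument doc = new SketchDocument();
        apply(doc, "addParagraph", Map.of("index", 0));
        apply(doc, "addText", Map.of("index", 0, "text", "Hello"));
        System.out.println(doc.paragraphs()); // prints [Hello]
    }
}
```

With such a naming convention, a new operation would only need a new API method and no change to the dispatcher. An operation that maps to several API calls would need an additional lookup table, which this sketch leaves out.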

This new operation/change approach allows every ODT document to be transformed into an equivalent list of user operations/changes, as if the document had just been created from top to bottom by the user. Such a change list in JSON can currently be created by any ODFDOM user of the latest repository (or BETA) by calling the JAR with a JDK >= 9 (tested with JDK 11):

java -jar odfdom-java-1.0.0-BETA1-jar-with-dependencies.jar <USER'S ODT>

Or it can be found in the following [example](https://github.com/tdf/odftoolkit/blob/master/docs/docs/presentations/character-styles.odt) and JSON.

The reason for, and the advantage of, switching from a final state (the zipped ODT document) to a more fine-grained user-change concept is being able to answer the most important question of collaboration, which is also the precondition for merging and synchronizing: "What have you changed?".

On top of this higher-level SDOM API there will additionally be a Search API to query the content of one or more documents, e.g.

svanteschubert commented 3 years ago

Every ODF user is aware of semantics such as a table, paragraph, image, character, etc. Each of these ODF semantic pieces known to users consists of more than one XML piece (i.e. XML nodes) described by the ODF XML grammar. In other words, XML nodes described by the ODF XML grammar can be abstracted to larger puzzle pieces, which are already known to end users and which in general exist in any rich-text file format.

Therefore, the idea is to define these semantic puzzle pieces on top of the ODF grammar.

First, there is the pattern that identifies the beginning of a new semantic entity, e.g. the XML table:table element for the start of a table.
With this declarative approach, the aim is to generate a SAX parser that transforms an ODT into a sequence of equivalent changes. In general, the XML grammar is transformed into a set of method calls representing the possible user changes. But there are more user changes, such as deletions or some modifications (e.g. insertColumn), that will never be created when transforming an existing ODT document into a list of changes. For this reason, some subsets of the XML can be modified by operations, for instance "insertColumn()" on a table. The idea is to define the XML change pattern of "insertColumn()" declaratively on top of the ODF XML grammar and to generate source code from it, which allows the transformation from operations to ODT. In the end, a bi-directional transformation from ODT to operations and back should be possible.
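The ODT-to-changes direction described above can be sketched roughly as follows. This is a hand-written illustration under stated assumptions, not the generated parser: the `START_PATTERNS` table stands in for the declarative part, the operation names `addTable`, `addParagraph` and `addText` are hypothetical, and positions, styles, rows and cells are left out. Only the two ODF namespace URIs and the `table:table` and `text:p` elements come from the ODF grammar.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OdtToOperationsSketch {

    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";

    // Declarative part: the element that starts a semantic entity, mapped to a
    // (hypothetical) operation name.
    static final Map<String, String> START_PATTERNS = Map.of(
            "{" + TABLE_NS + "}table", "addTable",
            "{" + TEXT_NS + "}p", "addParagraph");

    /** Collects one operation per recognized start pattern, plus paragraph text. */
    static class OperationHandler extends DefaultHandler {
        final List<String> operations = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inParagraph;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String operation = START_PATTERNS.get("{" + uri + "}" + localName);
            if (operation != null) {
                operations.add(operation);
            }
            if (TEXT_NS.equals(uri) && "p".equals(localName)) {
                inParagraph = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver text in several chunks, so buffer until the paragraph ends.
            if (inParagraph) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (TEXT_NS.equals(uri) && "p".equals(localName)) {
                if (text.length() > 0) {
                    operations.add("addText " + text);
                }
                inParagraph = false;
            }
        }
    }

    /** Parses an XML string and returns the operations in document order. */
    public static List<String> toOperations(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // patterns are matched on namespace URI + local name
        OperationHandler handler = new OperationHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.operations;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<office:text"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"" + TEXT_NS + "\" xmlns:table=\"" + TABLE_NS + "\">"
                + "<text:p>Hello</text:p>"
                + "<table:table><table:table-row><table:table-cell>"
                + "<text:p>A1</text:p>"
                + "</table:table-cell></table:table-row></table:table>"
                + "</office:text>";
        System.out.println(toOperations(xml));
    }
}
```

For the sample fragment this yields the list `[addParagraph, addText Hello, addTable, addParagraph, addText A1]`. The reverse direction, such as "insertColumn()", cannot be derived from start patterns alone and would need its own change pattern, as described above.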

My goal is to replace the existing manually written feature-spaghetti code (every ODF feature in the same single SAX parser) within odfdom/src/main/java/org/odftoolkit/odfdom/changes with a generic version that makes maintenance possible in the future, even for multiple programming languages (generating Java in the first place, but generating C++ or other languages such as Rust should be possible based on the same declarative approach).