proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Some thoughts on version management #67

Open kosloot opened 5 years ago

kosloot commented 5 years ago

I think we should spend some time on thinking about version management. With every version, we introduce or change tags, features or the semantics. Of course this is fine, as we add a folia version number to the document so in principle we know what to do.

But there are caveats:

  1. Especially older documents don't always have a version number. This is forbidden since 1.5 (@proycon am I right?), but all our tools should be aware of this situation and assign a version number that reflects it. (libfolia assigns 1.4.987 as a 'magic' number.)

  2. a FoLiA processor/program may read a document with some version and then

    • modify it, but staying within the requirement of that version or lower.
    • OR modify it using newer features

This puts some constraints on OUTPUT. In general I think the version should be preserved, except when newer features are introduced. Preserving the version is not that easy though, e.g. some tags are changing their names. When reading a 1.5 document that uses the <alignment> tag, we must output it as <relation> for 2.0 but still as <alignment> for 1.5.
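To make the renaming problem concrete, here is a minimal sketch of version-dependent tag naming. The `RENAMED_IN` table and the `tag_for_output` helper are hypothetical names made up for illustration; they are not part of libfolia or foliapy. Only the alignment/relation rename at 2.0 comes from this thread.

```python
# Hypothetical rename table: current tag name -> (old name, version of the rename).
# Only the alignment -> relation rename at 2.0 is taken from this discussion.
RENAMED_IN = {
    "relation": ("alignment", (2, 0)),
}


def parse_version(text):
    """Turn a version string such as '1.5' or '2.0.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def tag_for_output(current_tag, output_version):
    """Return the tag name that is valid for the requested output version."""
    # Tags that were never renamed map to themselves.
    old_name, renamed_at = RENAMED_IN.get(current_tag, (current_tag, (0,)))
    if parse_version(output_version) < renamed_at:
        return old_name
    return current_tag
```

With this table, `tag_for_output("relation", "1.5")` yields the old name and `tag_for_output("relation", "2.0")` the new one. A real serialiser would need more than a name lookup, since a rename can come with changed semantics or attributes.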

We could choose to always output the current/most recent version of FoLiA, but this is also dangerous: it might require 'fixing' parts of the document to match new requirements like text consistency, and it could cause trouble for other tools outside our world that aren't version aware yet.

Providing a tool to update a FoLiA document to the most recent version is part of a solution, but there are still a lot of older documents out there which will not (or NEVER) be updated, like those in SoNaR, Nederlab etc.

Maybe we should also create a policy of NOT removing/renaming tags, at least not between minor versions, as this makes backward compatibility a real PITA.

Conclusion: We really should come up with some strict guidelines. Discussion welcome.

proycon commented 5 years ago

These are good points to think about indeed.

Especially older documents don't always have a version number. This is forbidden since 1.5 (@proycon am I right?), but all our tools should be aware of this situation and assign a version number that reflects it. (libfolia assigns 1.4.987 as a 'magic' number.)

I'm not sure if it's strictly forbidden actually, but a version number should definitely be mandatory. I'll ensure this for v2.0. In practice I fortunately don't see many old documents without a version number, considering that all our libraries and tools always output it.
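As an illustration of the fallback described above, a tool could read the version from the document root and substitute a default when it is missing. This is only a sketch using the standard library: it assumes the version lives in a `version` attribute on the root element, and `detect_version` is a hypothetical helper, not an existing function in libfolia or foliapy. The fallback value is the 'magic' number mentioned for libfolia.

```python
import xml.etree.ElementTree as ET

# The 'magic' number libfolia assigns to documents without a version (per this thread).
FALLBACK_VERSION = "1.4.987"


def detect_version(xml_text):
    """Return the declared FoLiA version, or the fallback if none is declared.

    Assumption: the version is stored in a 'version' attribute on the root element.
    """
    root = ET.fromstring(xml_text)
    return root.get("version") or FALLBACK_VERSION
```

Downstream code can then treat every document as versioned, and recognise the fallback value as "version unknown, older than 1.5".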

a FoLiA processor/program may read a document with some version and then

  • modify it, but staying within the requirement of that version or lower.
  • OR modify it using newer features

Yes, but in practice we have always done only the latter, I think, since the libraries always output the latest version (the one they were designed for), regardless of what they read (this is what libfolia does too, right?). This might indeed introduce backward-compatibility issues if you really want or need to maintain an older FoLiA version for a document. On the other hand, the libraries would become more complex if they had to output older versions, and if we do that we have to guarantee that the output validates against the older version, which is not trivial, as you already remarked yourself. The forward-forcing behaviour we have now hasn't really been a concern yet, but with FoLiA 2 it might become one. Still, I'm not sure it's really worth the effort to implement serialisation of older versions, even though in an ideal world it would surely be the nice thing to do.
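For comparison with the forward-forcing behaviour, the "preserve the version unless newer features are used" policy proposed at the top of this issue could be sketched as below. Everything here is hypothetical: `output_version` is not an existing library function, and the mapping from features to minimum versions has to be supplied by the caller.

```python
def parse_version(text):
    """Turn a version string such as '1.5' or '2.0.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def output_version(input_version, used_features, feature_min_version):
    """Pick the lowest version that covers both the input and the features used.

    feature_min_version is a caller-supplied (hypothetical) mapping from
    feature name to the first FoLiA version that supports it.
    """
    needed = [feature_min_version[name] for name in used_features]
    # Keep the input version unless some used feature demands a newer one.
    return max([input_version] + needed, key=parse_version)
```

For example, `output_version("1.5", [], {})` keeps "1.5", while `output_version("1.5", ["relation"], {"relation": "2.0"})` moves the document to "2.0". Choosing the version is the easy part; guaranteeing that the serialised output validates against that version is the costly part discussed above.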

Providing a tool to update a FoLiA document to the most recent version is part of a solution, but there are still a lot of older documents out there which will not (or NEVER) be updated, like those in SoNaR, Nederlab etc.

The current v2 libraries should remain backward compatible with all previous FoLiA versions (unless someone specifically creates a v2-only library, but it's too early for that). Upon modification and serialisation, however, they will output v2 (at least, foliapy does so).

Maybe we should also create a policy of NOT removing/renaming tags, at least not between minor versions, as this makes backward compatibility a real PITA.

Yes, agreed. Removing/renaming is a major change, which is why I now only do it for the transition from 1 to 2.

kosloot commented 5 years ago

Well, libfolia DOES preserve the version number, which is different from the Python version, mainly to avoid text inconsistencies. Serializing to 1.5 or higher would imply, among other things, fixing text inconsistencies AND recalculating text offsets. This is NOT a simple task! Maybe it is even better to signal the problem and give the user the task of fixing this IN ADVANCE, using a tool we provide (although fixing offsets is really difficult).
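"Signal the problem" could look roughly like the sketch below: refuse to re-serialise a pre-1.5 document as 1.5 or higher and tell the user to fix it first. The exception class and `check_reserialisation` are hypothetical names; the 1.5 boundary for text consistency is the one mentioned in this thread.

```python
# Version from which text consistency is required (per this thread).
TEXT_CONSISTENCY_SINCE = (1, 5)


class UpgradeNeedsFixing(Exception):
    """Raised when a document must be repaired before it can be upgraded."""


def parse_version(text):
    """Turn a version string such as '1.4.987' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def check_reserialisation(doc_version, target_version):
    """Raise if writing the document as target_version crosses the 1.5 boundary."""
    if parse_version(doc_version) < TEXT_CONSISTENCY_SINCE <= parse_version(target_version):
        raise UpgradeNeedsFixing(
            f"Document is version {doc_version}; fix text inconsistencies and "
            f"offsets before serialising as {target_version}."
        )
```

Preserving the version (target equal to the input) never raises, which matches what libfolia does according to the comment above.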

proycon commented 5 years ago

Hmm right, that is indeed a very important point. I wonder whether I keep the version at 1.4 for those older documents; I'd have to check.

proycon commented 5 years ago

I added some compatibility for serialisation of v1 to foliapy (a keepversion parameter on Document instantiation). It remains limited, but is hopefully enough for the transition period. Most important is that people who upgrade FoLiA-tools (for which there is now only the v2 version) are not hindered.