jonorthwash / ud-annotatrix

GNU General Public License v3.0

determine how to handle converting between formats #498

Open jonorthwash opened 1 year ago

jonorthwash commented 1 year ago

Currently there are some issues related to converting between formats.

One problem is that converting between formats is always lossy. Even between CoNLL-U and CG3, quite a bit is lost. For example, only CoNLL-U supports enhanced dependencies and a distinction between XPOS and UPOS tags, and CG3 and CoNLL-U handle subtokens differently (and store different information about them, I think?).
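To make the kind of loss concrete, here is a minimal Python sketch (not code from ud-annotatrix or notatrix). It parses one CoNLL-U token line into its ten standard columns and projects it onto a set of fields that a target format is assumed to support. `parse_conllu_token`, `project`, and `TARGET_FIELDS` are hypothetical names, and the contents of `TARGET_FIELDS` are an assumption chosen for illustration, not a claim about what CG3 can represent.

```python
# The ten columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_conllu_token(line):
    """Split one tab-separated CoNLL-U token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


def project(token, supported):
    """Return (kept, lost): the fields the target can hold, and the non-empty rest."""
    kept = {k: v for k, v in token.items() if k in supported}
    # "_" marks an unspecified value in CoNLL-U, so dropping it loses nothing.
    lost = {k: v for k, v in token.items() if k not in supported and v != "_"}
    return kept, lost


# ASSUMPTION: an illustrative target format with no XPOS, DEPS, or MISC slot.
TARGET_FIELDS = {"ID", "FORM", "LEMMA", "UPOS", "FEATS", "HEAD", "DEPREL"}

line = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t3:nsubj\t_"
kept, lost = project(parse_conllu_token(line), TARGET_FIELDS)
print(lost)
```

Whatever ends up in `lost` is what a naive one-way conversion would throw away for that token.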

Suppose the user wants to edit the corpus in a different format, and we try to preserve, in an underlying format, the information that is not native to the format being edited. If they then change the number or position of tokens, or edit something tied to the non-visible information, that information could easily get lost, or at least become impossible to match back up with its tokens.

We have a few options for how to deal with this:

  1. We could just leave it as is, where data loss always happens.
  2. We could make it harder to switch formats—or at least to switch formats and edit the new format. Perhaps make different formats view-only by default, and then display a modal when the user tries to start editing in a different format than the corpus is "stored in" (or was originally in), along the lines of "You will lose data—only proceed if you're okay with that!"
  3. We could try to keep track of data that is going to be lost more carefully, instead of just replacing the stored corpus with the new format, so that it's only really ever lost if the user does something that disrupts a particular token or the ability to keep track of associated data. This would require implementing a better "format-neutral" way of storing data than what is already in notatrix.
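A rough Python sketch of how option 3 might work, under the assumption that every token gets a stable identifier that survives edits. None of this is the notatrix API; `Token`, `Corpus`, `hidden`, and `apply_edit` are hypothetical names. Data that the current view cannot display rides along in `hidden`, and is dropped only when its token disappears from the edited sequence.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Token:
    uid: int                                    # stable id, independent of position
    form: str
    hidden: dict = field(default_factory=dict)  # data the current view cannot show


class Corpus:
    def __init__(self):
        self._next_uid = count(1)
        self.tokens = []

    def add(self, form, **hidden):
        tok = Token(next(self._next_uid), form, dict(hidden))
        self.tokens.append(tok)
        return tok

    def apply_edit(self, edited):
        """Apply an edit given as (uid or None, form) pairs; None marks a new token.

        Returns the old tokens whose hidden data could not be carried over.
        """
        by_uid = {t.uid: t for t in self.tokens}
        seen = set()
        new_tokens = []
        for uid, form in edited:
            old = by_uid.get(uid)
            if old is not None and uid not in seen:
                seen.add(uid)
                # Same token, possibly edited: keep its hidden data attached.
                new_tokens.append(Token(uid, form, old.hidden))
            else:
                # New or untraceable token: starts with no hidden data.
                new_tokens.append(Token(next(self._next_uid), form))
        dropped = [t for t in self.tokens if t.uid not in seen and t.hidden]
        self.tokens = new_tokens
        return dropped


corpus = Corpus()
the = corpus.add("The", XPOS="DT")
dogs = corpus.add("dogs", XPOS="NNS", DEPS="3:nsubj")
bark = corpus.add("bark", XPOS="VBP")

# In another format's view, the user deletes "The", edits "dogs", and inserts "will".
dropped = corpus.apply_edit([(dogs.uid, "dog"), (None, "will"), (bark.uid, "bark")])
print([t.form for t in dropped])
```

The returned list could drive a warning about exactly which tokens lost data. Whether hidden data is still valid after the visible form changes (here, the XPOS of "dog") is a separate question this sketch does not address.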

What is preferred? Other ideas?

jonorthwash commented 1 year ago

Note from @ftyers, @mr-martian, and @TinoDidriksen: Enhanced dependencies are possible in CG3 using relations.

jonorthwash commented 1 year ago

@ftyers prefers 2 or 3. I suggest 3 as the end goal, with 2 as an easier short-term stopgap for now.
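For the stopgap, option 2 could be as small as a guard around the edit action. The Python sketch below is only an illustration of that logic, not existing ud-annotatrix code: `CorpusSession`, `is_read_only`, and `request_edit` are hypothetical names, and the `confirm` callable stands in for the modal dialog.

```python
class CorpusSession:
    """Tracks which format the corpus is stored in and gates edits in other formats."""

    def __init__(self, stored_format):
        self.stored_format = stored_format

    def is_read_only(self, view_format):
        # Any format other than the stored one is view-only by default.
        return view_format != self.stored_format

    def request_edit(self, view_format, confirm):
        """Return True if editing may proceed; `confirm` plays the role of the modal."""
        if not self.is_read_only(view_format):
            return True
        message = (
            f"Editing as {view_format} will lose data that only "
            f"{self.stored_format} can store. Proceed?"
        )
        if confirm(message):
            # The user accepted the loss: the corpus is now stored in the new format.
            self.stored_format = view_format
            return True
        return False


session = CorpusSession("CoNLL-U")
allowed = session.request_edit("CG3", confirm=lambda message: False)
print(allowed, session.stored_format)
```

Switching `stored_format` on confirmation mirrors the current behaviour of replacing the stored corpus, just with an explicit warning first.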