Allow CDATA in <t> elements

proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.

http://proycon.github.io/folia/

GNU General Public License v3.0


Allow CDATA in <t> elements #12

Closed proycon closed 6 years ago

proycon commented 8 years ago

Not sure if this is already the case.

kosloot commented 8 years ago

For now, this doesn't look like a good or useful feature. E.g., do we translate the CDATA into text? And what do we do on output? What does text() deliver?

Don't implement unless REALLY needed.

proycon commented 8 years ago

Agreed, idea discarded, we won't implement this.

kosloot commented 7 years ago

Why????

proycon commented 7 years ago

It's still something to investigate, as this keeps popping up, and it is by no means settled. I opened it in response to an enquiry from another user, who expected he could use CDATA. I wonder to what extent strictly disallowing CDATA violates XML specs/conventions.

From W3.org: [Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.]
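To make the questions about text() and output concrete, here is a minimal sketch of how a generic XML parser treats CDATA. It uses Python's standard `xml.etree.ElementTree`, not the FoLiA libraries, and the `<t>` fragments are hypothetical snippets rather than schema-valid FoLiA, so it only shows what the XML layer itself does:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments for illustration; not schema-valid FoLiA.
with_cdata = "<t><![CDATA[if a < b && c > d]]></t>"
with_escapes = "<t>if a &lt; b &amp;&amp; c &gt; d</t>"

# A CDATA section is just another way of writing character data,
# so both fragments yield the same text after parsing.
text_from_cdata = ET.fromstring(with_cdata).text
text_from_escapes = ET.fromstring(with_escapes).text
same_text = text_from_cdata == text_from_escapes

# ElementTree does not remember that the text came from a CDATA section;
# on output it writes the text with entity escapes instead.
serialized = ET.tostring(ET.fromstring(with_cdata), encoding="unicode")

print(text_from_cdata)
print(serialized)
```

With this parser the CDATA wrapper is lost on a round trip, so "what to do on output" is a choice each library would have to make.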

kosloot commented 7 years ago

Well, every FoLiA document is valid XML. This does not imply that every XML construct is valid FoLiA!

My biggest concerns are:

does allowing CDATA mean that "empty" text is no longer forbidden?
if so: what should the text() or str() methods return?
in a CDATA section you can stuff whatever you want: Base64-encoded movies, pictures, scans of books... This doesn't help anyone using FoLiA, imho
we could handle CDATA as 'garbage in, garbage out', but I am sure that users will ask for (lib)folia methods for extracting or inserting such garbage. A slippery slope!
The danger is that users create a minimal FoLiA document with all the useful stuff in CDATA, 'unreachable' for other users. Like a PDF containing complete OCR images.

So: is there a concrete use case that cannot be resolved in other ways?
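As an illustration of one such "other way": text containing markup characters can be written with ordinary entity escapes instead of CDATA. This is a sketch using only Python's standard library (`xml.sax.saxutils.escape` and `xml.etree.ElementTree`); the `<t>` fragment is hypothetical and no FoLiA library call is involved:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Text with characters that would otherwise be read as markup. It also
# contains "]]>", which cannot appear inside a single CDATA section.
raw = 'x < y & "quoted" ]]> still text'

# escape() replaces &, < and > with entity references.
fragment = "<t>" + escape(raw) + "</t>"

# Parsing the escaped fragment gives back the original string.
round_tripped = ET.fromstring(fragment).text

print(fragment)
print(round_tripped == raw)
```

Escaping covers the case of plain text that happens to contain markup characters; it does not address the other concerns listed above.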

proycon commented 6 years ago

Closing this, idea discarded...