We use the DKPro Core type system for such fundamental information (Sentence, Token, etc.).
The CoNLL-U `sent_id` is captured by the DKPro Core `Sentence` feature `id`. It is indeed something that is missing from the TSV 3.2 format. Same for the `Token.id` feature.
The `newpar` would in DKPro Core not be a sentence-level marker but would probably rather be captured by a `Paragraph` annotation. If that were imported from CoNLL-U, then it would probably already be included in the TSV export, but as an extra column.
I have been thinking a bit about how to handle sentence-level annotations better in TSV, but the problem/question is how to generically decide whether an annotation is an actual sentence-level annotation or whether it just happens to span an entire sentence. Mind that the TSV writer has no access to the detailed layer setup information such as the "granularity" of a layer since it works at the level of the UIMA CAS.
Would it help to distinguish sentence-level metadata from sentence-wide span annotations, at least at the technical level, with the former being imported from, and serialized as, hash-comments at the top of the sentence (and probably inaccessible to the annotator), and the latter being recorded in one of the columns and accessible from inside the tool?
Or am I missing the point? (Your comment on `newpar` being recorded in an extra column seems to run directly counter to this simple operational distinction, so I'm wondering.)
Have you perhaps been thinking of some other forms of what should be thought of as non-span sentence-level annotation (as opposed to metadata)? OK, someone might want to tag sentences as tensed vs. infinitive, etc., or as questions vs. statements -- is that what makes you hesitate? I wonder if, purely technically, such annotations could be considered features of the sentence root in some pseudo-dependency. Or, more neutrally, as a special sort of span. Or I'm blabbing... :-/
The question is more how to represent this in UIMA / in the DKPro Core type system. E.g. as it is right now, a "token-level" annotation is an annotation which has the same start/end offsets as tokens. At the level of UIMA/DKPro Core, there is nothing else which indicates that an annotation is bound to tokens. In INCEpTION, we additionally have the "Granularity" setting which tells us what an annotation is supposed to be "bound" to (characters, single tokens, multiple tokens, sentence). Document-level annotations work differently yet again. They are modelled as a separate layer type. But we can discern them because they inherit from the UIMA type `AnnotationBase`, while text-level annotations inherit from the type `Annotation`.
An annotation at the token level is not a feature on a token. E.g. a named entity annotation is not a feature on a token. It is a separate annotation which has its own features (i.e. NE type and NE identifier). The `id` is a feature of the `Sentence` layer. But a sentence-level annotation is not necessarily a feature of the sentence layer. Primarily, it is an annotation which has the same start/end positions as the sentence and has its own features.
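To make the offset-based view concrete, here is a rough sketch using the dkpro-cassis Python library on an exported XMI; the layer name `webanno.custom.MyLayer` and the file names are placeholders:

```python
# Sketch: find annotations whose offsets coincide with a sentence.
# Placeholder names: TypeSystem.xml, document.xmi, webanno.custom.MyLayer.
from cassis import load_cas_from_xmi, load_typesystem

SENTENCE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"

with open("TypeSystem.xml", "rb") as f:
    typesystem = load_typesystem(f)
with open("document.xmi", "rb") as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

sentence_spans = {(s.begin, s.end) for s in cas.select(SENTENCE)}

for anno in cas.select("webanno.custom.MyLayer"):
    if (anno.begin, anno.end) in sentence_spans:
        # Offsets alone cannot tell us whether this is a "real" sentence-level
        # annotation or a span that merely happens to cover a whole sentence.
        print("sentence-wide:", anno.get_covered_text())
```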
It would be possible to introduce new abstract layer types such as `TokenAnnotation`, `MultiTokenAnnotation` or `SentenceAnnotation` to mirror the granularity information we currently can glean from `Annotation` and `AnnotationBase`. I'd have to think about what kind of disruptions such a change could cause.
Another possibility could be to introduce a new feature (e.g. `owner`) on annotations that (if set) would indicate that the annotation "belongs" to another annotation. E.g. a sentence-level annotation could set its `owner` to be the sentence it belongs to. Or the sentence could carry a list of its "metadata" annotations or something like that.
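Sketched as a UIMA type description, the `owner` idea might look roughly like this (all names are invented for illustration; no such type exists yet):

```xml
<!-- Hypothetical type with an "owner" feature pointing at the sentence it belongs to. -->
<typeDescription>
  <name>custom.SentenceLevelAnnotation</name>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>owner</name>
      <description>The Sentence annotation this annotation belongs to.</description>
      <rangeTypeName>uima.tcas.Annotation</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```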
So there are various possibilities for how to model this at the technical level. Each has different implications. Not sure yet which path to take.
I think that adding the sentence ID to TSV doesn't require a major version increment. A WebAnno TSV 3.3 should be able to handle that.
The CoNLL-U reader in DKPro Core 1.12.0 and 2.1.0 will support reading paragraphs. So I think this means after upgrading to these versions, paragraph information from a CoNLL-U file imported after the upgrade should be retained as an additional column in the TSV export.
@amir-zeldes WDYT... how much disruption would a WebAnno TSV 3.3 with an optional `#sent_id` stanza cause?
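For concreteness, such a stanza would amount to roughly one extra comment line per sentence block, along these lines (illustrative sketch; the key name and ID value are placeholders):

```
#Sentence.id=sent-1
#Text=The ground state of heavily-overdoped non-superconducting La 2−x Sr x CuO 4
1-1	0-3	The	_	_	_	_
[...]
```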
@amir-zeldes @bansp the code is ready: https://github.com/webanno/webanno/pull/1546
What remains is deciding when to merge it...
https://github.com/webanno/webanno/pull/1546 includes the sentence ID - nothing more (no paragraphs, no other sentence metadata, and no token IDs).
If that's all then it sounds fairly minor to me, I think I can update SNP to support that when I get a chance. If you're really worried about backwards compatibility though, I would just leave the option to export to earlier versions of the format (it's sometimes faster to tell a user of a legacy tool to just do that than to explain to each one what they need to change...)
Introducing lots of switches into the code to allow exporting old versions makes the code less maintainable - I think it is probably not worth the effort.
Also copy/pasting the entire code to retain an "old copy" makes maintenance annoying - I'd do that when introducing a new major version (i.e. TSV 4) but for such minor changes, I would consider the effort to be too high.
Depending on how e.g. SNP is implemented, it might consider all `#` lines before the sentence starts to be header lines and try to interpret them as key/value pairs, even if only one key is defined in TSV 3.2. If that is the case, then maybe no change is needed at all if a `#Sentence.id` is found, because it would simply be ignored.
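A minimal sketch of that tolerant reading strategy (plain Python, not SNP internals; the input lines are made up):

```python
# Sketch: treat every "#key=value" line before the first token row as sentence
# metadata, and simply ignore keys the consumer does not know (e.g. Sentence.id).
def read_sentence_header(lines):
    metadata = {}
    for line in lines:
        if not line.startswith("#"):
            break                      # first token row reached
        key, sep, value = line[1:].partition("=")
        if sep:                        # only well-formed key=value pairs
            metadata[key.strip()] = value
    return metadata

header = read_sentence_header([
    "#Sentence.id=sent-1",
    "#Text=The ground state of ...",
    "1-1\t0-3\tThe\t_",
])
# A TSV 3.2 consumer that only knows "Text" never looks at "Sentence.id".
print(header["Text"], header.get("Sentence.id"))
```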
Definitely no need to worry too much about SNP, as we can maintain that if necessary. It's also pretty simple code; it's Salt that does the heavy lifting - the format-in code is very self-contained.
Thanks a lot, Richard! :-)
Hi all, I'm not sure my question is 100% related to this thread. I would like to store additional information at the sentence level, for example the section (title, abstract, body, annex) in which a certain sentence appears. Is there a way to do so at the moment?
The reason behind this is that I transform to TSV from a richer format, where these sections are identified, and I would like to transform back to the same format once I export from Inception.
Regards Luca
You can have an extra span layer (optionally using the "sentence" granularity), add the features you need to it and pronto. TSV can export/import this.
You could also transform your data to XMI and use the following DKPro Core types to model your original structure:
Since these types are part of DKPro Core, they are internally available to INCEpTION and are preserved from import to export.
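For example, a rough dkpro-cassis sketch of producing such an XMI (the exact qualified type names should be checked against the DKPro Core type system you export; the file names, offsets and the use of `Paragraph` here are assumptions):

```python
# Sketch: build a CAS whose structure is modelled with DKPro Core types and
# save it as XMI for import into INCEpTION. Names and offsets are illustrative.
from cassis import Cas, load_typesystem

with open("TypeSystem.xml", "rb") as f:       # DKPro Core / INCEpTION type system
    ts = load_typesystem(f)

cas = Cas(typesystem=ts)
cas.sofa_string = "A title\nThe abstract text ..."

# Check the exact type name in your exported type system.
Paragraph = ts.get_type("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Paragraph")
cas.add_annotation(Paragraph(begin=0, end=7))    # the title line
cas.add_annotation(Paragraph(begin=8, end=29))   # the abstract

cas.to_xmi("document.xmi", pretty_print=True)
```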
If there is interest, we could add the `XmlDocumentReader` from DKPro Core as an importer to INCEpTION which parses the XML structure of a document into the types mentioned above and extracts the text for annotation.
Thanks @reckart for the quick answer.
The first option would be fine. Is there an example I can see? Or do I just add something like
#Sentence_section=title
#Text=The ground state of heavily-overdoped non-superconducting La 2−x Sr x CuO 4
1-1 0-3 The _ _ _ _
1-2 4-10 ground _ _ _ _
1-3 11-16 state _ _ _ _
1-4 17-19 of _ _ _ _
[...]
Thank you
Span annotations take up columns - even if their granularity is the sentence.
Try creating a span layer "Sentence metadata" in the project settings, set its granularity to Sentence, add some features, annotate a sentence using your new layer and export it as TSV to see. Alternatively, check the spec where you find the example "Multi-token span annotations and stacked span annotations".
tl;dr: the info is stored repeatedly on every line in the TSV - not nice for your case, but that seems to be the way it is currently implemented.
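Roughly, the export then looks like this (sketch only; the layer name `webanno.custom.SentenceMetadata` and the feature `section` are placeholders, and the exact headers depend on your project setup):

```
#T_SP=webanno.custom.SentenceMetadata|section
#Text=The ground state of heavily-overdoped non-superconducting La 2−x Sr x CuO 4
1-1	0-3	The	title[1]
1-2	4-10	ground	title[1]
1-3	11-16	state	title[1]
[...]
```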
@reckart thanks!
I wonder if it is possible to use the TSV format to carry in and out some additional information relative to the document itself, e.g. the DOI.
Not right now. The format would need to be extended once again.
May I ask what benefits you see in the TSV format over the XMI format at this time? The TSV format has become quite complex over time and is by no means trivial to interpret or to generate. I would have imagined that Python support for XMI in the form of DKPro cassis would almost remove the need for the WebAnno TSV, but apparently there are still reasons to stick with the TSV. I would be curious what these reasons are?
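For comparison, a minimal dkpro-cassis read of an INCEpTION XMI export (assuming the type system is exported alongside; the file names are placeholders):

```python
# Sketch: iterate sentences and their tokens in an exported XMI.
from cassis import load_cas_from_xmi, load_typesystem

SENTENCE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"

with open("TypeSystem.xml", "rb") as f:
    ts = load_typesystem(f)
with open("document.xmi", "rb") as f:
    cas = load_cas_from_xmi(f, typesystem=ts)

for sentence in cas.select(SENTENCE):
    tokens = [t.get_covered_text() for t in cas.select_covered(TOKEN, sentence)]
    print(sentence.id, tokens)
```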
OK, in fact, I don't need it particularly... and I agree that actually it's a bit out of scope for the TSV.
There is no special reason why I'm using the TSV. I started using it at the beginning, but indeed things have changed; it's time to take a step back and I will check the XMI. Thanks for the remark.
I'm not saying that it shouldn't be done for TSV. In fact, one of the reasons why the document metadata feature in INCEpTION is presently (afaik) still marked as experimental is that TSV does not export these annotations. I just wonder about the motivation for and impact of such a change, which may affect its priority. I also wonder whether the fact that TSV does not support document-level annotations should at this point still be seen as a reason for considering document-level annotations experimental in INCEpTION.
I think xmi can be a barrier for beginners processing data from Inception. We often have to train new research assistants with basic Python who can quickly grasp what's going on in the TSV, and might take longer to understand the xmi abstractions. It's also handy to have a human-readable format that you can just glance at to figure out if something is broken.
I've checked the XMI and I remember why I chose TSV. As @amir-zeldes mentioned, even though I'm no beginner with XML, I found XMI quite hard to understand. It's really not made for human interaction, IMHO. Since INCEpTION exports/imports so many formats, I ended up choosing TSV.
@lfoppiano Do you often look at your files in a text editor?
@reckart sorry, I missed your last question. Sometimes I do, but it depends on the situation. I often find XML readable, but it depends on the case... In the case of XMI, I checked the files to understand the structure, and indeed they are not made to be opened by humans :-)
No further extensions to the WebAnno TSV format are planned.
**Is your feature request related to a problem? Please describe.** When I import documents as CoNLL-U, the sentences come with extra info, in my case `#newpar` and `#sent_id`, as in the following:
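(A generic CoNLL-U fragment with such comment lines, shown here for illustration only - not the original data:)

```
# newpar
# sent_id = doc1.s1
1	This	this	PRON	DT	_	4	nsubj	_	_
2	is	be	AUX	VBZ	_	4	cop	_	_
3	a	a	DET	DT	_	4	det	_	_
4	sentence	sentence	NOUN	NN	_	0	root	_	_

# sent_id = doc1.s2
[...]
```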
(You will notice that I sometimes have sequences consisting of more than one sentence, and that #newpar can be inferred from the IDs, so it's not so crucial.)
INCEpTION does a lovely job displaying these IDs as pop-ups over sentence numbers, and since I can see that they are recognized by the system, I would love to keep them in the export. Indeed, the IDs (but not #newpar) are preserved in the export back to CoNLL-U, and I would like to register this as a feature request in the process of pooling ideas for the next version of the WebAnno TSV format, next to #284, #1003 and #1228.
**Describe the solution you'd like** I would like to be able to work with a WebAnno TSV 3.x (or 4) format that supports sentence-level metadata, minimally `sent_id`.

**Potential extensions** Potential extensions are described at https://universaldependencies.org/format.html#paragraph-and-document-boundaries (and in the following section).