tarsqi / ttk

Tarsqi Toolkit
Apache License 2.0

What to do with the DCT? #10

Closed marcverhagen closed 7 years ago

marcverhagen commented 8 years ago

In the TTK format, the DCT is currently saved as a TIMEX3 tag immediately inside ttk_tags (but outside any doc_element tag). Is there value in changing the top level of the TTK format by adding a metadata tag alongside the existing tags (text, source_tags and ttk_tags)? In the metadata tag we could then either put the DCT TIMEX3 as before or use a dedicated element: <metadata><dct>20160226</dct></metadata>.

We also need to decide what to do if the document text contains an actual time expression that can be tagged as the DCT. Would we still keep the dct in the metadata section, which creates the possibility of an inconsistency, or would we leave it out of the metadata, which makes finding the DCT more involved? I am leaning towards the first.

reevesr commented 8 years ago

I see the value of having the DCT in the metadata, and also of having logic that checks whether a DCT exists in the document text, so that an inconsistency can be detected. The decision in that case is whether an error (or maybe a warning) should be thrown, or whether one of the two DCTs should be preferred when they conflict. The preference could be based on preferring one source over the other. One of the two may also be more precise than the other (say one is a date-time and the other a date, with the same date part), so maybe it should be a two-tiered preference: prefer precision first and otherwise prefer source.
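The two-tiered preference described above could be sketched roughly as follows. This is only an illustration, not TTK code: the function name `choose_dct`, the `preferred` parameter, and the assumption that DCT values are compact ISO-style strings (so a date-time extends its date as a prefix) are all hypothetical.

```python
import warnings
from typing import Optional


def choose_dct(metadata_dct: Optional[str],
               text_dct: Optional[str],
               preferred: str = "metadata") -> Optional[str]:
    """Pick one DCT value from two candidate sources (hypothetical helper).

    Assumes values such as "20160226" or "20160226T1030", where a more
    precise value starts with the less precise one.
    """
    # Only one source (or neither) has a DCT: nothing to reconcile.
    if metadata_dct is None or text_dct is None:
        return metadata_dct if text_dct is None else text_dct

    # Tier 1: same date part, so prefer the more precise (longer) value.
    shorter, longer = sorted((metadata_dct, text_dct), key=len)
    if longer.startswith(shorter):
        return longer

    # Tier 2: a real conflict, so warn and fall back to the preferred source.
    warnings.warn(
        "DCT mismatch: metadata=%s, text=%s" % (metadata_dct, text_dct))
    return metadata_dct if preferred == "metadata" else text_dct
```

Raising an error instead of warning would be a one-line change in the conflict branch; which of the two is appropriate is the open question in this comment.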

marcverhagen commented 8 years ago

We will use a metadata tag, close to the one proposed above but with an attribute to store the DCT value: <metadata><dct value="20160226"></dct></metadata>. We allow TIMEX3 tags on text spans to be DCT timexes, and one could imagine a metadata parser that seeks out those tags. But Tarsqi will not add an empty TIMEX3-DCT tag to the tarsqi tags just so that there is a DCT amongst the tags.
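For illustration, a reader of this format might pull the DCT out of the metadata tag as in the sketch below. It uses only the standard library; the root element name `ttk`, the sample document, and the helper name `read_dct` are assumptions made for the example and are not taken from the TTK code.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical minimal TTK document; the root element name is an assumption.
TTK_SAMPLE = """<ttk>
<metadata><dct value="20160226"></dct></metadata>
<text>Sample text.</text>
<source_tags></source_tags>
<ttk_tags></ttk_tags>
</ttk>"""


def read_dct(ttk_xml: str) -> Optional[str]:
    """Return the value attribute of metadata/dct, or None if it is absent."""
    root = ET.fromstring(ttk_xml)
    dct = root.find("metadata/dct")
    return None if dct is None else dct.get("value")


print(read_dct(TTK_SAMPLE))
```

Keeping the value in an attribute means a missing DCT shows up as `None` rather than as an empty string of element text.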

When this is added, the source parsers and metadata parsers will need to be updated.

marcverhagen commented 8 years ago

Done in https://github.com/tarsqi/ttk/commit/6b0f81cc31790fb0c31d2013becb0870e68a9114

marcverhagen commented 8 years ago

Re-opened because it turns out that Blinker tries to generate TLINKs between TIMEX3 tags, including the DCT, which would be hard to do with the DCT only inside the metadata. The likely fix is to add a non-consuming tag with id="t0". Do we mind if this is potentially the second DCT in a document, in case there was already one? This relates to issue https://github.com/tarsqi/ttk/issues/13.
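A rough sketch of what adding such a tag could look like, assuming the tags are handled as an XML tree. The helper name `add_dct_timex` and the use of `begin="-1"` and `end="-1"` to mark the tag as non-consuming are assumptions made for this example; `functionInDocument="CREATION_TIME"` is the TimeML way of marking a document creation time.

```python
import xml.etree.ElementTree as ET


def add_dct_timex(ttk_tags: ET.Element, dct_value: str) -> ET.Element:
    """Add a non-consuming TIMEX3 with id="t0" unless one already exists."""
    # Reuse an existing t0 tag so repeated calls do not create duplicates.
    for timex in ttk_tags.iter("TIMEX3"):
        if timex.get("id") == "t0":
            return timex
    # Offsets of -1 stand in for "no text span" (an assumption, see above).
    return ET.SubElement(ttk_tags, "TIMEX3", {
        "id": "t0",
        "type": "DATE",
        "value": dct_value,
        "functionInDocument": "CREATION_TIME",
        "begin": "-1",
        "end": "-1",
    })


tags = ET.fromstring("<ttk_tags></ttk_tags>")
add_dct_timex(tags, "20160226")
print(ET.tostring(tags, encoding="unicode"))
```

This only guards against a second t0 tag. It does not check for a TIMEX3 in the text that is already marked as the DCT, which is the case the question above is about.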