spraakbanken / sparv-pipeline

Språkbanken's text analysis tool
https://spraakbanken.gu.se/sparv
MIT License

Prevent namespace-clash #47

Open anne17 opened 4 years ago

anne17 commented 4 years ago

What happens if the input data contains an annotation with the same name as a Sparv annotation (including namespace)? Should we build in a check for that? We could prefix the Sparv annotation with sparv in that case. At a minimum, a warning should be issued, perhaps with a tip that the user can rename the colliding element in their input data.

Investigate whether the annotation files of input-data elements can be renamed!

anne17 commented 4 years ago

We could have a flag that when turned on prefixes all automatic annotations with sparv. This could be useful for users who want to have a clear distinction between existing annotations and annotations added by Sparv.

anne17 commented 4 years ago

It is now possible to add namespaces to all source and/or automatic annotations via the config options export.source_namespace and export.sparv_namespace. Sparv will also try to resolve colliding attribute names by adding a default namespace to the Sparv annotations. If that is not possible, the user will receive a warning.
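To illustrate the attribute-name resolution described above, here is a minimal sketch. It is not Sparv's actual implementation: the function name, the dot separator between namespace and attribute name, and the default namespace "sparv" are all assumptions made for illustration. Only the two option names, export.source_namespace and export.sparv_namespace, come from the config.

```python
def resolve_attribute_names(source_attrs, sparv_attrs,
                            source_namespace="", sparv_namespace="",
                            default_namespace="sparv"):
    """Hypothetical helper: map attribute names to export names.

    Returns (source_names, sparv_names, warnings), where the first two
    are dicts from original attribute name to exported name.
    """
    def prefixed(name, namespace):
        # Assumption: namespace and name are joined with a dot.
        return f"{namespace}.{name}" if namespace else name

    source_names = {a: prefixed(a, source_namespace) for a in source_attrs}
    sparv_names = {a: prefixed(a, sparv_namespace) for a in sparv_attrs}
    warnings = []

    taken = set(source_names.values())
    for attr, export_name in list(sparv_names.items()):
        if export_name not in taken:
            continue  # no collision with a source attribute
        # Collision: try the default namespace for the Sparv attribute.
        fallback = prefixed(attr, default_namespace)
        if fallback in taken or fallback in sparv_names.values():
            warnings.append(
                f"Attribute '{export_name}' collides with a source attribute "
                f"and could not be renamed automatically."
            )
        else:
            sparv_names[attr] = fallback

    return source_names, sparv_names, warnings
```

With source attributes pos and id and Sparv attributes pos and lemma, only the colliding Sparv attribute is renamed; setting a Sparv namespace explicitly prefixes every automatic annotation instead.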

Detecting and warning about colliding element names is much harder, though. If the user has an element in the source that has the same name as a Sparv annotation (including namespace, e.g. swener.ne), the source element will be overwritten when Sparv does its analysis. I think we should warn the user if this occurs, but how and when?

"Colliding" elements are okay if Sparv only adds attributes to existing elements (they aren't really colliding in this case), but they are problematic if Sparv invents new spans for an existing element (because this means that the element is overwritten). However, from only checking an export list we cannot know whether Sparv will invent new spans.

MartinHammarstedt commented 3 years ago

We can automatically figure out which annotators create new spans and store this information in the registry. A util function can use this info and warn about collisions.
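A rough sketch of what such a util function could look like, assuming the registry records for each annotator whether it creates new spans. The AnnotatorInfo class, its fields, and the function name are hypothetical and do not reflect Sparv's real registry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatorInfo:
    """Hypothetical registry entry for one annotator."""
    output: str          # name of the produced annotation, e.g. "swener.ne"
    creates_spans: bool  # True if the annotator invents new spans


def find_overwritten_elements(source_elements, registry):
    """Return names of source elements that a span-creating annotator would overwrite.

    Annotators that only add attributes to existing elements are ignored,
    since they do not really collide with the source.
    """
    source = set(source_elements)
    return sorted(
        info.output
        for info in registry
        if info.creates_spans and info.output in source
    )


# Example registry: one span-creating annotator, one attribute-only annotator.
registry = [
    AnnotatorInfo(output="swener.ne", creates_spans=True),
    AnnotatorInfo(output="text", creates_spans=False),
]

for name in find_overwritten_elements(["swener.ne", "text", "p"], registry):
    print(f"Warning: source element '{name}' will be overwritten by Sparv.")
```

Because the check relies on the span-creation flag rather than on the export list alone, an element that Sparv only decorates with attributes (text in this example) does not trigger a warning.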