Closed heatherleaf closed 4 years ago
From today's discussion:
It should be possible to specify (in the config file) that the text content of an element/span is to be replaced with the value of an element attribute. For example:
<w hidden="***">Hej</w> <w hidden="**">du</w> <w hidden="*****">glade</w><w hidden="*">!</w>
would then be exported as
<w>***</w> <w>**</w> <w>*****</w><w>*</w>
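The requested transformation can be sketched in a few lines of Python. This is only an illustration of the idea, not how the pipeline implements it: the function name `replace_text_with_attribute` and the wrapping `<s>` root element are assumptions added so that the fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET


def replace_text_with_attribute(xml_string, element="w", attribute="hidden"):
    """Replace the text of each matching element with one of its attribute values."""
    root = ET.fromstring(xml_string)
    for elem in root.iter(element):
        # Remove the attribute and, if it was present, use its value as the new text.
        value = elem.attrib.pop(attribute, None)
        if value is not None:
            elem.text = value
    return ET.tostring(root, encoding="unicode")


# The <s> wrapper is hypothetical; the example in the issue is a bare fragment.
source = (
    '<s><w hidden="***">Hej</w> <w hidden="**">du</w> '
    '<w hidden="*****">glade</w><w hidden="*">!</w></s>'
)
result = replace_text_with_attribute(source)
print(result)
```

Whitespace between the elements is stored as element tails and is therefore left untouched.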
The above example can now be achieved by setting
export:
    word: <token>:anonymised
and adding the following custom annotation
custom_annotations:
    - annotator: misc:anonymise
      params:
          out: <token>:anonymised
          chunk: <token:word>
Example output:
<sentence id="b1ac">
<token pos="IN" baseform="|hej|">***</token>
<token pos="MAD" baseform="|">*</token>
</sentence>
<sentence id="b142">
<token pos="PN" baseform="|den|den här|">***</token>
<token pos="AB" baseform="|här|den här:1|">***</token>
<token pos="VB" baseform="|vara|" sentiment_label="neutral">**</token>
<token pos="DT" baseform="|en|">**</token>
<token pos="NN" baseform="|korpus|">******</token>
<token pos="MAD" baseform="|">*</token>
</sentence>
The preserved_format XML export does not support this feature, though. It is unclear how things like
<token>ap<i>berget</i></token>
should be replaced with an arbitrary annotation (which may have a different string length than the input). We are not implementing it for the time being. All other exports (including pretty XML and scrambled XML) can handle this.
Great! But I have some clarification questions.
The above example can now be achieved by setting
export:
    word: <token>:anonymised
and adding the following custom annotation
custom_annotations:
    - annotator: misc:anonymise
      params:
          out: <token>:anonymised
          chunk: <token:word>
What does this mean? Does <token>:anonymised add an attribute "anonymised" to each "token" element? And does it say that the "token" elements are the ones that are annotated as <word> in the input corpus?
I.e., does it transform
<word>du</word> <word>suger</word>
into
<token anonymised="**">du</token> <token anonymised="*****">suger</token>
(as an intermediate step before actually exporting the corpus)?
(I know you don't store the intermediate annotations in XML format, but I hope you understand what I mean.)
Example output:
<sentence id="b142"> <token pos="PN" baseform="|den|den här|">***</token> <token pos="AB" baseform="|här|den här:1|">***</token> <token pos="VB" baseform="|vara|" sentiment_label="neutral">**</token>
I assume that it's possible to say that the baseform attribute should not be exported, right?
The preserved_format XML export does not support this feature, though. It is unclear how things like
<token>ap<i>berget</i></token>
should be replaced with an arbitrary annotation (which may have a different string length than the input). We are not implementing it for the time being. All other exports (including pretty XML and scrambled XML) can handle this.
This is unfortunate, because it's rather important to be able to preserve the whitespace if we want to be able to distribute Twitter annotations.
A simple solution is to remove all inner annotations/elements. I.e., in your example, you can just export
<token>********</token>
(assuming that the "anonymizer" replaces all characters by "*" and discards the inner <i> element).
I'm reopening, in the hope that this will be easy to fix :)
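A minimal sketch of that flattening step, written as a stand-alone post-processing function rather than as part of any exporter. The name `anonymise_token` and the `mask` parameter are hypothetical; the example assumes the token is available as a well-formed XML element.

```python
import xml.etree.ElementTree as ET


def anonymise_token(token, mask="*"):
    """Discard a token's inner elements and mask every character of its text."""
    # Collect all text inside the token, including text nested in child elements.
    full_text = "".join(token.itertext())
    # Remove the inner elements (for example <i>) so that only plain text remains.
    for child in list(token):
        token.remove(child)
    token.text = mask * len(full_text)
    return token


token = ET.fromstring("<token>ap<i>berget</i></token>")
flattened = ET.tostring(anonymise_token(token), encoding="unicode")
print(flattened)
```

The masked string keeps the length of the original text ("ap" plus "berget" gives eight characters), so the character positions that follow the token are not shifted.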
Great! But I have some clarification questions.
The above example can now be achieved by setting
export:
    word: <token>:anonymised
and adding the following custom annotation
custom_annotations:
    - annotator: misc:anonymise
      params:
          out: <token>:anonymised
          chunk: <token:word>
What does this mean? Does <token>:anonymised add an attribute "anonymised" to each "token" element? And does it say that the "token" elements are the ones that are annotated as <word> in the input corpus?
I.e., does it transform
<word>du</word> <word>suger</word>
into
<token anonymised="**">du</token> <token anonymised="*****">suger</token>
(as an intermediate step before actually exporting the corpus)?
(I know you don't store the intermediate annotations in XML format, but I hope you understand what I mean.)
Short answer: yes. The custom_annotations section adds a new annotation to each token, as in your example (<token anonymised="**">du</token>). You can read more about custom annotations in the user manual.
This bit specifies that the token strings in the export should be replaced with the <token>:anonymised annotation (this is not documented anywhere yet):
export:
    word: <token>:anonymised
Example output:
<sentence id="b142"> <token pos="PN" baseform="|den|den här|">***</token> <token pos="AB" baseform="|här|den här:1|">***</token> <token pos="VB" baseform="|vara|" sentiment_label="neutral">**</token>
I assume that it's possible to say that the baseform attribute should not be exported, right?
Yes, of course. There are no mandatory export attributes.
The preserved_format XML export does not support this feature, though. It is unclear how things like
<token>ap<i>berget</i></token>
should be replaced with an arbitrary annotation (which may have a different string length than the input). We are not implementing it for the time being. All other exports (including pretty XML and scrambled XML) can handle this.
This is unfortunate, because it's rather important to be able to preserve the whitespace if we want to be able to distribute Twitter annotations.
A simple solution is to remove all inner annotations/elements. I.e., in your example, you can just export
<token>********</token>
(assuming that the "anonymizer" replaces all characters by "*" and discards the inner <i> element).
I'm reopening, in the hope that this will be easy to fix :)
Unfortunately, I really don't think it's easy to fix. The preserved_format XML export is quite complicated: it refers directly to the input data in order to get every character in the correct position, so this export has no concept of inner elements. I don't think we can or should put any more time into this before the release.
Suggestion 1: You could make use of the text_headtail annotation, which gives you information about the whitespace occurring before and after each token.
Suggestion 2 (which actually was your own suggestion): Write a script to pre-process and anonymise the export.
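A script along the lines of Suggestion 2 might look roughly like this. It is a sketch under assumptions, not an existing tool: the function name `mask_text_nodes` is hypothetical, and the input is assumed to be a well-formed XML export in which everything outside the markup should be masked. Every non-whitespace character in every text node is replaced, so the markup, the attributes and all whitespace stay exactly where they were.

```python
import re
import xml.etree.ElementTree as ET


def mask_text_nodes(xml_string, mask="*"):
    """Mask all non-whitespace characters in text nodes, keeping markup and whitespace."""
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        # elem.text is the text before the first child; elem.tail is the text after the element.
        if elem.text:
            elem.text = re.sub(r"\S", mask, elem.text)
        if elem.tail:
            elem.tail = re.sub(r"\S", mask, elem.tail)
    return ET.tostring(root, encoding="unicode")


# Hypothetical export fragment with irregular whitespace and an inner element.
export = (
    '<text><token pos="IN">Hej</token>  <token pos="PN">du</token>\n'
    "<token>ap<i>berget</i></token></text>"
)
masked = mask_text_nodes(export)
print(masked)
```

Unlike flattening, this keeps inner elements such as <i> in place, at the cost of revealing where they occur inside a token.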
Unfortunately, I really don't think it's easy to fix. The preserved_format XML export is quite complicated: it refers directly to the input data in order to get every character in the correct position, so this export has no concept of inner elements. I don't think we can or should put any more time into this before the release.
Fair enough :)
Suggestion 1: You could make use of the text_headtail annotation, which gives you information about the whitespace occurring before and after each token.
Suggestion 2 (which actually was your own suggestion): Write a script to pre-process and anonymise the export.
Or suggestion 3: Inform about which tokeniser was used.
Sometimes we want to be able to export the annotations of a corpus without exporting the corpus itself. This could be the case when the corpus is licensed but the annotations are free. Examples are:
We need an export format in which the words, lemmas, etc. are not included, but other annotations are.
One thing to discuss and decide is which annotations are fine to release. E.g., lemmas are not OK, but POS/MSD tags should be. Also, how should tokenisation be handled if the original data is raw, untokenised text?
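To make the idea concrete, here is a rough sketch of such an annotations-only export, done as post-processing of an ordinary XML export. Everything named here is an assumption for illustration: the function `strip_licensed_content`, the allow-list `ALLOWED_ATTRIBUTES`, and the choice of `pos` and `msd` as releasable attributes, which is exactly the question still to be decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list; which annotations may be released is still undecided.
ALLOWED_ATTRIBUTES = {"pos", "msd"}


def strip_licensed_content(xml_string, token_tag="token", allowed=ALLOWED_ATTRIBUTES):
    """Remove token text and every token attribute that is not on the allow-list."""
    root = ET.fromstring(xml_string)
    for token in root.iter(token_tag):
        token.text = None  # drop the word itself
        for name in list(token.attrib):
            if name not in allowed:
                del token.attrib[name]  # drop e.g. lemmas (baseform)
    return ET.tostring(root, encoding="unicode")


sentence = (
    '<sentence id="b1ac"><token pos="IN" baseform="|hej|">Hej</token>'
    '<token pos="MAD" baseform="|">!</token></sentence>'
)
annotations_only = strip_licensed_content(sentence)
print(annotations_only)
```

An allow-list is used rather than a block-list so that a newly added annotation is withheld by default until someone decides it is safe to release.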