Export of annotations without the source corpus

heatherleaf commented 4 years ago

Sometimes we want to export the annotations of a corpus without exporting the corpus itself. This could be the case when the corpus is licensed but the annotations are free. Examples are:

Svensk trädbank (STB): It is built upon SUC, which requires a license, but the STB trees are free.
Twitter: According to the license, we may distribute only the tweet IDs, not the tweets themselves. But if we run Sparv on our Twitter corpora, we are allowed to distribute our annotations, and anyone who wants to use them can then download the tweets from Twitter themselves.
In principle, all automatic annotations that Sparv produces on licensed corpora. The annotations themselves are not licensed, so it should be possible to release them freely.

We need an export format that leaves out the words, lemmas, etc. but includes the other annotations.

One thing to discuss and decide is which annotations are fine to release. E.g., lemmas are not OK, but POS/MSD tags should be. Another is how to handle tokenisation when the original data is raw, untokenised text.

heatherleaf commented 4 years ago

From today's discussion:

It should be possible (in the config file) to specify that the text content of an element/span is to be replaced with the value of an attribute of that element. E.g.:

<w hidden="***">Hej</w> <w hidden="**">du</w> <w hidden="*****">glade</w><w hidden="*">!</w>

is then exported as

<w>***</w> <w>**</w> <w>*****</w><w>*</w>
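
A minimal sketch of the replacement described above, written as a standalone Python script rather than as anything Sparv actually does. The function name `replace_text_with_attribute` and the wrapping `<s>` element (added only so the snippet is well-formed XML) are assumptions made for illustration. It handles only elements without child elements.

```python
import xml.etree.ElementTree as ET


def replace_text_with_attribute(xml_string, tag="w", attr="hidden"):
    """Replace the text of every <tag> element with the value of `attr`.

    The attribute itself is dropped from the output. Elements that lack
    the attribute are left untouched.
    """
    root = ET.fromstring(xml_string)
    for elem in root.iter(tag):
        value = elem.attrib.pop(attr, None)
        if value is not None:
            elem.text = value
    return ET.tostring(root, encoding="unicode")


# The <s> wrapper is hypothetical; the <w> elements are the ones from the example.
source = (
    '<s><w hidden="***">Hej</w> <w hidden="**">du</w> '
    '<w hidden="*****">glade</w><w hidden="*">!</w></s>'
)
result = replace_text_with_attribute(source)
print(result)
```

Whitespace between the elements is kept because ElementTree stores it as the `tail` of each element, which the function never modifies.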

anne17 commented 4 years ago

The above example can now be achieved by setting

export:
    word: <token>:anonymised

and adding the following custom annotation

custom_annotations:
    - annotator: misc:anonymise
      params:
          out: <token>:anonymised
          chunk: <token:word>

Example output:

    <sentence id="b1ac">
      <token pos="IN" baseform="|hej|">***</token>
      <token pos="MAD" baseform="|">*</token>
    </sentence>
    <sentence id="b142">
      <token pos="PN" baseform="|den|den här|">***</token>
      <token pos="AB" baseform="|här|den här:1|">***</token>
      <token pos="VB" baseform="|vara|" sentiment_label="neutral">**</token>
      <token pos="DT" baseform="|en|">**</token>
      <token pos="NN" baseform="|korpus|">******</token>
      <token pos="MAD" baseform="|">*</token>
    </sentence>
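
The masking visible in this output (one `*` per character of the original word) can be sketched in a few lines of Python. This only illustrates the idea and is not the code of the `misc:anonymise` annotator. The function name, the `mask` parameter and the token list are assumptions, with the words guessed from the baseforms in the second sentence.

```python
def anonymise(word, mask="*"):
    """Return a string of the same length as `word`, made up of `mask` only."""
    return mask * len(word)


# Hypothetical input tokens, guessed from the baseforms in the example output.
tokens = ["den", "här", "är", "en", "korpus", "."]
masked = [anonymise(token) for token in tokens]
print(masked)
```

Keeping the length of each word is a design choice: it reveals slightly more about the source text than a fixed-length mask would, but it keeps character offsets aligned with the original.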

The preserved_format XML export does not support this feature though. It is unclear how things like

<token>ap<i>berget</i></token>

should be replaced with an arbitrary annotation (that may have a different string length than the input). We are not implementing it for the time being. All other exports (including pretty xml and scrambled xml) can handle this though.

heatherleaf commented 4 years ago

Great! But I have some clarification questions.

> The above example can now be achieved by setting
>
> export:
>     word: <token>:anonymised
>
> and adding the following custom annotation
>
> custom_annotations:
>     - annotator: misc:anonymise
>       params:
>           out: <token>:anonymised
>           chunk: <token:word>

What does this mean? Does <token>:anonymised add an attribute "anonymised" to each "token" element? And does it say that the "token" elements are the ones that are annotated as <word> in the input corpus?

I.e., does it transform <word>du</word> <word>suger</word> into <token anonymised="**">du</token> <token anonymised="*****">suger</token>? (as an intermediate step before actually exporting the corpus)

(I know you don't store the intermediate annotations in XML format, but I hope you understand what I mean)

heatherleaf commented 4 years ago

> Example output:
>
>     <sentence id="b142">
>       <token pos="PN" baseform="|den|den här|">***</token>
>       <token pos="AB" baseform="|här|den här:1|">***</token>
>       <token pos="VB" baseform="|vara|" sentiment_label="neutral">**</token>

I assume that it's possible to say that the baseform attribute should not be exported, right?

heatherleaf commented 4 years ago

> The preserved_format XML export does not support this feature though. It is unclear how things like
>
> <token>ap<i>berget</i></token>
>
> should be replaced with an arbitrary annotation (that may have a different string length than the input). We are not implementing it for the time being. All other exports (including pretty xml and scrambled xml) can handle this though.

This is unfortunate, because preserving the whitespace is rather important if we want to distribute Twitter annotations.

A simple solution is to remove all inner annotations/elements. I.e., in your example, you can just export

<token>********</token>

(assuming that the "anonymizer" replaces all characters by "*" and discards the inner <i> element)

I'm reopening, in the hope that it will be easy to fix this :)

anne17 commented 4 years ago

> Great! But I have some clarification questions.
>
> > The above example can now be achieved by setting
> >
> > export:
> >     word: <token>:anonymised
> >
> > and adding the following custom annotation
> >
> > custom_annotations:
> >     - annotator: misc:anonymise
> >       params:
> >           out: <token>:anonymised
> >           chunk: <token:word>
>
> What does this mean? Does <token>:anonymised add an attribute "anonymised" to each "token" element? And does it say that the "token" elements are the ones that are annotated as <word> in the input corpus?
>
> I.e., does it transform <word>du</word> <word>suger</word> into <token anonymised="**">du</token> <token anonymised="*****">suger</token>? (as an intermediate step before actually exporting the corpus)
>
> (I know you don't store the intermediate annotations in XML format, but I hope you understand what I mean)

Short answer: yes. The custom_annotations stuff adds a new annotation to each token as in your example (<token anonymised="**">du</token>). You can read more about custom annotations in the user manual. This bit defines that the strings in the export should be exchanged for the <token>:anonymised annotation (this is not documented anywhere yet):

export:
    word: <token>:anonymised

anne17 commented 4 years ago

> > Example output:
> >
> >     <sentence id="b142">
> >       <token pos="PN" baseform="|den|den här|">***</token>
> >       <token pos="AB" baseform="|här|den här:1|">***</token>
> >       <token pos="VB" baseform="|vara|" sentiment_label="neutral">**</token>
>
> I assume that it's possible to say that the baseform attribute should not be exported, right?

Yes, of course. There are no mandatory export attributes.

anne17 commented 4 years ago

> > The preserved_format XML export does not support this feature though. It is unclear how things like
> >
> > <token>ap<i>berget</i></token>
> >
> > should be replaced with an arbitrary annotation (that may have a different string length than the input). We are not implementing it for the time being. All other exports (including pretty xml and scrambled xml) can handle this though.
>
> This is unfortunate, because preserving the whitespace is rather important if we want to distribute Twitter annotations.
>
> A simple solution is to remove all inner annotations/elements. I.e., in your example, you can just export
>
> <token>********</token>
>
> (assuming that the "anonymizer" replaces all characters by "*" and discards the inner <i> element)
>
> I'm reopening, in the hope that it will be easy to fix this :)

Unfortunately I really don't think it's easy to fix. The preserved_format XML export is quite complicated. The way it works is that it refers directly to the indata in order to get every character in the correct position. Therefore there is no such concept of inner elements for this export. I don't think we can/should put any more time into this before the release.

Suggestion 1: You could make use of the text_headtail annotation that gives you information about whitespaces occurring before and after each token.

Suggestion 2 (which actually was your own suggestion): Write a script to pre-process and anonymise the export.
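
A rough sketch of what such a script could look like, assuming it is run on an XML export that still contains the original text and that inner elements inside a token may simply be discarded, as proposed earlier in this thread. The function name `mask_tokens` and the `<text>` wrapper in the sample are hypothetical, and nothing here is part of Sparv.

```python
import xml.etree.ElementTree as ET


def mask_tokens(xml_string, tag="token", mask="*"):
    """Mask the text of every <tag> element and drop its inner elements.

    Whitespace between tokens is stored in each element's tail and is
    left as it is, so the spacing of the original text survives.
    """
    root = ET.fromstring(xml_string)
    # Collect the tokens first so the tree is not modified while iterating.
    for token in list(root.iter(tag)):
        # itertext() also yields the text of inner elements such as <i>.
        full_text = "".join(token.itertext())
        for child in list(token):
            token.remove(child)
        token.text = mask * len(full_text)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample with an inner element and irregular whitespace.
source = "<text><token>ap<i>berget</i></token>  <token>nu</token>\n</text>"
masked_export = mask_tokens(source)
print(masked_export)
```

Token attributes such as `pos` are kept unchanged by this sketch; any attribute that could reveal the original word would have to be left out of the export separately.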

heatherleaf commented 4 years ago

> Unfortunately I really don't think it's easy to fix. The preserved_format XML export is quite complicated. The way it works is that it refers directly to the indata in order to get every character in the correct position. Therefore there is no such concept of inner elements for this export. I don't think we can/should put any more time into this before the release.

Fair enough :)

> Suggestion 1: You could make use of the text_headtail annotation that gives you information about whitespaces occurring before and after each token.
>
> Suggestion 2 (which actually was your own suggestion): Write a script to pre-process and anonymise the export.

Or suggestion 3: Inform about which tokeniser was used.

spraakbanken / sparv-pipeline

Export of annotations without the source corpus #31