goodmami / penman

PENMAN notation (e.g. AMR) in Python
https://penman.readthedocs.io/
MIT License

processing :: in the meta-data #145

Open jheinecke opened 3 months ago

jheinecke commented 3 months ago

AMR files usually start with an ID and the sentence, followed by the actual PENMAN graph:

# ::id any-ID-001.1
# ::snt the cat is sleeping
# ::save-date Sat Jul 20, 2024 ::file test_0001_2.txt
( s / sleep-01
   :ARG0 ( c / cat))

The penman library parses this without any problem and provides the values in the metadata dictionary. Multiple ::keys on one comment line are parsed correctly.

However, I came across sentences that contain ::

# ::snt this must be separated  using :: unless it is a single line
...

Unfortunately, penman cuts the sentence at the :: and creates a metadata entry with a space as its key. For other comment lines, having multiple keys is OK, but on the line containing ::snt this rules out sentences that contain ::. Could this be changed?
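The truncation described above can be reproduced with a minimal sketch. This is not penman's actual code; `naive_metadata` is a hypothetical helper that assumes every :: on a comment line starts a new key, and the exact stray key penman produces may differ from the empty string shown here.

```python
def naive_metadata(line):
    """Split one '# ::key value ...' comment line on every '::'.

    Hypothetical, simplified splitter for illustration only; it is
    not taken from the penman source.
    """
    body = line.lstrip("#").strip()
    metadata = {}
    # Everything before the first '::' is ignored; each later chunk
    # is read as '<key> <value>'.
    for chunk in body.split("::")[1:]:
        key, _, value = chunk.partition(" ")
        metadata[key] = value.strip()
    return metadata


multi = naive_metadata("# ::save-date Sat Jul 20, 2024 ::file test_0001_2.txt")
broken = naive_metadata(
    "# ::snt this must be separated using :: unless it is a single line"
)
print(multi)   # two keys, as intended
print(broken)  # 'snt' is cut short and a stray key appears
```

With this splitter the multi-key line works as intended, while the ::snt value ends at "using" and the rest of the sentence lands under a stray key.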

goodmami commented 3 months ago

This part of parsing is separate from that of parsing the PENMAN notation and I don't have any real grammar defined, so there is likely room for improvement:

https://github.com/goodmami/penman/blob/1f52cbc7166ae6e9bb98a5382e45936436c12a9a/penman/_parse.py#L87-L99

For other comment lines, having multiple keys is OK, but on the line containing ::snt this rules out sentences that contain ::

What I'm hearing is that you think ::snt is a special case that must appear on its own comment line, or at least as the last metadata key on a line. Is that correct?

While I'd agree that it's unfortunate that :: can't currently appear as a literal in a metadata value, can you link to any docs or code showing that ::snt is indeed treated specially?

jheinecke commented 3 months ago

Hi, yes, your paraphrase is correct, and no, I have no link that specifies whether ::snt is a special case. I just stumbled across this when annotating a sentence with a badly formed smiley that happened to contain ::. I agree it's rare. Personally, I'd prefer a single ::key per comment line with no special cases, but this would mean reformatting the LDC data, where multiple keys happen to be on the same comment line.

bact commented 3 months ago

Is it possible to escape the :?

For example, if a literal :: is intended inside the ::snt value, we could write:

# ::snt this must be separated using \:\: unless it is a single line

But this also means more work to unescape the strings before further processing.

Having a single ::key per comment line is probably more maintainable.
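As a rough sketch of the escaping idea above: the two helpers below are hypothetical and not part of penman. They assume the caller escapes values before writing the comment line and unescapes them after reading the metadata back.

```python
def escape_value(value):
    r"""Replace each literal '::' with '\:\:' before writing a value."""
    return value.replace("::", r"\:\:")


def unescape_value(value):
    r"""Turn '\:\:' back into '::' after the metadata has been parsed."""
    return value.replace(r"\:\:", "::")


sentence = "this must be separated using :: unless it is a single line"
escaped = escape_value(sentence)
print("# ::snt " + escaped)
restored = unescape_value(escaped)
```

One limitation of this sketch: a sentence that already contains a literal \:\: would not survive the round trip, so a real scheme would also need a rule for escaping the backslash itself.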

jheinecke commented 3 months ago

I have just found an indirect hint in the AMR 3.0 documentation that ::snt is on a line of its own (cited from amr_annotation_3.0/docs/README.txt of the LDC2020T02 dataset; emphasis mine):

2.3 Structure and content of individual AMRs

Each AMR-sentence pair in the ./data/amrs files comprises the following data and fields:

  • Header line containing a unique workset-sentence ID for the source string that has been AMR annotated (::id), a completion timestamp for the AMR (::date), an anonymized ID for the annotator who produced the AMR (::annotator), and a marker for the AMRs of dually-annotated sentences indicating whether the AMR is the preferred representation for the sentence (::preferred)

  • Header line containing the English source sentence that has been AMR annotated (::snt)

  • Header line indicating the date on which the AMR was last saved (::save-date), and the file name for the AMR-sentence pair (::file)
  • Graph containing the manually generated AMR tree for the source sentence (see the AMR guidelines for a full description of the structure and semantics of AMR graphs).

In the LDC data, ::save-date and ::file occur on the same line, as do ::id, ::date and ::annotator, for instance:

# ::id wiki-minicorpus-a_0001.2 ::date 2017-10-17T03:05:10 ::annotator SDL-AMR-09 ::preferred
# ::snt Like all pitcher plants, it is carnivorous and uses its nectar to attract insects that drown in the pitcher and are digested by the plant.
# ::save-date Sat Jan 20, 2018 ::file wiki-minicorpus-a_0001_2.txt
(a / and ....

If we read the documentation strictly, then there are at least three comment lines, one of which contains only ::snt :satisfied:

goodmami commented 3 months ago

@bact Keep in mind that we are not proposing a new format, but working with an existing one. And escaping the : characters does prevent the splitting, but there is no mechanism currently for unescaping them unless you do it yourself.

@jheinecke Thanks for digging up that reference. While it doesn't give explicit parsing instructions, it does hint at the expected format.

I'm thinking of passing a configurable option that indicates which metadata keys are full-line (to help with both parsing and formatting). I'd like to put this information in the AMR model instead of building it into the parser, but currently the code is not set up to handle that, so some more changes would be needed.
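To make the full-line idea concrete, here is one way it could look. This is only a sketch under assumptions: `parse_metadata_line` and its `full_line_keys` parameter are hypothetical names, not an existing or planned penman API, and the sketch treats " ::" (a space followed by two colons) as the separator between keys.

```python
def parse_metadata_line(line, full_line_keys=frozenset({"snt"})):
    """Parse one '# ::key value ...' comment line into a dict.

    full_line_keys is a hypothetical option: a key listed there takes
    the whole remainder of the line as its value, '::' included.
    """
    body = line.lstrip("#").strip()
    metadata = {}
    while body.startswith("::"):
        key, _, rest = body[2:].partition(" ")
        if key in full_line_keys:
            # Full-line key: do not look for further '::' markers.
            metadata[key] = rest.strip()
            break
        # Ordinary key: the value ends where the next ' ::' begins.
        value, sep, tail = rest.partition(" ::")
        metadata[key] = value.strip()
        body = "::" + tail if sep else ""
    return metadata


snt = parse_metadata_line(
    "# ::snt this must be separated using :: unless it is a single line"
)
header = parse_metadata_line(
    "# ::id wiki-minicorpus-a_0001.2 ::date 2017-10-17T03:05:10 "
    "::annotator SDL-AMR-09 ::preferred"
)
saved = parse_metadata_line(
    "# ::save-date Sat Jan 20, 2018 ::file wiki-minicorpus-a_0001_2.txt"
)
```

Under these assumptions the ::snt value keeps its literal ::, while lines with several ordinary keys still split as in the LDC data. A formatter could use the same set to decide which keys to write on their own line.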