Open jheinecke opened 3 months ago
This part of parsing is separate from that of parsing the PENMAN notation and I don't have any real grammar defined, so there is likely room for improvement:
For other comment lines having multiple keys is OK, but for the line containing ::snt it forbids having sentences with ::
What I'm hearing is that you think ::snt is a special case that must appear on its own comment line, or at least as the last metadata key on a line. Is that correct?
While I'd agree that it's unfortunate that :: can't currently appear as a literal in a metadata value, can you link to any docs or code showing that ::snt is indeed treated specially?
Hi,
yes, your paraphrase is correct, and no, I have no link which specifies whether or not ::snt is a special case. I just stumbled across this when annotating a sentence with a badly formed smiley which happened to contain ::. I agree it's rare.
Personally I'd prefer a single ::key per comment line, no special cases, but this would mean reformatting the LDC data where multiple keys happen to be on the same comment line.
Is it possible to escape the : ?
For example, if ::snt is intended, we may have:
# ::snt this must be separated using \:\: unless it is a single line
But this also means more work to unescape the strings before further processing.
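The escaping idea above can be sketched as a pair of helper functions. This is only an illustration: the names escape_colons and unescape_colons are hypothetical and are not part of the penman library, and the \:\: convention is the one proposed in this comment, not an existing part of the AMR format.

```python
def escape_colons(text):
    """Replace every literal '::' so a splitter looking for '::' will not see it.

    Hypothetical helper, not part of penman.
    """
    return text.replace('::', r'\:\:')


def unescape_colons(text):
    """Undo escape_colons() after the metadata value has been extracted."""
    return text.replace(r'\:\:', '::')


sentence = 'this smiley is badly formed ::) indeed'
escaped = escape_colons(sentence)
restored = unescape_colons(escaped)
```

The extra unescaping step is the additional work mentioned above: every consumer of the metadata would have to call something like unescape_colons before using the sentence.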
Having a single ::key per comment line is probably more maintainable.
I have just found an indirect hint in the AMR 3.0 documentation that ::snt is on a single line (cited from amr_annotation_3.0/docs/README.txt of the LDC2020T02 dataset; emphasis mine):
2.3 Structure and content of individual AMRs
Each AMR-sentence pair in the ./data/amrs files comprises the following data and fields:
- Header line containing a unique workset-sentence ID for the source string that has been AMR annotated (::id), a completion timestamp for the AMR (::date), an anonymized ID for the annotator who produced the AMR (::annotator), and a marker for the AMRs of dually-annotated sentences indicating whether the AMR is the preferred representation for the sentence (::preferred)
- Header line containing the English source sentence that has been AMR annotated (::snt)
- Header line indicating the date on which the AMR was last saved (::save-date), and the file name for the AMR-sentence pair (::file)
- Graph containing the manually generated AMR tree for the source sentence (see the AMR guidelines for a full description of the structure and semantics of AMR graphs).
In the LDC data, ::save-date and ::file occur on the same line, as do ::id, ::date and ::annotator, for instance:
# ::id wiki-minicorpus-a_0001.2 ::date 2017-10-17T03:05:10 ::annotator SDL-AMR-09 ::preferred
# ::snt Like all pitcher plants, it is carnivorous and uses its nectar to attract insects that drown in the pitcher and are digested by the plant.
# ::save-date Sat Jan 20, 2018 ::file wiki-minicorpus-a_0001_2.txt
(a / and ....
If we read the documentation strictly, then there are at least three comment lines, one of which contains only ::snt :satisfied:
@bact Keep in mind that we are not proposing a new format, but working with an existing one. And escaping the : characters does prevent the splitting, but there is no mechanism currently for unescaping them unless you do it yourself.
@jheinecke Thanks for digging up that reference. While it doesn't give explicit parsing instructions, it does hint at the expected format.
I'm thinking of passing some configurable option that indicates which metadata keys are full-line (to help with both parsing and formatting). I'd like to put this information in the AMR model instead of building it into the parser, but currently the code is not set up to handle that, so some more changes would be needed.
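To make the idea of configurable full-line keys concrete, here is a minimal sketch. It is not penman's code or API: parse_comment_line and its full_line_keys parameter are hypothetical names, and the splitting logic is a simplified stand-in for the real parser.

```python
def parse_comment_line(line, full_line_keys=frozenset({'snt'})):
    """Parse one '# ::key value ...' comment line into a dict.

    If the first key is listed in full_line_keys, the rest of the line is
    taken verbatim as its value, so a literal '::' inside it is preserved.
    Hypothetical sketch, not the penman implementation.
    """
    body = line.lstrip('#').strip()
    if not body.startswith('::'):
        return {}
    first_key, _, rest = body[2:].partition(' ')
    if first_key in full_line_keys:
        return {first_key: rest.strip()}
    metadata = {}
    # Otherwise every '::' starts a new key, as on the LDC header lines.
    for chunk in body.split('::')[1:]:
        key, _, value = chunk.partition(' ')
        metadata[key] = value.strip()
    return metadata


snt = parse_comment_line('# ::snt a badly formed smiley ::) here')
header = parse_comment_line(
    '# ::id wiki-minicorpus-a_0001.2 ::date 2017-10-17T03:05:10 '
    '::annotator SDL-AMR-09 ::preferred'
)
```

With this approach the LDC data would not need reformatting: lines such as the ::id header keep their multi-key behaviour, and only the keys named in full_line_keys are treated as occupying the whole line.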
AMR files usually start with an id and the sentence before the actual PENMAN graph. The penman lib parses this without any problem and provides it in the metadata dictionary. Multiple ::keys are parsed correctly. However, I came across sentences which contain ::; unfortunately penman-lib cuts the sentence at the :: and creates a metadata entry with a space as key. For other comment lines, having multiple keys is OK, but for the line containing ::snt this forbids having sentences with ::. Could this be changed?
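The effect described above can be illustrated with a simplified splitter. This is an assumption about the general mechanism (splitting a comment line on every ::), not a copy of penman's parsing code, and naive_split_metadata is a hypothetical name; the exact stray key that penman produces may differ from what this sketch yields.

```python
def naive_split_metadata(line):
    """Split a '# ::key value ::key value' line on every '::'.

    Simplified stand-in for a metadata parser; not penman's actual code.
    """
    body = line.lstrip('#').strip()
    metadata = {}
    for chunk in body.split('::')[1:]:
        # The key is the text up to the first space; the rest is the value.
        key, _, value = chunk.partition(' ')
        metadata[key] = value.strip()
    return metadata


meta = naive_split_metadata('# ::snt I like it ::) really')
```

Here the sentence stored under snt is cut off at the second ::, and the remainder ends up under a spurious key.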