Good Morning. I realize the UMR notation encodes a lot of information incorporating many areas of research. I found some parts of the notation confusing, and I thought it might be useful to identify potential sources of confusion and ask questions and offer some feedback. Thank you in advance for your replies.
Capitalization and Naming Conventions
I noticed some inconsistencies in the guidelines related to capitalization and naming of concepts, relations, and attribute values:
Sometimes concepts and relations are capitalized: Singular, Paucal
Sometimes all caps is used: AUTH, PRESENT_REF
Sometimes lowercase is used: :ref-number, imperative (Imperative is also used)
Sometimes camel-case is used to distinguish separate words: :FullAff
Sometimes hyphens are used to distinguish separate words: :ref-person
Sometimes underscores are used to distinguish separate words: PRESENT_REF
AMR concepts and relations use lowercase words separated by hyphens. The only exceptions are core-role relations (:ARGX) to make them visually stand out and names in quotes. Even attribute values such as imperative and expressive are lowercase in standard AMR.
With that in mind I would request:
I would strongly encourage you to keep the AMR naming convention in UMR (lowercase names with hyphens between words). That will reduce errors in downstream applications in the long run, reduce annotator typos from inconsistent capitalization, and improve the backward-compatibility of UMR and AMR. A slight change from AMR’s conventions might be fine as long as you are consistent and the convention is clear and simple.
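As a rough illustration of how mechanical such a convention could be, here is a minimal sketch of a normalizer. The helper name `to_amr_style` and the regular expressions are my own assumptions, not part of any UMR or AMR tooling. It only changes casing and word separators, and it leaves core roles such as :ARG0 untouched.

```python
import re

# Core roles (:ARG0, :ARG1-of, ...) are the one exception AMR makes to lowercase.
CORE_ROLE = re.compile(r"^:ARG\d+(-of)?$")


def to_amr_style(name: str) -> str:
    """Convert a concept, relation, or attribute value to lowercase-with-hyphens."""
    if CORE_ROLE.match(name):
        return name
    prefix = ":" if name.startswith(":") else ""
    body = name[len(prefix):]
    # Put a hyphen at each lowercase-to-uppercase boundary (FullAff -> Full-Aff).
    body = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", body)
    # Underscores become hyphens (PRESENT_REF -> PRESENT-REF).
    body = body.replace("_", "-")
    return prefix + body.lower()


for original in ["Singular", "PRESENT_REF", ":FullAff", ":ref-person", ":ARG0"]:
    print(original, "->", to_amr_style(original))
```

A check like this could also run as a linter over annotated files to flag names that do not round-trip unchanged.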
Abbreviations and Acronyms
Some of UMR’s notations rely on acronyms and difficult-to-read abbreviations such as DCT, AUTH, PrtAff, and modstr. AMR is designed to be human-readable, which is important for its use as an explanatory tool, and I believe it also reduces the learning curve for reading and annotating AMR. I would also stress that these guidelines won’t just be used by linguists, but also by computer scientists who want to be able to read or parse UMR.
With that in mind I would request:
Please avoid acronyms in the UMR representation almost entirely, and use abbreviations only in moderation. For example, instead of DCT, AUTH, PrtAff, and modstr, you could write doc-creation-time, author, :partial-affirm, and :modal-strength.
Rather than using linguistic acronyms in the guidelines, I would spell them out, e.g., Tense, Aspect, Mood instead of TAM.
Transliteration
You have added a nice notation for transliterating words, for example when annotating low-resource languages:
(e / enhleama-00 'travel'
…
I like this notation for transliteration, and I think it will be very useful for annotating and using multilingual UMR data. However, please be aware that this notation changes the AMR data structure and many code libraries for reading, writing, and representing the AMR data will need to be updated to even be able to run on UMR inputs with this notation. For example, Smatch and penman will fail to run if you try to run their current code on a UMR with transliteration. A possible workaround could be to represent transliteration as attributes, e.g. e / enhleama-00 :transl "travel", at least until this notation is supported in libraries that currently work on AMR (I think you could do this in a post-processing script rather than changing the notation).
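To make the post-processing idea concrete, here is a minimal sketch that rewrites the quoted gloss after a concept into a :transl attribute. The function name `gloss_to_attribute`, the regular expression, and the toy input are my own assumptions; it also assumes the gloss is wrapped in plain single quotes and contains no quote characters itself.

```python
import re

# Matches "(var / concept" followed by a single-quoted gloss.
GLOSS = re.compile(r"(\(\s*[^\s()/]+\s*/\s*[^\s()]+)\s+'([^']*)'")


def gloss_to_attribute(umr_text: str, role: str = ":transl") -> str:
    """Rewrite (e / concept 'gloss' ...) as (e / concept :transl "gloss" ...)."""
    return GLOSS.sub(
        lambda m: f'{m.group(1)} {role} "{m.group(2)}"',
        umr_text,
    )


# Toy input, invented for illustration; real UMRs have more structure.
toy = "(e / enhleama-00 'travel')"
print(gloss_to_attribute(toy))
```

The reverse mapping would be just as short, so annotators could keep the readable notation while tools read the attribute form.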
Questions/Requests:
Do you want the transliteration to be stored in the UMR data structure or is this only for readability for annotators?
Document-Level Representation
As with transliteration, the notation for document-level representations that is shown in the guidelines will not work with current code for AMRs, because that code cannot handle a construct like :temporal((DCT :depends-on s1t2) (s1t2 :contained s1t)). If these representations are always connected graphs, it might be good to make them conform to AMR notation.
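Until that is settled, a small script can at least read the triples out of this notation and check whether they form a connected graph. The sketch below is only an illustration: the helper names `parse_triples` and `is_connected` are hypothetical, and it assumes every triple has the flat shape (source :relation target).

```python
import re
from collections import defaultdict

# One flat triple: "(source :relation target)".
TRIPLE = re.compile(r"\(\s*([^\s()]+)\s+(:[^\s()]+)\s+([^\s()]+)\s*\)")


def parse_triples(text):
    """Return every (source, relation, target) triple found in the text."""
    return [m.groups() for m in TRIPLE.finditer(text)]


def is_connected(triples):
    """Check whether the triples form one connected graph, ignoring direction."""
    neighbours = defaultdict(set)
    for source, _, target in triples:
        neighbours[source].add(target)
        neighbours[target].add(source)
    if not neighbours:
        return True
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for other in neighbours[node] - seen:
            seen.add(other)
            stack.append(other)
    return len(seen) == len(neighbours)


triples = parse_triples(":temporal((DCT :depends-on s1t2) (s1t2 :contained s1t))")
print(triples)
print(is_connected(triples))
```

If the check holds for all documents, the same triples could be serialized as an ordinary rooted graph instead.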
Questions/Requests:
Is there one document-level graph per document or per sentence? If it's one per document, should the root node be called document rather than sentence?
The guidelines use node IDs like s1a to refer to node a in the first sentence. Could you add a dot between s1 and a to make it s1.a? I think that would make it visually easier to read and it would clarify that s1 is the namespace of a.
Could you include concepts in the document-level graph as well, just for readability? So, instead of s1t2, you can write (s1.t2 / today).
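Existing data could be migrated to the dotted form mechanically. A minimal sketch, assuming node IDs always look like s, a sentence number, then a variable that starts with a lowercase letter (the helper name `add_namespace_dot` is hypothetical):

```python
import re

# "s" + sentence number, then a variable such as "a" or "t2".
NODE_ID = re.compile(r"^(s\d+)([a-z][a-z0-9]*)$")


def add_namespace_dot(node_id: str) -> str:
    """Turn s1a into s1.a; anything that is not a sentence-scoped ID is unchanged."""
    return NODE_ID.sub(r"\1.\2", node_id)


for node_id in ["s1a", "s1t2", "s1t", "DCT"]:
    print(node_id, "->", add_namespace_dot(node_id))
```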
Temporal Relations
I found the notation of :after and :before confusing at first. I read the relation A :after B as “A happens after B happens”, but according to the guidelines, it is the other way around. I think it’s easier in English to read it as “A happens after B happens” and other people might be confused by this as well.
Questions/Requests:
Would it make sense to switch the direction of :after and :before relations so that A :after B can be read as “A happens after B” and A :before B can be read as “A happens before B”?
In one or two places in the guidelines you use a relation :op in (b / before :op (n / now)). I would change that to :op1 to stay consistent with AMR.
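If the direction were switched, existing annotations could be converted by relabeling rather than re-annotating. The sketch below assumes, as described above, that the guidelines currently read A :after B in the opposite direction from the English sentence, so swapping the two labels yields the proposed reading. The helper name `to_proposed_reading` and the triple format are my own assumptions.

```python
# Swapping the labels is equivalent to swapping the two arguments.
FLIP = {":after": ":before", ":before": ":after"}


def to_proposed_reading(triples):
    """Relabel :after/:before so that (A, ':after', B) reads 'A happens after B'."""
    return [
        (source, FLIP.get(relation, relation), target)
        for source, relation, target in triples
    ]


current = [("A", ":after", "B"), ("s1t2", ":contained", "s1t")]
print(to_proposed_reading(current))
```

Relations other than :after and :before pass through unchanged.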