udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Reading speaker and other sentence-level comments #115

Closed ianporada closed 1 year ago

ianporada commented 1 year ago

Sometimes a CorefUD 1.1 document has speaker information as a sentence-level comment below the sent_id, as in the example below. Is there a way to recover this information from a Document?

# sent_id = sentence_1
# speaker = speaker_1

ianporada commented 1 year ago

I cannot find a way. It seems like this would need to be added as an additional root comment (e.g. $SPEAKER). Or, more generally, maybe it makes sense to store all sentence-level comments with the root as a string, in case others are of interest too.

martinpopel commented 1 year ago

Some CoNLL-U comments are standardized and exposed in the Udapi API, e.g. root.sent_id, root.text, root.newpar or root.newdoc. The remaining comments are stored in root.comment, which is a (possibly multi-line) string corresponding to all the comment lines, but excluding the # characters. So to extract the speaker, you need to use something like:

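import re

# tree.comment holds the remaining comment lines without the leading "#",
# so each line still starts with the space that followed it.
speaker = None
match = re.search(r"^ speaker = (.+)", tree.comment, re.M)
if match:
    speaker = match.group(1)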
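The same pattern can be wrapped in a small helper so that it works for any comment key, not only the speaker. This is only a sketch, not part of Udapi: `get_comment_value` is a hypothetical name, and the sample string merely imitates what root.comment is described as containing above (comment lines with the leading # removed).

```python
import re


def get_comment_value(comment, key):
    """Return the value of a `key = value` comment line, or None if absent."""
    # Allow optional spaces or tabs around the key and the "=" sign, and
    # use re.M so that "^" and "$" match at every line of the string.
    pattern = r"^[ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*(.+)$"
    match = re.search(pattern, comment, re.M)
    return match.group(1).strip() if match else None


# Stand-in for root.comment (hypothetical sample, not read from a real file).
sample_comment = " speaker = speaker_1\n"
print(get_comment_value(sample_comment, "speaker"))  # prints speaker_1
```

With a real Udapi tree, the call would presumably be `get_comment_value(tree.comment, "speaker")`.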
ianporada commented 1 year ago

I see, thanks! I was confused by the fact that standardized comments are replaced by tags in the comment attribute, but I understand now. See https://github.com/udapi/udapi-python/blob/a9050283fe1530e9f14dcbe5ffc10e64b2f85eae/udapi/block/read/conllu.py#LL42C29-L42C41
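For anyone who wants every remaining sentence-level comment at once, a rough sketch follows. It assumes that the tags mentioned above appear in the comment string as lines starting with `$` (for example `$SENT_ID`), which is my reading of the linked reader code rather than documented behavior. `parse_comments` is a hypothetical helper, not a Udapi function.

```python
def parse_comments(comment):
    """Collect `key = value` lines from a comment string into a dict."""
    values = {}
    for line in comment.splitlines():
        line = line.strip()
        # Skip empty lines and tag lines such as "$SENT_ID" (assumed format).
        if not line or line.startswith("$"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore free-text comments that contain no "="
            values[key.strip()] = value.strip()
    return values


# Stand-in for root.comment (hypothetical sample, not read from a real file).
sample_comment = "$SENT_ID\n speaker = speaker_1\n"
print(parse_comments(sample_comment))  # prints {'speaker': 'speaker_1'}
```

Free-text comment lines without an `=` are dropped here; whether that is acceptable depends on the data.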