Closed by ianporada 10 months ago
Got it. How are you getting the dict format for the CoNLL-U file? I'm wondering if there's an easier way than putting back the method that goes from dicts to lists, such as a method that goes directly from the Document object to a list. However, it's also easy to make sure this still works:
import json
import stanza
from stanza.utils.conll import CoNLL
pipe = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = pipe("This is a test")
stuff = CoNLL.convert_dict(doc.to_dict())
out = json.dumps(stuff, indent=2)
print(out)
[
  [
    [
      "1",
      "This",
      "this",
      "PRON",
      "DT",
      "Number=Sing|PronType=Dem",
      "4",
      "nsubj",
      "_",
      "start_char=0|end_char=4"
    ],
    [
      "2",
      "is",
      "be",
      "AUX",
      "VBZ",
      "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
      "4",
      "cop",
      "_",
      "start_char=5|end_char=7"
    ],
    [
      "3",
      "a",
      "a",
      "DET",
      "DT",
      "Definite=Ind|PronType=Art",
      "4",
      "det",
      "_",
      "start_char=8|end_char=9"
    ],
    [
      "4",
      "test",
      "test",
      "NOUN",
      "NN",
      "Number=Sing",
      "0",
      "root",
      "_",
      "SpaceAfter=No|start_char=10|end_char=14"
    ]
  ]
]
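For anyone reading the output above: each innermost list follows the ten-column CoNLL-U order. Here is a minimal sketch that labels one row with the standard column names. The `row_to_fields` helper and the `CONLLU_FIELDS` constant are hypothetical names used for illustration, not part of Stanza.

```python
# Standard CoNLL-U column names, in file order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def row_to_fields(row):
    """Hypothetical helper: pair one 10-item token row with the CoNLL-U column names."""
    if len(row) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(row)}")
    return dict(zip(CONLLU_FIELDS, row))

# First token row from the output above.
row = ["1", "This", "this", "PRON", "DT", "Number=Sing|PronType=Dem",
       "4", "nsubj", "_", "start_char=0|end_char=4"]
fields = row_to_fields(row)
print(fields["form"], fields["head"], fields["deprel"])  # prints: This 4 nsubj
```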
Yeah, I am reading the data from a CoNLL-U file:
from stanza.utils.conll import CoNLL
doc = CoNLL.conll2doc(conllu_fname)
doc_as_list = CoNLL.convert_dict(doc.to_dict())
The CoNLL-2012 shared task format stores coreference information in the MISC column using simple round brackets, e.g.:
1 John _ _ _ _ 0 _ _ (0
2 Bauer _ _ _ _ 1 _ _ 0)
3 works _ _ _ _ 2 _ _ _
4 at _ _ _ _ 3 _ _ _
5 Stanford _ _ _ _ 4 _ _ (1)
6 . _ _ _ _ 5 _ _ _

1 He _ _ _ _ 0 _ _ (0)
2 has _ _ _ _ 1 _ _ _
3 been _ _ _ _ 2 _ _ _
4 there _ _ _ _ 3 _ _ (1)
5 4 _ _ _ _ 4 _ _ _
6 years _ _ _ _ 5 _ _ _
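To make the bracket notation concrete, here is a rough sketch of how mentions could be read out of the last column once the data is available as nested lists. It is not the `compute_chains` function from corefconversion, and `extract_mentions` is a hypothetical name. It assumes that `(N` opens a mention of chain N, `N)` closes it, `(N)` is a single-token mention, and that several markers on one token are separated by `|`.

```python
import re
from collections import defaultdict

def extract_mentions(sentences, col=-1):
    """Hypothetical helper: read bracketed coreference markers from one column.

    `sentences` is List[List[List[str]]]. Returns
    {chain_id: [(sent_idx, start_tok, end_tok), ...]} with 0-based, inclusive offsets.
    """
    chains = defaultdict(list)
    for sent_idx, sentence in enumerate(sentences):
        open_starts = defaultdict(list)  # chain id -> stack of open start positions
        for tok_idx, row in enumerate(sentence):
            for part in row[col].split("|"):
                m = re.fullmatch(r"(\()?(\d+)(\))?", part)
                if m is None:
                    continue  # "_" or anything that is not a coreference marker
                chain_id = int(m.group(2))
                if m.group(1):  # opening bracket
                    open_starts[chain_id].append(tok_idx)
                if m.group(3):  # closing bracket
                    if not open_starts[chain_id]:
                        raise ValueError(f"unmatched close for chain {chain_id}")
                    start = open_starts[chain_id].pop()
                    chains[chain_id].append((sent_idx, start, tok_idx))
    return dict(chains)

# The example above, with a blank line between the two sentences.
EXAMPLE = """\
1 John _ _ _ _ 0 _ _ (0
2 Bauer _ _ _ _ 1 _ _ 0)
3 works _ _ _ _ 2 _ _ _
4 at _ _ _ _ 3 _ _ _
5 Stanford _ _ _ _ 4 _ _ (1)
6 . _ _ _ _ 5 _ _ _

1 He _ _ _ _ 0 _ _ (0)
2 has _ _ _ _ 1 _ _ _
3 been _ _ _ _ 2 _ _ _
4 there _ _ _ _ 3 _ _ (1)
5 4 _ _ _ _ 4 _ _ _
6 years _ _ _ _ 5 _ _ _
"""
sentences = [[line.split() for line in block.splitlines()]
             for block in EXAMPLE.strip().split("\n\n")]
mentions = extract_mentions(sentences)
print(mentions)
# {0: [(0, 0, 1), (1, 0, 0)], 1: [(0, 4, 4), (1, 3, 3)]}
```

So chain 0 links "John Bauer" to "He", and chain 1 links "Stanford" to "there".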
Got it. Some part of me wants to change the interface so that the to_dict() isn't necessary, but I guess that's creating work for no reason.
I noticed the following note in the docs.
I still use convert_dict sometimes because libraries for parsing coreference resolution data from CoNLL files in the CoNLL-2012 Shared Task format (for example, conll_transform.compute_chains in boberle/corefconversion) often require accessing the CoNLL data as List[List[List]]. Some datasets as recent as 2020/2021 use this format. That being said, I am not sure if this is a use case worth incentivizing, but maybe it would make sense to have some way to convert the CoNLL-2012 coreference format into the Stanza format.
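As a point of comparison, the List[List[List]] shape can also be produced straight from the file text without going through Stanza at all. This is only a sketch under simple assumptions: sentences are separated by blank lines, lines starting with `#` are comments, and columns are tab-separated with a whitespace fallback. `conll_text_to_lists` is a hypothetical helper, not a Stanza function.

```python
from typing import List

def conll_text_to_lists(text: str) -> List[List[List[str]]]:
    """Hypothetical helper: split CoNLL-style text into sentences -> tokens -> columns."""
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not stripped.startswith("#"):
            # CoNLL-U is tab-separated; fall back to any whitespace otherwise.
            current.append(stripped.split("\t") if "\t" in stripped else stripped.split())
    if current:
        sentences.append(current)
    return sentences

sample = (
    "# sent_id = 1\n"
    "1\tHe\t_\t_\t_\t_\t0\t_\t_\t(0)\n"
    "2\tworks\t_\t_\t_\t_\t1\t_\t_\t_\n"
    "\n"
    "1\tHere\t_\t_\t_\t_\t0\t_\t_\t_\n"
)
nested = conll_text_to_lists(sample)
print(len(nested), nested[0][0][1], nested[0][0][-1])  # prints: 2 He (0)
```

This skips everything Stanza does on load (validation, multi-word tokens, building a Document), so it only helps when the downstream tool wants the raw columns.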