stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

[QUESTION] I still sometimes use convert_dict #1329

Closed ianporada closed 10 months ago

ianporada commented 10 months ago

I noticed the following note in the docs.

Note:
convert_dict is now marked as deprecated, as internally we use the Document object everywhere. If you have a use case where you need it, please let us know!

I still use convert_dict sometimes because libraries for parsing coreference resolution data from files in the CoNLL-2012 Shared Task format often require the CoNLL data as List[List[List]]. One example is conll_transform.compute_chains in boberle/corefconversion. Some datasets as recent as 2020/2021 use this format. That being said, I am not sure this use case is worth incentivizing, but it might make sense to have some way to convert the CoNLL-2012 coreference format into the Stanza format.

AngledLuffa commented 10 months ago

Got it. How are you getting the dict format for the CoNLL-U file? I'm wondering if there's an easier way than putting back the method that goes from dicts to lists, such as a method that goes directly from the Document object to a list. However, it's also easy to make sure this still works:

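import json

import stanza
from stanza.utils.conll import CoNLL

# Run a standard English pipeline on a short sentence
pipe = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = pipe("This is a test")
# Document -> list of dicts -> nested lists of CoNLL-U columns
stuff = CoNLL.convert_dict(doc.to_dict())
out = json.dumps(stuff, indent=2)
print(out)

Output: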
[
  [
    [
      "1",
      "This",
      "this",
      "PRON",
      "DT",
      "Number=Sing|PronType=Dem",
      "4",
      "nsubj",
      "_",
      "start_char=0|end_char=4"
    ],
    [
      "2",
      "is",
      "be",
      "AUX",
      "VBZ",
      "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
      "4",
      "cop",
      "_",
      "start_char=5|end_char=7"
    ],
    [
      "3",
      "a",
      "a",
      "DET",
      "DT",
      "Definite=Ind|PronType=Art",
      "4",
      "det",
      "_",
      "start_char=8|end_char=9"
    ],
    [
      "4",
      "test",
      "test",
      "NOUN",
      "NN",
      "Number=Sing",
      "0",
      "root",
      "_",
      "SpaceAfter=No|start_char=10|end_char=14"
    ]
  ]
]
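
For readers unfamiliar with the nested-list layout: the output above is indexed as sentences, then tokens, then the ten CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following is only a minimal sketch of how such rows could be read by column name. `CONLLU_FIELDS` and `rows_to_records` are hypothetical names for illustration and are not part of Stanza.

```python
# Column order defined by the CoNLL-U format.
# CONLLU_FIELDS and rows_to_records are hypothetical helpers, not Stanza API.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def rows_to_records(sentences):
    """Turn List[List[List[str]]] into List[List[dict]] keyed by column name."""
    return [
        [dict(zip(CONLLU_FIELDS, row)) for row in sentence]
        for sentence in sentences
    ]


# Two token rows copied from the output above.
sample = [[
    ["1", "This", "this", "PRON", "DT", "Number=Sing|PronType=Dem",
     "4", "nsubj", "_", "start_char=0|end_char=4"],
    ["4", "test", "test", "NOUN", "NN", "Number=Sing",
     "0", "root", "_", "SpaceAfter=No|start_char=10|end_char=14"],
]]

records = rows_to_records(sample)
print(records[0][0]["form"], records[0][0]["deprel"])
```

Note that every value, including ID and HEAD, is a string in this layout, so numeric comparisons need an explicit int() conversion.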

ianporada commented 10 months ago

Yeah, I am reading the data from a CoNLL-U file:

from stanza.utils.conll import CoNLL

doc = CoNLL.conll2doc(conllu_fname)
doc_as_list = CoNLL.convert_dict(doc.to_dict())

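The CoNLL-2012 shared task format stores coreference information in the last column (the MISC column here) using simple round brackets: "(0" opens a mention of chain 0, "0)" closes it, and "(1)" is a single-token mention of chain 1. For example: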

1       John    _       _       _       _       0       _       _       (0
2       Bauer   _       _       _       _       1       _       _       0)
3       works   _       _       _       _       2       _       _       _
4       at      _       _       _       _       3       _       _       _
5       Stanford        _       _       _       _       4       _       _       (1)
6       .       _       _       _       _       5       _       _       _

1       He      _       _       _       _       0       _       _       (0)
2       has     _       _       _       _       1       _       _       _
3       been    _       _       _       _       2       _       _       _
4       there   _       _       _       _       3       _       _       (1)
5       4       _       _       _       _       4       _       _       _
6       years   _       _       _       _       5       _       _       _
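
To illustrate why the List[List[List]] form is convenient for this kind of data, here is a rough sketch of reading the bracket notation from the last column into coreference chains. It is not the implementation in boberle/corefconversion or in Stanza. `extract_chains` and `MENTION_RE` are hypothetical names, and the sketch assumes integer chain ids, "|" as the separator when several mentions start or end on one token, and mentions that do not cross sentence boundaries.

```python
import re
from collections import defaultdict

# Matches "(0", "0)" or "(0)". Integer chain ids are an assumption.
MENTION_RE = re.compile(r"(\()?(\d+)(\))?")


def extract_chains(sentences, coref_col=-1):
    """Map chain id -> list of (sentence index, start token, end token).

    Token indices are 0-based and inclusive. Hypothetical helper for
    illustration only.
    """
    chains = defaultdict(list)
    for sent_idx, sentence in enumerate(sentences):
        open_starts = defaultdict(list)  # chain id -> stack of start indices
        for tok_idx, row in enumerate(sentence):
            field = row[coref_col]
            if field == "_":
                continue
            for part in field.split("|"):
                match = MENTION_RE.fullmatch(part)
                if match is None:
                    continue  # not a coreference annotation
                chain_id = int(match.group(2))
                if match.group(1):  # opening bracket
                    open_starts[chain_id].append(tok_idx)
                if match.group(3) and open_starts[chain_id]:  # closing bracket
                    start = open_starts[chain_id].pop()
                    chains[chain_id].append((sent_idx, start, tok_idx))
    return dict(chains)


# The example above, split into nested lists of columns.
raw = """\
1 John _ _ _ _ 0 _ _ (0
2 Bauer _ _ _ _ 1 _ _ 0)
3 works _ _ _ _ 2 _ _ _
4 at _ _ _ _ 3 _ _ _
5 Stanford _ _ _ _ 4 _ _ (1)
6 . _ _ _ _ 5 _ _ _

1 He _ _ _ _ 0 _ _ (0)
2 has _ _ _ _ 1 _ _ _
3 been _ _ _ _ 2 _ _ _
4 there _ _ _ _ 3 _ _ (1)
5 4 _ _ _ _ 4 _ _ _
6 years _ _ _ _ 5 _ _ _
"""
sentences = [
    [line.split() for line in block.splitlines()]
    for block in raw.strip().split("\n\n")
]

chains = extract_chains(sentences)
print(chains)
```

On the example, chain 0 groups "John Bauer" with "He", and chain 1 groups "Stanford" with "there".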

AngledLuffa commented 10 months ago

Got it. Some part of me wants to change the interface so that the to_dict() call isn't necessary, but I guess that's creating work for no reason.