stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Stanza Document model to dataframe #825

Open sangeethsn opened 3 years ago

sangeethsn commented 3 years ago

Hi, I got the following output from the NER process. I want it in the form of a dataframe, with "id", "text", "upos", "xpos", and "ner" as the column names. Is it possible to convert it into a dataframe?

[ [ { "id": 1, "text": "[", "upos": "PUNCT", "xpos": "-LRB-", "start_char": 0, "end_char": 1, "ner": "O" }, { "id": 2, "text": "'", "upos": "PUNCT", "xpos": "''", "start_char": 1, "end_char": 2, "ner": "O" } ], [ { "id": 1, "text": "OLD", "upos": "ADJ", "xpos": "NNP", "feats": "Degree=Pos", "start_char": 2, "end_char": 5, "ner": "B-FAC" }, { "id": 2, "text": "COAST", "upos": "PROPN", "xpos": "NNP", "feats": "Number=Sing", "start_char": 6, "end_char": 11, "ner": "I-FAC" }, { "id": 3, "text": "BRIDGE", "upos": "PROPN", "xpos": "NNP", "feats": "Number=Sing", "start_char": 12, "end_char": 18, "ner": "I-FAC" }, { "id": 4, "text": "1", "upos": "NUM", "xpos": "CD", "feats": "NumForm=Digit|NumType=Card", "start_char": 19, "end_char": 20, "ner": "E-FAC" },

AngledLuffa commented 3 years ago

There's no specific mechanism in Stanza to put the data in a dataframe, but you can access the individual words and do whatever you want with them, including putting them in a dataframe:

import stanza

pipe = stanza.Pipeline("en", processors="tokenize,pos,ner")

doc = pipe("Ragavan punched Teferi in his face")
print(doc)

for sentence in doc.sentences:
    for token in sentence.tokens:
        for word in token.words:
            # instead of printing, could put the rows in a dataframe
            print(word.id, word.text, word.upos, word.xpos, token.ner)
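
As a minimal sketch of the dataframe step, assuming the input is a list of sentences where each sentence is a list of dicts shaped like the output in the question: the helper names `flatten_sentences` and `to_dataframe` and the extra `sentence` column are made up for illustration and are not part of Stanza. pandas is imported inside `to_dataframe` so that the flattening step can run without it.

```python
from typing import Any, Dict, List, Sequence

# Columns requested in the question; other keys (feats, start_char, ...) are ignored.
COLUMNS = ("id", "text", "upos", "xpos", "ner")


def flatten_sentences(
    sentences: List[List[Dict[str, Any]]],
    columns: Sequence[str] = COLUMNS,
) -> List[Dict[str, Any]]:
    """Turn nested sentence/token dicts into one flat row per token."""
    rows = []
    for sent_idx, sentence in enumerate(sentences, start=1):
        for entry in sentence:
            # Hypothetical extra column so rows can be traced back to a sentence.
            row = {"sentence": sent_idx}
            # .get() leaves None where a processor did not produce that field.
            row.update({col: entry.get(col) for col in columns})
            rows.append(row)
    return rows


def to_dataframe(sentences, columns: Sequence[str] = COLUMNS):
    """Build a pandas DataFrame from the flattened rows (requires pandas)."""
    import pandas as pd

    return pd.DataFrame(
        flatten_sentences(sentences, columns),
        columns=["sentence", *columns],
    )


# First few entries of the output posted in the question.
sample = [
    [
        {"id": 1, "text": "[", "upos": "PUNCT", "xpos": "-LRB-",
         "start_char": 0, "end_char": 1, "ner": "O"},
        {"id": 2, "text": "'", "upos": "PUNCT", "xpos": "''",
         "start_char": 1, "end_char": 2, "ner": "O"},
    ],
    [
        {"id": 1, "text": "OLD", "upos": "ADJ", "xpos": "NNP",
         "feats": "Degree=Pos", "start_char": 2, "end_char": 5, "ner": "B-FAC"},
    ],
]

rows = flatten_sentences(sample)
```

With a real pipeline, something like `to_dataframe(doc.to_dict())` should give the same layout, on the assumption that `doc.to_dict()` returns this nested structure. Which keys are present still depends on the processors that were run.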

sangeethsn commented 3 years ago

Thank you so much

mohamad-quteifan commented 2 years ago

Are you able to apply a Stanza pipeline to a pandas dataframe and write the outputs to other columns?

AngledLuffa commented 2 years ago

What would you want that function to do?

One issue to consider is that the columns present are determined by which processors are used, which won't always be the same for different users or for different languages' pipelines.

mohamad-quteifan commented 2 years ago

I have a dataframe with a column of keywords that I am trying to POS-tag using Stanza. I want to return the output for each row in a new column with the tags attached, and then pull the nouns and adjectives from that new POS column.

AngledLuffa commented 2 years ago

There's already a bit of a question here: are you looking to tag individual words, or are the words in a sentence? The tool doesn't handle individual words. What happens for "bush", for example, or "dog", "cut", etc.?

mohamad-quteifan commented 2 years ago

Okay, gotcha. It's individual words, so that is where my problem lies.

AngledLuffa commented 2 years ago

Indeed, that will not work with our model. I do not think such a thing exists, to be honest. For example, in the sentence "that will not work with our model", it is clear "model" is a noun, but if I wrote "how do you model heat loss in a cup of tea", "model" is now a verb.
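
For keywords that do appear inside full sentences, the workflow described earlier in the thread (tag each row, then keep the nouns and adjectives) could be sketched roughly as follows. The helper names `stanza_tagger` and `keep_pos` are hypothetical, and the demo at the bottom uses hand-written tags rather than real Stanza output so that it runs without downloading any models.

```python
from typing import Callable, List, Sequence, Tuple

Tagged = List[Tuple[str, str]]  # (word text, universal POS tag) pairs


def stanza_tagger(lang: str = "en") -> Callable[[str], Tagged]:
    """Build a text -> [(word, upos)] function backed by a Stanza pipeline.

    Requires stanza and its models for `lang`; not called in the demo below.
    """
    import stanza

    nlp = stanza.Pipeline(lang, processors="tokenize,pos")

    def tag(text: str) -> Tagged:
        doc = nlp(text)
        return [(w.text, w.upos) for s in doc.sentences for w in s.words]

    return tag


def keep_pos(tagged: Tagged, wanted: Sequence[str] = ("NOUN", "ADJ")) -> List[str]:
    """Keep only the words whose universal POS tag is in `wanted`."""
    return [word for word, upos in tagged if upos in wanted]


# Hand-written tags for illustration only; these are NOT Stanza output.
hand_tagged = {
    "our model works": [("our", "PRON"), ("model", "NOUN"), ("works", "VERB")],
    "we model heat loss": [("we", "PRON"), ("model", "VERB"),
                           ("heat", "NOUN"), ("loss", "NOUN")],
}


def fake_tag(text: str) -> Tagged:
    """Stand-in for stanza_tagger() so the example runs offline."""
    return hand_tagged[text]


texts = list(hand_tagged)
pos_column = [fake_tag(t) for t in texts]      # what would go in a new "pos" column
picked = [keep_pos(t) for t in pos_column]     # nouns and adjectives per row
```

With pandas, the same two steps would be along the lines of `df["pos"] = df["keywords"].map(tag)` followed by `df["nouns_adjs"] = df["pos"].map(keep_pos)`, where `tag = stanza_tagger()` and the column names are placeholders. The two hand-tagged rows mirror the point above: "model" is kept in the first row and dropped in the second, because its tag depends on the surrounding sentence.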

mohamad-quteifan commented 2 years ago

Oh yes, of course, I understand that. I am just using it to mark the words. I was using NLTK and it was doing a terrible job of identifying the terms, but I found a different model that effectively predicts each word's POS (I am using the SennaTagger).