stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Constituency Parser does not produce output for every line #919

Closed lhambrid closed 2 years ago

lhambrid commented 2 years ago

Describe the bug English Constituency Parser does not produce output for every line.

To Reproduce I am trying to compare the trees of a pair of documents.

refdoc = open(os.path.join('.', 'drive', 'MyDrive', 'en.devtest'), 'r').read().split('\n')
hypdoc = open(os.path.join('.', 'drive', 'MyDrive', 'sw-en-hyp.txt'), 'r').read().split('\n')
print(len(refdoc), len(hypdoc))

1013 1013

They are both English and they are both the same length in lines. However, when I run:

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency', tokenize_no_ssplit=True)
refdoc = nlp(refdoc)
hypdoc = nlp(hypdoc)
print(len(refdoc.sentences), len(hypdoc.sentences))

1012 1010

Expected behavior I expect doc.sentences to be the same length as the original unparsed doc, and I expect doc.sentences to be the same length for both docs. This is critical.

Environment:

  • OS: Google Colab
  • Python version: 3.7
  • Stanza version: 1.3.0

Additional context Each line may contain more or less than 1 complete sentence.

AngledLuffa commented 2 years ago

That sounds pretty bad. Can you share the document with us? Does it have any blank lines?

lhambrid commented 2 years ago

Ok, it turns out the reference doc has 1 blank line and the hypothesis doc has 3 blank lines, so that explains it. I will just have to fill in those blanks with something. Thanks!
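
A minimal sketch of how the blank lines could be located, assuming the file is read and split on '\n' as in the original report. The helper name `blank_line_indices` and the sample text are hypothetical. If blank lines produce no sentence, 1013 input lines with 1 and 3 blanks would account for the 1012 and 1010 sentences reported above.

```python
def blank_line_indices(lines):
    """Return the 0-based indices of lines that are empty or whitespace-only."""
    return [i for i, line in enumerate(lines) if not line.strip()]


# Hypothetical stand-in for a file read with .read().split('\n').
# Note that a trailing newline in the file yields an empty final element.
sample = "First line.\n\nThird line.\nFourth line.\n".split("\n")

blanks = blank_line_indices(sample)
non_blank_count = len(sample) - len(blanks)
print(len(sample), blanks, non_blank_count)  # prints: 5 [1, 4] 3
```

Running the same check on both documents before parsing shows which line indices would need special handling.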

AngledLuffa commented 2 years ago

I would still suggest installing the dev branch. It is more accurate, and parens will be escaped as LRB / RRB.


AngledLuffa commented 2 years ago

It's actually possible to get them to line up exactly with the following:

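refdoc = stuff.read().split("\n")  # stuff: placeholder for the opened input file
# Wrap each line in its own Document so each input line maps to one output Document
refdoc = [stanza.Document([], text=d) for d in refdoc]
refdoc = nlp(refdoc)  # returns a list of Documents, one per input line
# same for hypdoc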

and to upgrade to the dev branch, you can do this:

pip install git+https://github.com/stanfordnlp/stanza.git@55b48e4ea6cd478b330d64da9a0a0373da2d2e42

lhambrid commented 2 years ago

Well, I filled in the blanks with a dummy sentence, but the solution you offer is more elegant, so I will keep it in mind for the future. Thanks.
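
The dummy-sentence workaround could look roughly like the sketch below. The placeholder text and the helper name `fill_blank_lines` are hypothetical and do not come from the thread. Recording which indices were filled makes it possible to skip the placeholder trees when the two documents are compared.

```python
PLACEHOLDER = "Placeholder ."  # hypothetical dummy sentence, not from the thread


def fill_blank_lines(lines, placeholder=PLACEHOLDER):
    """Replace blank lines with a placeholder and report which indices were filled."""
    filled = []
    indices = []
    for i, line in enumerate(lines):
        if line.strip():
            filled.append(line)
        else:
            filled.append(placeholder)
            indices.append(i)
    return filled, indices


lines = ["A first sentence.", "", "Another sentence.", "   "]
filled, dummy_indices = fill_blank_lines(lines)
print(len(filled) == len(lines), dummy_indices)  # prints: True [1, 3]
```

Because the filled list keeps the original length, passing it to a pipeline built with tokenize_no_ssplit=True should give one sentence per input line, and the entries at dummy_indices can be left out of the comparison.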