Closed lhambrid closed 2 years ago
That sounds pretty bad. Can you share the document with us? Does it have any blank lines?
On Sun, Jan 9, 2022 at 7:27 AM lhambrid @.***> wrote:
Describe the bug English Constituency Parser does not produce output for every line.
To Reproduce I am trying to compare the trees of a pair of documents.
refdoc = open(os.path.join('.', 'drive', 'MyDrive', 'en.devtest'), 'r').read().split('\n')
hypdoc = open(os.path.join('.', 'drive', 'MyDrive', 'sw-en-hyp.txt'), 'r').read().split('\n')
print(len(refdoc), len(hypdoc))
1013 1013
They are both English and they are both the same length in lines. However, when I run:
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency', tokenize_no_ssplit=True)
refdoc = nlp(refdoc)
hypdoc = nlp(hypdoc)
print(len(refdoc.sentences), len(hypdoc.sentences))
1012 1010
Expected behavior I expect doc.sentences to be the same length as the original unparsed doc, and I expect doc.sentences to be the same length for both docs. This is critical.
Environment:
- OS: Google Colab
- Python version: 3.7
- Stanza version: 1.3.0
Additional context Each line may contain more or less than 1 complete sentence.
Ok, it turns out the reference doc has 1 blank line and the hypothesis doc has 3 blank lines, so that explains it. I will just have to fill in those blanks with something. Thanks!
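For anyone hitting the same mismatch, here is a minimal sketch of how the blank lines could be located and filled before parsing. The helper names and the "." placeholder are illustrative assumptions, not part of Stanza. Note that a file ending in a newline yields a trailing empty string under read().split('\n'), which is one common source of a "blank line".

```python
def blank_line_indices(lines):
    """Return the indices of lines that are empty or whitespace-only."""
    return [i for i, line in enumerate(lines) if not line.strip()]


def fill_blank_lines(lines, placeholder="."):
    """Return a copy of `lines` with blank entries replaced by `placeholder`.

    The placeholder is an arbitrary choice (an assumption of this sketch);
    anything the tokenizer will treat as a non-empty sentence should do.
    """
    return [line if line.strip() else placeholder for line in lines]


# Example: a file whose text ends in "\n" produces a trailing empty string.
lines = "first line\n\nthird line\n".split("\n")
print(blank_line_indices(lines))
filled = fill_blank_lines(lines)
print(len(filled) == len(lines))
```

Recording the blank indices first makes it possible to drop or ignore the placeholder trees afterwards, so they do not affect the comparison.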
I'd suggest installing the dev branch anyway. It is more accurate, and parens will be escaped as LRB / RRB.
On Sun, Jan 9, 2022, 12:21 PM lhambrid @.***> wrote:
Closed #919 https://github.com/stanfordnlp/stanza/issues/919.
It's actually possible to get them to line up exactly with the following:
refdoc = stuff.read().split("\n")
refdoc = [stanza.Document([], text=d) for d in refdoc]
refdoc = nlp(refdoc)
# same for hypdoc
and to upgrade to the dev branch, you can do this:
pip install git+https://github.com/stanfordnlp/stanza.git@55b48e4ea6cd478b330d64da9a0a0373da2d2e42
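The one-document-per-line approach can be wrapped in a small helper, sketched below. The helper name trees_per_line and its make_doc parameter are hypothetical, introduced only so the function can be exercised without loading models. The stanza.Document([], text=...) call and the pipeline settings are the ones quoted earlier in this thread. The sketch assumes a blank line comes back as a document with no sentences.

```python
def trees_per_line(nlp, make_doc, lines):
    """Parse each line as its own document; return one list of tree strings per line."""
    # One input document per line, so the output has the same length as `lines`.
    parsed_docs = nlp([make_doc(line) for line in lines])
    # A blank line is assumed to yield a document with no sentences,
    # which shows up here as an empty list rather than a missing entry.
    return [[str(sent.constituency) for sent in doc.sentences] for doc in parsed_docs]


# Intended use with Stanza (not executed here; needs the English models):
#   import stanza
#   nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency',
#                         tokenize_no_ssplit=True)
#   make_doc = lambda text: stanza.Document([], text=text)
#   ref_trees = trees_per_line(nlp, make_doc, ref_lines)
#   hyp_trees = trees_per_line(nlp, make_doc, hyp_lines)
#   assert len(ref_trees) == len(hyp_trees) == len(ref_lines)
```

Because the result is indexed by line rather than by sentence, line i of the reference can be compared directly with line i of the hypothesis, even when one of them is blank.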
Well, I filled in the blanks with a dummy sentence, but the solution you offer is more elegant, so I will keep it in mind for the future. Thanks.