propbank / propbank-frames

Lexicon of frame files used by Propbank annotation. A searchable, readable version of the latest release is here: http://propbank.github.io/v3.4.0/frames/
Creative Commons Attribution Share Alike 4.0 International

fixing examples and null elements #10

Closed arademaker closed 2 years ago

arademaker commented 2 years ago
  1. the frame files contain 22486 examples.
  2. I would like to take those examples as a corpus with gold annotations
  3. I found many inconsistencies in the null elements and in the spaces around their markup, suggesting that they were probably edited manually. Many cases do not follow the list of null elements in sec 1.7 of the guidelines.

The current PR fixes many cases that I found, but more cleanup will probably be necessary.

MarthaSPalmer commented 2 years ago

Alexandre, Katie and I have talked about this and she can make the corrections. But you didn’t run your script on the latest set of frame files. Can you please run it again? Katie and Sameer, can you please make sure Alexandre has access to the latest version, I believe the development version?

And Alexandre, we haven’t forgotten our promise to add you to our PropBank maintenance calls. We were hoping to get our paper submitted and the latest frames released by now, but we have run into some snags. Perhaps you would like to join both, the calls and the paper effort?

Martha


sameer-pradhan commented 2 years ago

Alexandre,

You caught us right in the midst of a huge cleanup effort that should make a lot of things consistent and documented over the next few weeks.

Let me try to capture some of the things going on here, and give some historical context. Martha, please correct me if I am wrong.

  1. Examples in frames: The examples were originally created by framers for the sole purpose of frame development. Around the time we started the unification effort, we started inserting real examples from corpora as well. The examples were never intended to be used as a corpus by itself, but given that it has grown so much, we had been planning on doing that. Originally the examples were flat sentences and the annotations were not really aligned with them. Owing to this change, though, we started making a somewhat automatic pass over the examples so that they can actually be first-class annotations. Our plan is to keep the examples in the frame files but provide a way to make them consistent, first-class annotations so that they can be added to the corpus. Now, there is an interesting overlap here, where some examples are already annotations, and we have tried to identify those in the XML files. The next item is on the critical path of this one.

  2. Frame File Specification: Over time, various accommodations were made in the XML files, but not always in a consistent fashion, so they ended up creating some redundancy and inconsistency. We are right now in the process of fixing the DTD and making sure that the XML files actually are all valid. In fact, we found many inconsistencies there and are in the process of semi-automatically fixing them.

  3. Parses for the Examples: The other thing that these examples are missing is treebank parses. Originally PropBank annotations always pointed to nodes in the tree. Until a few years ago, the dependency on the treebank was significant; however, with the advent of recent models, that is no longer a necessity. We can now have propositions in the CoNLL serialization without having to have accurate parses. One option would be to create automatic parses by constraining them to match the argument spans, if we have to.

This is the current plan:

A. Clean up the DTD and XML files iteratively so that the DTD is complete and the XML files are all valid.
B. Classify the examples in the frame files (I am using XML and frame files interchangeably) into classes: for example, identify whether they belong to a particular subcorpus and have other layers of annotation, or are standalone and need to be treated differently.
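
Step B could be sketched roughly as follows. This is only an illustration, not the maintainers' tooling: the element and attribute names (`roleset`, `example`, `src`, `text`) are assumptions based on what is mentioned in this thread, and the two sentences and the `src` value are made up for the demo.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy frame file. Element/attribute names are assumed from the thread,
# not copied from the real DTD; the example sentences are invented.
SAMPLE = """
<frameset>
  <predicate lemma="avoid">
    <roleset id="avoid.01">
      <example name="hypothetical-1" src="some-corpus">
        <text>She avoided the question.</text>
      </example>
      <example name="hypothetical-2">
        <text>They avoid crowds.</text>
      </example>
    </roleset>
  </predicate>
</frameset>
"""


def classify_examples(xml_text):
    """Return (roleset id, class label, text) for every example element."""
    root = ET.fromstring(xml_text)
    rows = []
    for roleset in root.iter("roleset"):
        for example in roleset.iter("example"):
            # A non-empty src attribute is taken to mean "came from a corpus".
            src = (example.get("src") or "").strip()
            label = f"corpus:{src}" if src else "standalone"
            text = (example.findtext("text") or "").strip()
            rows.append((roleset.get("id"), label, text))
    return rows


rows = classify_examples(SAMPLE)
counts = Counter(label for _, label, _ in rows)
print(counts)
```

Keying the class on `src` alone is the simplest possible choice; a real pass would probably also need to check whether the referenced subcorpus carries other annotation layers.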

sameer-pradhan commented 2 years ago

One more thing...

Many of the changes that are happening on our end are being made to a private version of the frames repository. The plan going forward is to periodically update the public frames repository and to keep merging PRs into the private repository. That way the frame files underlying the annotations don't change too frequently so as to cause data inconsistencies in training. And at the same time frame files can also evolve and become richer and more consistent.

MarthaSPalmer commented 2 years ago

I think we should add Alexandre to the repository so that he can view the private version.

And I don’t see why we need to add treebank parses, manual or automatic.

Martha


sameer-pradhan commented 2 years ago

I think we should add Alexandre to the repository so that he can view the private version.

Agreed. Though we should wait until we have fixed the XMLs and are at the point where the private repository is just a future version of the public one, rather than one that has artifacts that cannot be traced back to the public one, as is the case now.

sameer-pradhan commented 2 years ago

And I don’t see why we need to add treebank parses, manual or automatic.

We don't have to, but reconciling the various examples with data would make it possible to ensure that all the predicates in a given sentence are collected together rather than being captured individually. I mentioned it probably more because in my head there is always a Tree underlying a Proposition annotation. My brain hasn't quite accepted the fact that parses are no longer a necessary foundation for obtaining accurate SRL :-)

arademaker commented 2 years ago

Alexandre, Katie, and I have talked about this and she can make the corrections. But you didn’t run your script on the latest set of frame files. Can you please run it again?

Sure, I can @MarthaSPalmer. Unfortunately, I started with an interactive process using Emacs macros to find and fix patterns, but I can certainly redo the cleanup on the new data. We could eventually discuss details such as whether we need the null elements in the examples at all, or which ones are currently acceptable. Traces in the surface string are quite strange to me; they would make more sense in the parse trees. My interactive process was precisely to fix inconsistencies in the null element marks and postpone their removal to a script.

Perhaps you would like to join both, the calls and the paper effort?

I would be happy to collaborate for sure, thank you for the invitation. As you know, I have been working on forks of both this repo and the propbank-release and I would be happy to join efforts to avoid duplicated work or extra work for later synchronization.

@sameer-pradhan said

we started inserting real examples from corpora as well

Yes, I noticed the src attribute in the example tag in the DTD and in some frame files.

The examples were never intended to be used as a corpus by itself, but given that it has grown so much, we had been planning on doing that.

Oh, so my proposal is not too original! ;-) Indeed, my motivations to process the examples are: 1) examples provide complete coverage of all rolesets compared to the annotated corpora; 2) If the examples were processed by different tools, we can eventually use the analysis to investigate mappings from different syntactic theories (e.g. HPSG or UD) to Propbank.

We are right now in the process of fixing the DTD and making sure that the XML files actually are all valid.

The current XML files in the MASTER branch are valid according to the DTD in the same branch. I wrote a parser in Haskell for them and I had checked that with

% for f in *.xml; do xmllint --noout --dtdvalid frameset.dtd "$f"; done

But I agree that we do have some redundancy and maybe something could be improved. For example, the note tag could be more restricted and the example tag may have an ID?

One option would be to create automatic parses by constraining them to match the argument spans, if we have to.

My first experiment was to parse all examples (after the cleanup) with the HPSG English Grammar (http://delph-in.github.io/delphin-viz/demo/). I was able to get at least 1 analysis for ~85% of the examples. Next, I would need to do precisely what you mentioned: map the syntactic/semantic analysis of ERG to the annotations in the examples. So for avoid we have

[She]-2 wanted trace-2 to avoid the morale-damaging public disclosure that a trial would bring.

removing the marks and parsing, I got

[image: ERG analysis of the sentence above]

Besides the fact that ARG0 of Propbank is ARG1 in ERG, etc., the mapping for this example is almost perfect and allows me to infer that the predicate _avoid_v_1 from ERG can be mapped to the avoid.01 roleset. But getting the span from the parse tree and comparing it with the example tags is currently not trivial. Even for this example, the ARG0 is not the pronoun but the trace mark.
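
The "removing the marks" step might look something like the sketch below. It assumes the mark conventions visible in the example above (bracketed spans with a `-N` coindex, and a `trace-N` token, possibly written `*trace*-N`); the real frame files may use other null-element forms that this does not handle, and `strip_marks` is a hypothetical helper name.

```python
import re


def strip_marks(text):
    """Remove coindexed brackets and trace tokens, leaving plain text."""
    # "[She]-2" -> "She": drop the brackets and the optional coindex.
    text = re.sub(r"\[([^\]]*)\](?:-\d+)?", r"\1", text)
    # Drop "*trace*", "*trace*-2" or "trace-2". A bare word "trace" with no
    # index and no asterisks is left alone, since it may be ordinary text.
    text = re.sub(r"\*trace\*(?:-\d+)?|\btrace-\d+\b", "", text)
    # Collapse the whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove any space that now precedes punctuation.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text


example = ("[She]-2 wanted trace-2 to avoid the morale-damaging "
           "public disclosure that a trial would bring.")
plain = strip_marks(example)
print(plain)
```

Note that stripping the marks discards exactly the information needed to align ARG0 with the trace rather than the pronoun, so a real pipeline would probably want to record character offsets before removing anything.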

Not sure, but I would prefer to avoid complex and long examples from corpora, it would not be a duplication of work anyway. I would prefer simple and short examples similar to WN examples. So examples would continue to be part of documentation but with a little more formal analysis/annotation to ensure consistency. PS: I can't find it now, but I am almost sure I found an example where the predicate being exemplified didn't occur.

Many of the changes that are happening on our end are being made to a private version of the frames repository. The plan going forward is to periodically update the public frames repository and to keep merging PRs into the private repository. That way the frame files underlying the annotations don't change too frequently so as to cause data inconsistencies in training. And at the same time frame files can also evolve and become richer and more consistent.

Surely you do have reasons for the decision. I only suspect, my 2 cents, that having a private repo ends up creating an overlap of work and some waste of time as we just saw with this PR. I am not complaining, I am just wondering if the goals you have above could not be achieved with a contributing guideline (see example) and with the proper use of branches.

My brain hasn't quite accepted the fact that parses are no longer a necessary foundation for obtaining accurate SRL :-)

I am 100% like you! ;-) Moreover, we avoid the possible ambiguity in the interpretation of examples because we would be fixing the desired reading for the example.

arademaker commented 2 years ago

Any update here? Sorry for insisting... ;-)

sameer-pradhan commented 2 years ago

We appreciate your help in fixing the examples in the frame files. We had a meeting last Friday and discussed this at length. There are related issues that might be helpful to separate out and discuss. The following three sections do just that.

PRIVATE vs PUBLIC

The original reason for keeping a private repository was to "hide" unfinished and incomplete endpoints, so that enthusiastic users would not accidentally take them up without realizing the implications for existing data. You are right that keeping a repository private adds to the inertia of incorporating contributions from knowledgeable users such as yourself. The idea is to find a sweet spot between potentially delaying contributions and "misuse" of the data.

What we decided would be best is to have a public branch (with clear documentation) that exists as (somewhat dislocated) metadata until it is merged with the data to form a "release-able" bundle. And if people still use it, it is their problem.

FIXING EXAMPLES

Unfortunately, this PR is against an older version of the PB frames schema. In the recent past, we have been working towards integrating various fragmented schemas from different PB subprojects into one cohesive whole. This has caused us to modify the XML schema, which is now quite different from the one used for the frames in this repository. It is currently part of the "proposed34" branch of the "propbank-development" repository. We are still actively pushing fixes to this branch with the aim of merging it into the "main" branch very soon. We plan to fix the redundancy and only have one repository for the frame files (both for release and development).

We have fixed many of the examples semi-automatically, and in that process realized that many of these examples were not intended to be machine readable or the text had been corrupted and could not be semi-automatically fixed. So, we have identified many such cases and have created a list of the ones that absolutely need to be fixed manually.

If you could wait one more week, we believe that the repository will be in a much better place for you to make contributions. At that point there would still be some examples that need to be fixed manually and you can help us with that.

HPSG PARSER

Your intention behind parsing examples using another parser is not quite clear to us.

The point of having examples in the frame files is to provide the annotators with all the information that they need for annotating said cases. So, as far as we know, all the examples would be accompanied with complete set of arguments that go with that predicate. It is not clear what you are trying to accomplish by using the parser. Maybe you were planning on inserting the traces where they may not be present? Or, maybe it helps with adapting the guidelines to dependency parses?

It is the opinion of the lead linguist who is managing the frame files that the existence of traces in examples is not crucial from the annotator perspective. There is a gradually diminishing number of examples left where we might need to manually insert/fix traces. We have been very conservative in making the fixes. However, if traces are not significant, we might safely leave fixing them out for now, or add them over time as and when the opportunity arises, or merge contributions from you (and others) through PRs on the appropriate branch that can be merged back almost seamlessly without causing more work on our end.

What are your thoughts on traces?

arademaker commented 2 years ago

What we decided would be best is to have a public branch (with clear documentation) that exists as (somewhat dislocated) metadata until it is merged with the data to form a "release-able" bundle. And if people still use it, it is their problem.

IMHO, GitHub provides many alternatives. We can create regular 'releases' (in https://github.com/propbank/propbank-frames/releases the last one is from 2016). We can certainly use branches (master, dev etc).

If you could wait one more week, we believe that the repository will be in a much better place for you to make contributions. At that point there would still be some examples that need to be fixed manually and you can help us with that.

Sure. Of course, your team has much more experience than me and probably has reasons for the decisions taken so far. I tend to prefer small examples (created from scratch or simplified from the corpus). Ideally, we could have a web interface to browse the treebank and bring up corpus sentences for a given roleset. But simple examples are easier to understand and could provide complete coverage of all rolesets and of the cases with complete or partial ARGs.

as far as we know, all the examples would be accompanied with complete set of arguments that go with that predicate

Yes, but I do like examples with incomplete arguments too.

It is not clear what you are trying to accomplish by using the parser. Maybe you were planning on inserting the traces where they may not be present? Or, maybe it helps with adapting the guidelines to dependency parses?

We all know that syntactic theories diverge. In particular, Propbank and HPSG surely do not agree on the valence analysis of all English verbs. See, for instance, some discussion from @anncopestake in https://aclanthology.org/E09-1001/ or the thesis "Learning Bayesian Networks for Inference of Semantic Verb Classes" from Sergio Roa (Saarland University, 2007). My goal is similar to Sergio's. If I can produce some data-driven mapping from the propbank/verbnet verbs to the English Resource Grammar predicates, I can: 1) map the ERG treebanks to propbank annotation to train SRL models; 2) use ERG to annotate new data and map these annotations to PropBank annotation to train SRL models; 3) use the ERG predicates as a pivot to improve the mappings from WN to propbank/verbnet, since I am completing the glosstag corpus from Princeton Wordnet (sense annotation of all examples and definitions) and I have already parsed all examples and definitions from the Princeton Wordnet with ERG too. I also wrote about this in https://delphinqa.ling.washington.edu/t/mapping-erg-predicates-to-propbank-verbnet/576/12. I hope the ideas are clear; if not, I would be happy to schedule a call to discuss them and get your feedback.
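
One very simple reading of "data-driven mapping" is to count which ERG predicate co-occurs with each roleset across parsed examples and keep the most frequent one. The sketch below illustrates only that counting idea; it is not the method from the cited thesis, `best_mapping` is a hypothetical helper, and the observation list is toy data (only the avoid.01 / _avoid_v_1 pairing comes from this thread).

```python
from collections import Counter, defaultdict


def best_mapping(pairs, min_count=1):
    """Map each roleset to its most frequent ERG predicate.

    pairs: iterable of (roleset_id, erg_predicate) observations, e.g. one
    per example whose annotated predicate was aligned to an ERG predicate.
    Returns {roleset_id: (erg_predicate, relative_frequency)}.
    """
    counts = defaultdict(Counter)
    for roleset, erg_pred in pairs:
        counts[roleset][erg_pred] += 1

    mapping = {}
    for roleset, pred_counts in counts.items():
        pred, n = pred_counts.most_common(1)[0]
        if n >= min_count:
            mapping[roleset] = (pred, n / sum(pred_counts.values()))
    return mapping


# Toy observations: the third one stands in for an alignment error.
observations = [
    ("avoid.01", "_avoid_v_1"),
    ("avoid.01", "_avoid_v_1"),
    ("avoid.01", "_want_v_1"),
]
mapping = best_mapping(observations)
print(mapping["avoid.01"][0])
```

The relative frequency gives a crude confidence score; argument-level correspondences (such as PropBank ARG0 versus ERG ARG1) would need a separate alignment over spans.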

It is the opinion of the lead linguist who is managing the frame files that the existence of traces in examples is not crucial from the annotator perspective.

I agree, and I would prefer not to have them, or at least have the raw examples and eventually the annotated examples (with trace marks, syntactic analyses, etc.). I need only the plain text without marks for my goal (parse the examples with ERG grammar).

arademaker commented 2 years ago

Closing this, given that the frames were updated in the new release and the proposed changes were used in the new release anyway.