names of the profiles in the last ERG treebanks

goodmami / mrs-to-penman

Utilities for converting MRS data to the PENMAN serialization of DMRS

MIT License


names of the profiles in the last ERG treebanks #5

Open arademaker opened 2 years ago

arademaker commented 2 years ago

This is related to https://github.com/delph-in/docs/issues/40, and maybe @olzama and @danflick can add something.

The names of the ERG gold profiles in tsdb/gold have changed, and http://svn.delph-in.net/erg/tags/2020/etc/redwoods.xls does not preserve the old names, which is pretty confusing. So, for example, wsj06c is now just wsj06, right?

How were the dev, test, and train sets defined for https://github.com/goodmami/mrs-to-penman/blob/master/convert-redwoods.sh#L8-L187? Could the new names affect the dev/test/train sets?

arademaker commented 2 years ago

I ended up with the following list:

testsuites=(
  "+csli"
  "+esd"
  "+fracas"
  "+mrs"
  "+trec"

  "?cb"
  "+ecoc"
  "+ecos"
  "=ecpa"
  "?ecpr"

  "+hike"
  "+jh"
  "?jhk"
  "?jhu"

  "+tg"
  "?tgk"
  "?tgu"
  "+ps"
  "?psk"
  "?psu"
  "?rondane"

  "+rtc000"
  "+rtc001"

  "+bcs"
  "+ccs"
  "+control"
  "+scm"
  "+peted"
  "?petet"

  "+pest"
  "+omw"
  "+ntucle"
  "+handp12"
  "+sh-spband-r"
  "+sh-spec"

  "+vm6"
  "+vm13"
  "+vm31"
  "?vm32"
  "+wlb03"
  "+wnb03"

  "+ws201"
  "+ws202"
  "+ws203"
  "+ws204"
  "+ws205"
  "+ws206"
  "+ws207"
  "+ws208"
  "+ws209"
  "+ws210"
  "+ws211"
  "=ws212"
  "?ws213"
  "?ws214"

  "+wsj00"
  "+wsj01"
  "+wsj02"
  "+wsj03"
  "+wsj04"
  "+wsj05"
  "+wsj06"
  "+wsj07"
  "+wsj08"
  "+wsj09"
  "+wsj10"
  "+wsj11"
  "+wsj12"
  "+wsj13"
  "+wsj14"
  "+wsj15"
  "+wsj16"
  "+wsj17"
  "+wsj18"
  "+wsj19"
  "=wsj20"
  "?wsj21"
  "+wsj23"
)
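
To see how many profiles carry each marker, the list can be grouped by its first character. This is only a sketch over a shortened sample of the list above. It makes no assumption about what `+`, `=`, and `?` mean, since the mapping to the train/dev/test sets is defined in convert-redwoods.sh and not stated in this thread; the name `by_marker` is illustrative.

```python
from collections import defaultdict

# A shortened sample of the list above; paste in the full list to use it.
testsuites = ["+csli", "?cb", "=ecpa", "+wsj19", "=wsj20", "?wsj21", "+wsj23"]

# Group profile names by their one-character marker.
by_marker = defaultdict(list)
for entry in testsuites:
    marker, name = entry[0], entry[1:]
    by_marker[marker].append(name)

# Print each marker with its profile count and names.
for marker, names in sorted(by_marker.items()):
    print(marker, len(names), " ".join(names))
```

Running the same grouping over the old and new lists and comparing the results per marker would show whether a renamed profile moved between sets.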

arademaker commented 2 years ago

In https://arxiv.org/pdf/1904.11564.pdf, you wrote:

About half of the training data comes from the Wall Street Journal (sections 00-21), while the rest spans a range of domains, including Wikipedia, e-commerce dialogues, tourism brochures, and the Brown corpus. The data is split into training, development and test sets with 72,190, 5,288, and 10,201 sentences, respectively.

Once the script had run, I counted the graphs with:

(venv) ar@tenis mrs-to-penman % rg "^\(\)"  | wc -l
    2297
(venv) ar@tenis mrs-to-penman % rg "^\([0-9]"  | wc -l
   69319

So I am missing (72190+5288+10201)-69319 = 18,360 sentences...
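
The arithmetic can be checked with a few lines of Python. The numbers are the ones quoted above; the variable names are only illustrative.

```python
# Split sizes reported in the paper (training, development, test).
paper_split_sizes = {"train": 72190, "dev": 5288, "test": 10201}

# Graphs counted in the converted output with the second rg command above.
converted_graphs = 69319

paper_total = sum(paper_split_sizes.values())
missing = paper_total - converted_graphs

print(f"paper total: {paper_total:,}")  # paper total: 87,679
print(f"missing:     {missing:,}")      # missing:     18,360
```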

The profiles sum up to:

% find profiles -name 'item.*' | xargs gzcat | wc -l
  131401

goodmami commented 2 years ago

How were the dev, test, and train sets defined for https://github.com/goodmami/mrs-to-penman/blob/master/convert-redwoods.sh#L8-L187?

These were taken from the redwoods.xls file linked in the comment above the code you linked to here.

Regarding the new distribution of the Redwoods 2020 data, I don't really know what changed or why, so I cannot comment on your proposed list.

Regarding the counts, a few things:

The number of lines in the item files is not always a good indicator of the number of items. Some items may be marked to be ignored (e.g., when they contain non-linguistic data scraped from a web page, like a table of numbers). You should filter on those where i-wf is 1.
MRSs that could not be converted to DMRS were dropped (possibly because the MRS was ill-formed).
Duplicate MRSs were dropped (as noted in the appendix of https://aclanthology.org/N19-1235/).

These may account for the discrepancies you saw.
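
The i-wf filter from the first point could be sketched as follows, using only the standard library. It assumes the item file is a plain or gzipped text file with one row per line and `@`-separated fields. The function name `count_wellformed` is hypothetical, and the position of i-wf (`wf_column`) is an assumption the caller must supply after checking the profile's relations file.

```python
import gzip


def count_wellformed(item_path, wf_column):
    """Count rows in a TSDB item file whose i-wf field equals 1.

    wf_column is the zero-based position of i-wf in the item table. It is
    an assumption supplied by the caller (see the profile's relations file).
    Returns (total rows, rows with i-wf == 1).
    """
    opener = gzip.open if str(item_path).endswith(".gz") else open
    total = wellformed = 0
    with opener(item_path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("@")
            if fields == [""]:
                continue  # skip blank lines
            total += 1
            if len(fields) > wf_column and fields[wf_column] == "1":
                wellformed += 1
    return total, wellformed
```

Comparing the two returned numbers for each profile shows how much of the gap a plain `wc -l` count hides.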

arademaker commented 2 years ago

Sorry, I was counting the profile inputs (items), but I should have counted the results:

% find profiles -name 'item.*' | xargs gzcat | wc -l
  131401
% find profiles -name 'result.*' | xargs gzcat | wc -l
   98924
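
A Python counterpart of the second command is sketched below. Besides the raw row count, it also counts distinct parse-ids per profile, in case some item has more than one result row; whether that actually happens in these gold profiles is not established in this thread. The function name `count_results` is hypothetical, and treating the first `@`-separated field of each result row as parse-id is an assumption to verify against the profile's relations file.

```python
import gzip
from pathlib import Path


def count_results(profiles_root):
    """Return (result rows, distinct (profile, parse-id) pairs)."""
    rows = 0
    parsed = set()
    for path in sorted(Path(profiles_root).rglob("result*")):
        if not path.is_file():
            continue
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                rows += 1
                # Assumption: parse-id is the first '@'-separated field.
                parsed.add((path.parent, line.split("@", 1)[0]))
    return rows, len(parsed)
```

If the two numbers differ, the second is the closer estimate of how many items received a parse.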

arademaker commented 2 years ago

I already counted the cases of possibly invalid MRSs; that is the 2297 above.

goodmami commented 2 years ago

Ah, yes, the result file is better because of course some items won't get a parse. Good catch.