Open arademaker opened 2 years ago
I ended up with the following list:
testsuites=(
"+csli"
"+esd"
"+fracas"
"+mrs"
"+trec"
"?cb"
"+ecoc"
"+ecos"
"=ecpa"
"?ecpr"
"+hike"
"+jh"
"?jhk"
"?jhu"
"+tg"
"?tgk"
"?tgu"
"+ps"
"?psk"
"?psu"
"?rondane"
"+rtc000"
"+rtc001"
"+bcs"
"+ccs"
"+control"
"+scm"
"+peted"
"?petet"
"+pest"
"+omw"
"+ntucle"
"+handp12"
"+sh-spband-r"
"+sh-spec"
"+vm6"
"+vm13"
"+vm31"
"?vm32"
"+wlb03"
"+wnb03"
"+ws201"
"+ws202"
"+ws203"
"+ws204"
"+ws205"
"+ws206"
"+ws207"
"+ws208"
"+ws209"
"+ws210"
"+ws211"
"=ws212"
"?ws213"
"?ws214"
"+wsj00"
"+wsj01"
"+wsj02"
"+wsj03"
"+wsj04"
"+wsj05"
"+wsj06"
"+wsj07"
"+wsj08"
"+wsj09"
"+wsj10"
"+wsj11"
"+wsj12"
"+wsj13"
"+wsj14"
"+wsj15"
"+wsj16"
"+wsj17"
"+wsj18"
"+wsj19"
"=wsj20"
"?wsj21"
"+wsj23"
)
In https://arxiv.org/pdf/1904.11564.pdf, you wrote:
About half of the training data comes from the Wall Street Journal (sections 00-21), while the rest spans a range of domains, including Wikipedia, e-commerce dialogues, tourism brochures, and the Brown corpus. The data is split into training, development and test sets with 72,190, 5,288, and 10,201 sentences, respectively.
Once the script had run, I counted the graphs with:
(venv) ar@tenis mrs-to-penman % rg "^\(\)" | wc -l
2297
(venv) ar@tenis mrs-to-penman % rg "^\([0-9]" | wc -l
69319
So I am missing (72190+5288+10201)-69319 = 18,360 sentences...
The item files in the profiles sum up to:
% find profiles -name 'item.*' | xargs gzcat | wc -l
131401
How were the dev, test, and train sets defined for https://github.com/goodmami/mrs-to-penman/blob/master/convert-redwoods.sh#L8-L187?
These were taken from the redwoods.xls file linked in the comment above the code you linked to here.
Regarding the new distribution of the Redwoods 2020 data, I don't really know what changed or why, so I cannot comment on your proposed list.
Regarding the counts, a few things:
item files are not always a good indicator of the number of items. Some items may be marked to be ignored (e.g., when they contain non-linguistic data scraped from a web page, like a table of numbers). You should filter on those where i-wf is 1. These may account for the discrepancies you saw.
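The i-wf filter suggested above could be sketched roughly as follows. This is only an illustration, not the script used for the paper: it assumes each profile directory has a relations file declaring the column order of the item table, and that rows are "@"-separated. The helper names (field_names, count_items, read_table) and the example path are hypothetical.

```python
import gzip
from pathlib import Path


def field_names(relations_text, table):
    """Return the column names declared for `table` in a relations file."""
    names, in_table = [], False
    for line in relations_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not line[0].isspace():
            # An unindented line such as "item:" starts a table declaration.
            in_table = stripped.rstrip(":") == table
        elif in_table:
            # Indented lines look like "i-wf :integer"; keep only the name.
            names.append(stripped.split()[0])
    return names


def count_items(rows, columns):
    """Return (all rows, rows whose i-wf field is "1")."""
    wf = columns.index("i-wf")
    total = wellformed = 0
    for row in rows:
        if not row.strip():
            continue
        total += 1
        if row.rstrip("\n").split("@")[wf] == "1":
            wellformed += 1
    return total, wellformed


def read_table(profile_dir, table):
    """Yield the lines of a table stored either gzipped or as plain text."""
    zipped = Path(profile_dir) / (table + ".gz")
    plain = Path(profile_dir) / table
    if zipped.exists():
        with gzip.open(zipped, "rt", encoding="utf-8") as f:
            yield from f
    elif plain.exists():
        with open(plain, encoding="utf-8") as f:
            yield from f


# Hypothetical usage; "profiles/wsj00" is an assumed path:
# columns = field_names(Path("profiles/wsj00/relations").read_text(), "item")
# print(count_items(read_table("profiles/wsj00", "item"), columns))
```

Summing the second number over all profiles gives a count that is comparable to the paper's figures only if ignored items are in fact marked with an i-wf value other than 1.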
Sorry, I was reading the profile inputs but I should read the results:
% find profiles -name 'item.*' | xargs gzcat | wc -l
131401
% find profiles -name 'result.*' | xargs gzcat | wc -l
98924
I already count the cases of possibly invalid MRS; those are the 2297 above.
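To keep the figures in this thread side by side, here is a small tally. It uses only numbers quoted above, and it assumes the two rg patterns match disjoint sets of lines, so their sum is the number of graphs the script wrote.

```python
# Counts quoted in this thread.
paper_total = 72190 + 5288 + 10201  # train + dev + test from the paper
converted = 69319                   # graphs whose first token is a number
empty = 2297                        # empty "()" graphs, possibly invalid MRS
item_rows = 131401                  # rows in all item files
result_rows = 98924                 # rows in all result files

written = converted + empty         # assumed: the two patterns do not overlap

print("reported in the paper:   ", paper_total)
print("missing vs. the paper:   ", paper_total - converted)
print("graphs written:          ", written)
print("items without a result:  ", item_rows - result_rows)
print("results not written out: ", result_rows - written)
```

The result files hold more rows than the paper reports in total, so the remaining gap is more likely due to which profiles are selected than to missing parses.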
Ah, yes, the result file is better because of course some items won't get a parse. Good catch.
This is related to https://github.com/delph-in/docs/issues/40, and maybe @olzama and @danflick can add something.
The names of the ERG gold profiles in tsdb/gold changed. The http://svn.delph-in.net/erg/tags/2020/etc/redwoods.xls didn't preserve the old names, which is pretty confusing. So, for example, wsj06c is now just wsj06, right?
How were the dev, test, and train sets defined for https://github.com/goodmami/mrs-to-penman/blob/master/convert-redwoods.sh#L8-L187? Can the new names affect the dev/test/train sets?