propbank / propbank-release

The official released annotations, both in .prop pointer format and as conll files. Does not contain the source texts
Creative Commons Attribution Share Alike 4.0 International

dev/train/test #13

Closed arademaker closed 2 years ago

arademaker commented 3 years ago

Do we have any information about the files ontonotes-{dev,test,train}-list.txt in https://github.com/propbank/propbank-release/tree/master/docs/evaluation ? What criteria were used to make this split? Each line names a file, so I understand that all sentences in a listed file belong to the corresponding set, right?

timjogorman commented 3 years ago

Those should be the splits used in the CoNLL-2012 shared task, and yes, they are split by entire files (as is necessary for coreference shared tasks). I'll defer to @sameer-pradhan regarding how the splits were made. My impression is that the splits were randomly allocated, and that "conll12-test" was then created by filtering out files from "test" that didn't have coreference annotations, but that was before my time.
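To illustrate what "split by entire files" means in practice, here is a minimal Python sketch that maps each listed file to exactly one split and fails if a file shows up in two splits. The function name `assign_splits` and the sample file names are hypothetical, and the sketch assumes only what the thread states: each list holds one file name per line.

```python
from typing import Dict, Iterable


def assign_splits(split_lists: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map every listed file name to its split (hypothetical helper).

    Raises ValueError if the same file is listed under two different splits,
    since a file-level split requires each file to belong to one set only.
    """
    assignment: Dict[str, str] = {}
    for split, names in split_lists.items():
        for raw in names:
            name = raw.strip()
            if not name:  # skip blank lines
                continue
            previous = assignment.get(name)
            if previous is not None and previous != split:
                raise ValueError(f"{name} is listed in both {previous} and {split}")
            assignment[name] = split
    return assignment


# Made-up file names, for illustration only.
example = assign_splits({
    "train": ["some/dir/file_a\n", "some/dir/file_b\n"],
    "dev": ["some/dir/file_c\n"],
})
print(example["some/dir/file_c"])
```

With such a mapping, every sentence inherits the split of the file it comes from.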

arademaker commented 3 years ago

Taking these ontonotes-{dev,test,train}-list.txt files, which contain one file name per line, I expanded them to obtain the sentences for each set. But I found many missing files, all of them inside the wb/sel collection. Below is the number of missing files for each directory inside wb/sel. Does anyone have an explanation for this?

 100 data/ontonotes/wb/sel/87
 100 data/ontonotes/wb/sel/80
  99 data/ontonotes/wb/sel/89
  99 data/ontonotes/wb/sel/81
  99 data/ontonotes/wb/sel/45
  99 data/ontonotes/wb/sel/41
  98 data/ontonotes/wb/sel/66
  98 data/ontonotes/wb/sel/65
  97 data/ontonotes/wb/sel/74
  97 data/ontonotes/wb/sel/44
  97 data/ontonotes/wb/sel/40
  96 data/ontonotes/wb/sel/88
  96 data/ontonotes/wb/sel/79
  96 data/ontonotes/wb/sel/63
  96 data/ontonotes/wb/sel/52
  96 data/ontonotes/wb/sel/51
  95 data/ontonotes/wb/sel/71
  95 data/ontonotes/wb/sel/69
  94 data/ontonotes/wb/sel/73
  94 data/ontonotes/wb/sel/70
  94 data/ontonotes/wb/sel/62
  94 data/ontonotes/wb/sel/39
  93 data/ontonotes/wb/sel/77
  93 data/ontonotes/wb/sel/61
  93 data/ontonotes/wb/sel/58
  93 data/ontonotes/wb/sel/56
  93 data/ontonotes/wb/sel/49
  92 data/ontonotes/wb/sel/78
  92 data/ontonotes/wb/sel/75
  90 data/ontonotes/wb/sel/72
  90 data/ontonotes/wb/sel/57
  89 data/ontonotes/wb/sel/64
  88 data/ontonotes/wb/sel/82
  88 data/ontonotes/wb/sel/37
  85 data/ontonotes/wb/sel/43
  84 data/ontonotes/wb/sel/94
  84 data/ontonotes/wb/sel/83
  83 data/ontonotes/wb/sel/67
  83 data/ontonotes/wb/sel/53
  82 data/ontonotes/wb/sel/76
  82 data/ontonotes/wb/sel/60
  81 data/ontonotes/wb/sel/38
  79 data/ontonotes/wb/sel/97
  79 data/ontonotes/wb/sel/68
  79 data/ontonotes/wb/sel/42
  78 data/ontonotes/wb/sel/86
  78 data/ontonotes/wb/sel/48
  77 data/ontonotes/wb/sel/84
  76 data/ontonotes/wb/sel/46
  71 data/ontonotes/wb/sel/36
  69 data/ontonotes/wb/sel/95
  66 data/ontonotes/wb/sel/50
  62 data/ontonotes/wb/sel/26
  59 data/ontonotes/wb/sel/96
  58 data/ontonotes/wb/sel/59
  58 data/ontonotes/wb/sel/35
  58 data/ontonotes/wb/sel/33
  57 data/ontonotes/wb/sel/92
  56 data/ontonotes/wb/sel/93
  56 data/ontonotes/wb/sel/90
  55 data/ontonotes/wb/sel/23
  54 data/ontonotes/wb/sel/47
  52 data/ontonotes/wb/sel/34
  51 data/ontonotes/wb/sel/31
  51 data/ontonotes/wb/sel/27
  48 data/ontonotes/wb/sel/32
  46 data/ontonotes/wb/sel/91
  44 data/ontonotes/wb/sel/28
  43 data/ontonotes/wb/sel/25
  42 data/ontonotes/wb/sel/54
  40 data/ontonotes/wb/sel/85
  40 data/ontonotes/wb/sel/30
  39 data/ontonotes/wb/sel/22
  37 data/ontonotes/wb/sel/24
  34 data/ontonotes/wb/sel/29
  33 data/ontonotes/wb/sel/98
  31 data/ontonotes/wb/sel/55
  28 data/ontonotes/wb/sel/09
  20 data/ontonotes/wb/sel/11
   2 data/ontonotes/wb/sel/05
   1 data/ontonotes/wb/sel/18
   1 data/ontonotes/wb/sel/10
   1 data/ontonotes/wb/sel/04
   1 data/ontonotes/wb/sel/03
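
A per-directory count like the one above could be produced along the following lines. This is only a sketch, not the script actually used. It assumes each line of a list file is a path relative to some root directory and matches the on-disk file name exactly, which may not hold if the lists omit file extensions. The names `count_missing` and `report` are hypothetical.

```python
from collections import Counter
from pathlib import Path, PurePosixPath
from typing import Iterable


def count_missing(listed_files: Iterable[str], root: str) -> Counter:
    """Count listed files that do not exist under root, keyed by parent directory."""
    root_path = Path(root)
    missing: Counter = Counter()
    for raw in listed_files:
        name = raw.strip()
        if not name:  # skip blank lines
            continue
        # Assumption: the listed name is the exact relative path of the file.
        if not (root_path / name).exists():
            missing[str(PurePosixPath(name).parent)] += 1
    return missing


def report(list_path: str, root: str) -> None:
    """Print counts sorted from most to fewest missing files."""
    with open(list_path, encoding="utf-8") as handle:
        counts = count_missing(handle, root)
    for directory, n in counts.most_common():
        print(f"{n:4d} {directory}")


# Example call (paths are placeholders):
# report("docs/evaluation/ontonotes-train-list.txt", ".")
```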

arademaker commented 2 years ago

Answered in issue #2.