Ontonotes data and experiments

anoopsarkar commented 8 years ago

Task: Convert the Ontonotes data into the CoNLL format.

The instructions for conversion are given here: http://cemantix.org/data/ontonotes.html

It also contains the script to convert to CoNLL format for all three languages: English, Chinese and Arabic.

jerryljq commented 8 years ago

Hi Anoop, I think this issue should also be assigned to me, right? Jiaqi.

anoopsarkar commented 8 years ago

@vista521 I added you to the development team on GitHub. Once you accept, I can add you as an assignee.

jerryljq commented 8 years ago

@anoopsarkar Oops, I don't seem to have received any notification that would let me accept it. Something probably went wrong...

anoopsarkar commented 8 years ago

@vista521 you should have access now.

To all the assignees: please push only to a branch for the ontonotes experiments, shared with your collaborators. When the converters are written and the experiments are done, please send a pull request with any ontonotes conversion code and the log files for the experiments.

liushiqi9 commented 8 years ago

@anoopsarkar So far we have converted only the English data, because we don't have the conll-formatted files for Chinese and Arabic.

This is what it looks like for English.conll:

bn/abc/00/abc_0009   0    4    optimistic    JJ       (S(ADJP*        -    -   -   -       *   (C-ARG1*     -
bn/abc/00/abc_0009   0    5         about    IN           (PP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    6           the    DT        (NP(NP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    7        future    NN              *)   future   -   1   -       *          *     -
bn/abc/00/abc_0009   0    8            of    IN           (PP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    9           the    DT           (NP*        -    -   -   -       *          *   (12
bn/abc/00/abc_0009   0   10       Mideast   NNP        *)))))))       -    -   -   -    (LOC)         *)   12)
bn/abc/00/abc_0009   0   11             .     .             *))       -    -   -   -       *          *     -

bn/abc/00/abc_0009   0    0          That    DT   (TOP(SINV(SBAR(S(NP*)       -    -   -   -       *   (ARG1*     -
bn/abc/00/abc_0009   0    1            's   VBZ                   (VP*        be   -   1   -       *        *     -
bn/abc/00/abc_0009   0    2           the    DT                (NP(NP*        -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    3    heartbreak    NN                      *)       -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    4            of    IN                   (PP*        -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    5          this    DT                   (NP*        -    -   -   -       *        *   (12
bn/abc/00/abc_0009   0    6        region    NN                 *))))))   region   -   3   -       *        *)   12)
bn/abc/00/abc_0009   0    7          says   VBZ                   (VP*)      say  01   1   -       *      (V*)    -
bn/abc/00/abc_0009   0    8           one    CD                   (NP*        -    -   -   -       *   (ARG0*     -
bn/abc/00/abc_0009   0    9         State   NNP                  (NML*        -    -   -   -   (ORG*        *     -
bn/abc/00/abc_0009   0   10    Department   NNP                      *)       -    -   -   -       *)       *     -
bn/abc/00/abc_0009   0   11      official    NN                      *)       -    -   -   -       *        *)    -
bn/abc/00/abc_0009   0   12             .     .                     *))       -    -   -   -       *        *     -

bn/abc/00/abc_0009   0    0    Whenever   WRB   (TOP(S(SBAR(WHADVP*)     -    -   -   -   *    (ARGM-TMP*)     *   (ARGM-TMP*      *             *   -
bn/abc/00/abc_0009   0    1         you   PRP                (S(NP*)     -    -   -   -   *        (ARG0*)     *            *      *             *   -
bn/abc/00/abc_0009   0    2        take   VBP                  (VP*    take  01   1   -   *           (V*)     *            *      *             *   -
bn/abc/00/abc_0009   0    3           a    DT                  (NP*      -    -   -   -   *        (ARG1*      *            *      *             *   -
bn/abc/00/abc_0009   0    4        step    NN                     *)   step   -   1   -   *             *)     *            *      *             *   -
bn/abc/00/abc_0009   0    5     forward    RB             (ADVP*))))     -    -   -   -   *    (ARGM-DIR*)     *            *)     *             *   -
bn/abc/00/abc_0009   0    6           ,     ,                     *      -    -   -   -   *             *      *            *      *             *   -
bn/abc/00/abc_0009   0    7         you   PRP                  (NP*)     -    -   -   -   *             *      *            *      *        (ARG1*)  -
bn/abc/00/abc_0009   0    8         are   VBP                  (VP*      be  03   -   -   *             *    (V*)           *      *             *   -
bn/abc/00/abc_0009   0    9       bound   VBN                  (VP*    bind  02   -   -   *             *      *          (V*)     *             *   -
bn/abc/00/abc_0009   0   10          to    TO                (S(VP*      -    -   -   -   *             *      *       (ARG1*      *             *   -
bn/abc/00/abc_0009   0   11          be    VB                  (VP*      be  03   -   -   *             *      *            *    (V*)            *   -
bn/abc/00/abc_0009   0   12      pushed   VBN                  (VP*    push  01   1   -   *             *      *            *      *           (V*)  -
bn/abc/00/abc_0009   0   13         way    RB                (ADVP*)     -    -   -   -   *             *      *            *      *    (ARGM-EXT*)  -
bn/abc/00/abc_0009   0   14        back    RB          (ADVP*)))))))     -    -   -   -   *             *      *            *)     *        (ARG2*)  -
bn/abc/00/abc_0009   0   15           .     .                    *))     -    -   -   -   *             *      *            *      *             *   -

bn/abc/00/abc_0009   0   0        Martha   NNP  (TOP(FRAG(NP*   -   -   -   -   (PERSON*   -
bn/abc/00/abc_0009   0   1       Raddatz   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   2             ,     ,              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   3           ABC   NNP           (NP*   -   -   -   -      (ORG*   -
bn/abc/00/abc_0009   0   4          News   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   5             ,     ,              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   6           the    DT           (NP*   -   -   -   -      (FAC*   -
bn/abc/00/abc_0009   0   7         State   NNP              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   8    Department   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   9             .     .             *))  -   -   -   -          *   -
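
For anyone reading these files programmatically, here is a minimal Python sketch of loading the columns shown above. It assumes whitespace-separated columns in the order visible in the sample (document id, part number, token index, word, POS tag, parse bit, then the remaining annotation columns), blank lines between sentences, and "#"-prefixed document marker lines. The names Token and read_conll_sentences are made up for this illustration and are not part of the conversion scripts.

```python
from collections import namedtuple

# Hypothetical container for the first six columns of a converted line.
Token = namedtuple("Token", "doc part index word pos parse")


def read_conll_sentences(lines):
    """Yield sentences (lists of Token) from CoNLL-formatted lines.

    Assumes blank lines separate sentences and that lines starting
    with '#' are document markers that carry no token.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split()
        sentence.append(
            Token(cols[0], int(cols[1]), int(cols[2]), cols[3], cols[4], cols[5])
        )
    if sentence:
        yield sentence


sample = """\
bn/abc/00/abc_0009   0   0        Martha   NNP  (TOP(FRAG(NP*   -   -   -   -   (PERSON*   -
bn/abc/00/abc_0009   0   1       Raddatz   NNP              *)  -   -   -   -          *)  -

bn/abc/00/abc_0009   0   0          That    DT   (TOP(SINV(SBAR(S(NP*)   -   -   -   -   *   (ARG1*   -
"""
sentences = list(read_conll_sentences(sample.splitlines()))
```

Only the first six columns are kept here, because the number of predicate-argument columns varies from sentence to sentence, as the sample shows.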

anoopsarkar commented 8 years ago

According to the website: http://cemantix.org/data/ontonotes.html the script skeleton2conll.sh should work on all three Ontonotes languages. Does it do strange things when run on Chinese and Arabic?

anoopsarkar commented 8 years ago

Try the script on this page: http://conll.cemantix.org/2012/data.html

jerryljq commented 8 years ago

I downloaded the new script and format files and ran the script on them. The new files seem to work for Chinese and Arabic.

anoopsarkar commented 8 years ago

Did you also repeat the conversion for English with the new script?

jerryljq commented 8 years ago

@anoopsarkar Yes, I think so. The new script comes with a new data set, including all three languages. I just saved all those converted files in a new folder.

anoopsarkar commented 8 years ago

OK. The next step will be to create a new format file and config files for the new data. Then the experiments can be run to train on ontonotes for each language and measure UAS (unlabeled attachment score) on the dev data.
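
As a reference for what "measure UAS" means, here is a minimal Python sketch: the fraction of tokens whose predicted head index equals the gold head index. The function name and the list-of-lists input are assumptions for this illustration, not the evaluator used in this repository. The sketch also counts every token, whereas some evaluation setups exclude punctuation.

```python
def unlabeled_attachment_score(gold_heads, pred_heads):
    """Return the fraction of tokens whose predicted head matches gold.

    Both arguments are lists of sentences; each sentence is a list of
    head indices, one per token (0 is assumed to mean the root).
    """
    total = 0
    correct = 0
    for gold, pred in zip(gold_heads, pred_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        total += len(gold)
        correct += sum(1 for g, p in zip(gold, pred) if g == p)
    # float() keeps the division exact under Python 2 as well.
    return float(correct) / total if total else 0.0


gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]
uas = unlabeled_attachment_score(gold, pred)  # 4 of 5 heads match
```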

anoopsarkar commented 8 years ago

English Ontonotes skel files are available at this location:

https://github.com/ontonotes/conll-formatted-ontonotes-5.0

(just for future reference)

jerryljq commented 8 years ago

@anoopsarkar Hi Anoop, I read last week's meeting notes. We should start training and testing on the dev sets. Since the data does not include dependency trees, should we just run pos_tagger.py to train and test on POS tags only?

anoopsarkar commented 8 years ago

@vista521 the plan was to use penn2malt for English and Chinese (we have the head rules for these two languages) to convert into dependency format.

jerryljq commented 8 years ago

@anoopsarkar I have extracted the data to use as input to Penn2Malt, but I have run into a problem. When I run the tool with the head rules provided on its website, it reports "could not find category" for labels such as TOP and NML, which are not in the head rule file. I wonder whether there is a newer version of the head rule file, since the one on the website could date back to 2003 while our data is from 2012. I also could not find any related files locally. Could you help with this problem? The Penn2Malt website is: http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
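
To make the problem concrete, here is a rough Python sketch of two steps: rebuilding a bracketed tree from the parse-bit column of the conll files, and listing the constituent labels that a given set of head-rule categories does not cover. It assumes each "*" in a parse bit stands for the "(POS word)" leaf of that token. The function names and the known_categories set are hypothetical, and the sketch does not read the actual Penn2Malt head rule file, whose format is not reproduced here.

```python
import re


def parse_bits_to_tree(tokens):
    """Join (word, pos, parse_bit) triples into one bracketed tree string."""
    pieces = []
    for word, pos, bit in tokens:
        # Each '*' is assumed to be a placeholder for the token's leaf.
        pieces.append(bit.replace("*", " (%s %s)" % (pos, word)))
    return "".join(pieces).strip()


def missing_categories(tokens, known_categories):
    """Return constituent labels in the parse bits that are not in known_categories."""
    found = set()
    for _word, _pos, bit in tokens:
        # A label is whatever follows '(' up to the next '(', ')', '*' or space.
        found.update(re.findall(r"\(([^\s()*]+)", bit))
    return found - set(known_categories)


tokens = [
    ("Martha", "NNP", "(TOP(FRAG(NP*"),
    ("Raddatz", "NNP", "*)"),
    (",", ",", "*"),
    ("ABC", "NNP", "(NP*"),
    ("News", "NNP", "*)"),
    (".", ".", "*))"),
]
tree = parse_bits_to_tree(tokens)
missing = missing_categories(tokens, known_categories={"NP", "FRAG"})
```

With the placeholder set above, TOP is reported as missing, which matches the kind of label the tool complained about.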

anoopsarkar commented 8 years ago

@vista521 was this for English or Chinese?

jerryljq commented 8 years ago

@anoopsarkar It's for both English and Chinese.

jerryljq commented 8 years ago

@anoopsarkar Hi Anoop, are there any updates or ideas on the issue above?

anoopsarkar commented 8 years ago

have a look at the English headrules given in this presentation: http://nlp.mathcs.emory.edu/doc/tlt-2010-choi-slides.pdf

anoopsarkar commented 8 years ago

Implementation of the above presentation seems to be here: https://github.com/clir/clearnlp/blob/master/src/main/java/edu/emory/clir/clearnlp/conversion/EnglishC2DConverter.java

You may have to install the entire clearNLP toolkit:

https://github.com/clir/clearnlp

jerryljq commented 8 years ago

@anoopsarkar I searched the whole project you linked above and finally found a txt file that contains organized head rules. With it, the Penn2Malt tool now works. However, these head rules only work for English, so it seems Chinese and Arabic cannot be handled.

kalryoma commented 8 years ago

I've tested the converted data for English. The results are as follows (5 iterations, 75000+ sentences per iteration):

Total Training Time:  29926.6495328
Interface object FirstOrderFeatureGenerator detected
     with interface get_local_vector
Evaluating...
Unlabeled accuracy: 0.874648393964
Unlabeled attachment accuracy: 0.881284276603

sfu-natlang / glm-parser

Ontonotes data and experiments #32