iesl / dilated-cnn-ner

Dilated CNNs for NER in TensorFlow

preprocessing before triggering 'preprocess.sh' for ontonotes #27

Open marc88 opened 5 years ago

marc88 commented 5 years ago

Hello, can anyone suggest what data processing needs to be done on the CoNLL-2012 data before calling the following?

./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf

Currently, simply calling preprocess.sh as above does not write anything to the file below and, I suppose, goes into an infinite loop:

data/vocabs/ontonotes_cutoff_4.txt

I've downloaded the train v4, dev v4 and test v9 tarballs from http://conll.cemantix.org/2012/data.html

Edit: I was able to convert the OntoNotes files to CoNLL format, but I am not sure what directory structure the preprocessing script expects. Can you help? The following is my directory structure:

$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0

Structure of $DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0 (this directory contains all the _gold_conll files; for example: /home/ss06886910/Strubel_IDCNN/data/conll-formatted-ontonotes-5.0/data/train/data/english/annotations/wb/c2e/00/c2e_0028.v4_gold_conll):

conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts

I tried running with the following parameter in ontonotes.conf (where $DATA_DIR = $DILATED_CNN_NER_ROOT/data):

export raw_data_dir="$DATA_DIR/conll-formatted-ontonotes-5.0/data"

With that setting, I get the following error:

Processing file: data/conll-formatted-ontonotes-5.0/data/development
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/development --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/development --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Embeddings coverage: 98.67%
Processing file: data/conll-formatted-ontonotes-5.0/data/test
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/test --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/test --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Traceback (most recent call last):
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 498, in <module>
    tf.app.run()
  File "/home/ss06886910/IDCNN/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 494, in main
    tsv_to_examples()
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 487, in tsv_to_examples
    print("Embeddings coverage: %2.2f%%" % ((1-(num_oov/num_tokens)) * 100))
ZeroDivisionError: division by zero

Regards
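The last frame of the traceback divides num_oov by num_tokens, so the ZeroDivisionError is consistent with num_tokens being zero, that is, no tokens were read from the test directory. As a minimal sketch (not the repository's code; the function name `embeddings_coverage` and the error message are hypothetical), a guard like the following would turn the crash into a more informative error:

```python
def embeddings_coverage(num_oov, num_tokens):
    """Return the percentage of tokens covered by the embeddings vocabulary.

    Hypothetical helper, not part of tsv_to_tfrecords.py. It assumes that a
    token count of zero means no input files were matched.
    """
    if num_tokens == 0:
        raise ValueError(
            "No tokens were read; check that the input directory contains "
            "files matching the expected file-name suffix."
        )
    # float() keeps the division exact under Python 2 as well as Python 3.
    return (1.0 - float(num_oov) / num_tokens) * 100.0


if __name__ == "__main__":
    print("Embeddings coverage: %2.2f%%" % embeddings_coverage(5, 10))
```

Failing early with a message about missing input files points at the file-matching step rather than at the coverage printout.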

Impavidity commented 5 years ago

@marc88 I have the same issue here. Did you manage to fix it?

Impavidity commented 5 years ago

I figured it out.

In the test set, the CoNLL file names end with gold_parse_conll instead of gold_conll. So you need to replace the line

[y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))\
                         if "/"+data_type+"/" in y and "/english/" in y]

with

[y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_parse_conll'))\
                         if "/"+data_type+"/" in y and "/english/" in y]

for the test set.
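An alternative to editing the pattern by hand for each split is to accept both suffixes in one pass. The following is only a sketch under assumptions: `find_conll_files` and `CONLL_PATTERNS` are hypothetical names, not part of the repository, and the filter compares path components instead of using the "/"+data_type+"/" substring check from the original line.

```python
import fnmatch
import os

# Both suffixes reported in this thread: *_gold_conll for train/development,
# *_gold_parse_conll for test.
CONLL_PATTERNS = ("*_gold_conll", "*_gold_parse_conll")


def find_conll_files(in_dir, data_type, language="english",
                     patterns=CONLL_PATTERNS):
    """Collect CoNLL files under in_dir for one split and one language.

    Hypothetical helper: data_type is a directory name such as "train",
    "development" or "test", matched against the path components below in_dir.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(in_dir):
        for name in filenames:
            if not any(fnmatch.fnmatch(name, p) for p in patterns):
                continue
            path = os.path.join(dirpath, name)
            # Only look at components below in_dir, so the location of
            # in_dir itself cannot cause a false match.
            parts = os.path.relpath(path, in_dir).split(os.sep)
            if data_type in parts and language in parts:
                matches.append(path)
    return sorted(matches)
```

With this sketch, a call such as `find_conll_files(FLAGS.in_file, data_type)` would stand in for the list comprehension quoted above, and an empty result is an early sign that the suffix or the directory layout does not match.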