kamalkraj / Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs

GNU General Public License v3.0

Data preparation for NER in CONLL 2003 BIO format #12

Closed vigneshprajapati closed 5 years ago

vigneshprajapati commented 5 years ago

To train my own NER model on custom entities, I need my dataset prepared in CoNLL-2003 format.

How can I convert my text documents (.txt files) to the CoNLL-2003 column format, i.e. [Word POS CHUNK NER]? Is there any tool available for this?

Note: I already have custom NER tags for the given text documents.

Sample data (training_data.txt):

(Sample 1) This Agreement of Work is made pursuant to the Global Developer Master Services Agreement effective as of May 24, 2018, as amended on March 28, 2016, between MA[CUSTOM_ENTITY], lnc.[CUSTOM_ENTITY] whose registered office or principal place of business is at 520 Madison Avenue, Ahmedabad, India, whose registered office or principal place of business is at Building A, Atlantis de la, Switzerland, collectively and ABC[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY] a wholly owned subsidiary of Amazon Services Ltd and having its registered office at 113 Red Avenue, 10th Floor, New York, NY 13027.

(Sample 2) This Agreement of Work is subject to the terms and conditions of the Master Agreement for Technology Consulting Services between Vignesh[CUSTOM_ENTITY] Services[CUSTOM_ENTITY] Limited[CUSTOM_ENTITY] and ABD[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY], an entity wholly owned by ABC[CUSTOM_ENTITY] Holdings[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY].

(Sample 3) This Agreement of Work dated October 22, 2013 between Google[CUSTOM_ENTITY] Services[CUSTOM_ENTITY] Limited[CUSTOM_ENTITY] and Avaya[CUSTOM_ENTITY] Communications[CUSTOM_ENTITY] Management[CUSTOM_ENTITY], LLC[CUSTOM_ENTITY] and any of its operating subsidiaries and affiliates which receive Services from Vendor incorporates and is governed by the terms and conditions contained in the Master Services Agreement Services, by and between Avaya and Vendor.

Where [CUSTOM_ENTITY] is the tag for new entity to be trained with NER.
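
For reference, inline annotations like the ones in these samples could be turned into CoNLL-style rows with a short script. The following is only a minimal sketch under several assumptions: tokens are split on whitespace, consecutive tagged words with the same label are treated as one entity (B- then I-), punctuation glued to the end of a tagged word becomes its own O token and ends the entity, and the POS and CHUNK columns are filled with a placeholder "_" because the source text carries no such information. The names to_bio and to_conll are hypothetical helpers, not part of this repository.

```python
import re

# Matches a token such as "LLC[CUSTOM_ENTITY]," and captures the word,
# the label, and any punctuation glued to the end.
TAG_RE = re.compile(r"^(?P<word>.+?)\[(?P<label>[A-Z_]+)\](?P<trail>[.,;:]*)$")


def to_bio(text):
    """Convert inline-tagged text into a list of (token, BIO tag) pairs."""
    pairs = []
    prev_label = None  # label of the previous token, None if it was untagged
    for raw in text.split():
        match = TAG_RE.match(raw)
        if match is None:
            # Untagged token: keep it as-is (punctuation stays attached).
            pairs.append((raw, "O"))
            prev_label = None
            continue
        label = match.group("label")
        prefix = "I-" if prev_label == label else "B-"
        pairs.append((match.group("word"), prefix + label))
        prev_label = label
        trail = match.group("trail")
        if trail:
            # Assumption: trailing punctuation is outside the entity
            # and closes it.
            pairs.append((trail, "O"))
            prev_label = None
    return pairs


def to_conll(sentences):
    """Render sentences as 'token POS CHUNK NER' rows, blank line between sentences."""
    blocks = []
    for text in sentences:
        # "_" is a placeholder for the POS and CHUNK columns.
        rows = [f"{token} _ _ {tag}" for token, tag in to_bio(text)]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = "between ABD[CUSTOM_ENTITY] LLC[CUSTOM_ENTITY], an entity"
    print(to_conll([sample]))
```

One limitation of this annotation style is that two different entities of the same type standing directly next to each other cannot be told apart, so the sketch merges them. Real POS and chunk tags would have to come from a separate tagger if the training code needs them.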

kamalkraj commented 5 years ago

@vigneshprajapati Try https://www.lighttag.io/

vigneshprajapati commented 5 years ago

Thanks @kamalkraj. Will this tool generate tagged data in CoNLL-2003 format?

kamalkraj commented 5 years ago

@vigneshprajapati
I don't think it will be able to generate data in CoNLL-2003 format. But you can use that website to tag your data and then rewrite the readfile function to train the model on your data. All you need are the sentences and their NER tags. You can also use spaCy's gold module.

(Screenshot attached: 2019-05-21 at 3:40:09 PM)
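
As a rough illustration of the "sentences and NER tags" idea, a replacement reader could look something like the sketch below. It is not the repository's readfile function. The name read_tagged_file is hypothetical, and the sketch assumes a whitespace-separated file with one token per line, the token in the first column, the NER tag in the last column, and blank lines between sentences.

```python
def read_tagged_file(path):
    """Read a column file into a list of sentences of (token, tag) pairs."""
    sentences = []
    current = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # A blank line or a -DOCSTART- marker ends the current sentence.
            if not line or line.startswith("-DOCSTART-"):
                if current:
                    sentences.append(current)
                    current = []
                continue
            columns = line.split()
            # Only the first column (token) and last column (tag) are used,
            # so any POS or chunk columns in between are ignored.
            current.append((columns[0], columns[-1]))
    # Flush the last sentence if the file does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences
```

Because only the first and last columns are read, a file with just two columns (token and tag) would work the same way as a full four-column CoNLL-2003 file under these assumptions.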