stucco / docs

Documentation and Issue Tracking for Stucco
https://stucco.github.io/

entity labeling #7

Open jtyoui opened 6 years ago

jtyoui commented 6 years ago

Excuse me, how is the entity labeling done?

mikeiannacone commented 6 years ago

I'm not sure what you mean. Can you be more specific?

jtyoui commented 6 years ago

Could you tell me how you do the NER labeling?

jtyoui commented 6 years ago

I want to know how I can extract the entity relationships in Chinese. If I annotate the entities first, what should I do with them?

mikeiannacone commented 6 years ago

Ok, I'm still not quite sure what you need.

Do you want to present the data to the user in Chinese? This wouldn't really need any changes to the NLP, database, or anything else on the back end. I could show you what changes would be needed in the UI (and possibly in the REST API).

Do you have documents in Chinese that you would like to label and store? This would definitely require new sets of training data, and may also need some other changes to the NLP. That would probably be more difficult, but I would have to ask the other team members for more input on specifics.

Anyway, let me know what you need and then hopefully I'll have more specific information.

wawang250 commented 6 years ago

Hi, we are actually trying to do Named Entity Recognition on a set of Chinese documents, but these documents are not labeled. We have tried your project on English files and it worked very well.

We would like to know how we should label our documents, or the entities in these documents, so that we can build a proper training set. Or could you give us a small demo of your labeled data set, so we would have a clue where to start?

Thanks again for your reply.

mikeiannacone commented 6 years ago

Ok, after thinking through this, and getting some input from the rest of the team, I think I can point you guys in the right direction on this.

To support any non-English documents, you will need to make some changes to the entity extractor and the relation extractor. Both repos contain updated and reasonably detailed README files that describe them, but to summarize: the entity extractor labels the "entities" it finds in the text, and the relation extractor decides how those entities are related to each other. For example, if a sentence contains two version numbers and two software products, the entity extractor would find and label them, and the relation extractor would match each product with its version.
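
To make that concrete, here's a rough illustration of what the two stages produce for a sentence like that. The label and relation names below are placeholders for illustration, not Stucco's actual label set:

```
Input:     Microsoft Word 14.0 and LibreOffice 6.1 are both affected.
Entities:  [software: Microsoft Word] [version: 14.0] [software: LibreOffice] [version: 6.1]
Relations: (Microsoft Word, hasVersion, 14.0)
           (LibreOffice, hasVersion, 6.1)
```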

The entity extractor uses Stanford’s CoreNLP for a lot of non-domain-specific tasks, including sentence splitting, tokenizing, part of speech (POS) tagging, and generating the parse tree. This library apparently has Chinese models that can be loaded, but you'll need to look through their documentation for the specifics.
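
For reference, switching CoreNLP to its Chinese models usually amounts to loading the Chinese properties file instead of the English defaults. A minimal sketch, assuming the stanford-corenlp models-chinese jar is on your classpath (check the CoreNLP docs for the exact annotator list your version supports):

```java
import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ChinesePipelineSketch {
    public static void main(String[] args) throws Exception {
        // Load the Chinese defaults (word segmenter, POS tagger, parser, etc.)
        // shipped in the models-chinese jar, instead of the English models.
        Properties props = new Properties();
        props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Sentence splitting, tokenizing, POS tagging, and parsing all happen
        // in this one call, exactly as they would for English input.
        Annotation doc = new Annotation("微软发布了新的安全补丁。");
        pipeline.annotate(doc);
    }
}
```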

After all of that pre-processing has been done, the entity extractor then uses gazetteer(s) (basically a dictionary) to label known entities (e.g. "Microsoft"). After that it uses a trained Apache OpenNLP averaged perceptron model to find entities not contained in the gazetteer (e.g. "Obscure Developer LLC"). You would need to replace or expand these gazetteers - ours were generated from sources like Freebase and Wikipedia, which should include many languages. To generate a new Apache OpenNLP model, you'll need your own training corpus. Information about how we generated those models is in our recent publication here: https://ieeexplore.ieee.org/document/8260670/. The models and dictionaries you're replacing are contained in the resources directory of that project.
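
As a rough sketch of that last step, training a replacement OpenNLP name finder model looks something like the following. The file names and the "software" entity type are placeholders; the training file uses OpenNLP's `<START:type> ... <END>` markup, one tokenized sentence per line:

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainChineseNameFinder {
    public static void main(String[] args) throws Exception {
        // zh-train.txt (placeholder name): one segmented sentence per line, with
        // entities marked like: <START:software> 微软 办公软件 <END> 发布 了 更新 。
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("zh-train.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Perceptron training, matching the averaged perceptron models
        // described above; OpenNLP's default is maximum entropy.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");

        TokenNameFinderModel model = NameFinderME.train(
                "zh", "software", samples, params, new TokenNameFinderFactory());
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("zh-software-model.bin"))) {
            model.serialize(out);
        }
    }
}
```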

The relation extraction can be done with either pattern matching or SVM models, depending on which branch is checked out (the master branch uses pattern matching; the "svm" branch uses SVM). Either one would need to be updated with Chinese sentence patterns or new SVM models. In either branch, those live in that project's resources directory.
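
To illustrate what "sentence patterns" means here, below is a purely hypothetical sketch; Stucco's real patterns and intermediate data structures are different (see the relation extractor's README), but the general idea is matching patterns over the labeled entities in a sentence:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternRelationSketch {
    public static void main(String[] args) {
        // Hypothetical intermediate form: entity labels from the entity
        // extractor flattened into a tagged string.
        String tagged = "[software:Foo编辑器] 的 版本 [version:2.1] 存在 漏洞";

        // A "sentence pattern" tying a software entity to the version that
        // follows it, adapted to one Chinese phrasing ("X 的 版本 Y").
        Pattern hasVersion = Pattern.compile(
                "\\[software:([^\\]]+)\\] 的 版本 \\[version:([^\\]]+)\\]");

        Matcher m = hasVersion.matcher(tagged);
        while (m.find()) {
            System.out.println(m.group(1) + " --hasVersion--> " + m.group(2));
        }
    }
}
```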

If you can change the entity extractor to load the appropriate CoreNLP models, and then replace all of the gazetteers and models in the resources directories in both of those projects, you guys should be able to get that working in any language you like. Generating those models and gazetteers was somewhat difficult, but that publication I linked above should help get you started with generating and evaluating them.