stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0
9.55k stars, 2.7k forks

Arabic Processing data #1441

Open A7med7x7 opened 2 months ago

A7med7x7 commented 2 months ago
[Screenshot, 2024-04-16, 10:59:36 PM]

Why does the Arabic processing data seem to consist of violence and records of criminal activity, as if it were derived from a news API?

AngledLuffa commented 2 months ago

The datasets usually are derived from a news API. I don't have any influence over the data itself for Arabic.

It'd be easier and faster for us to use more varied sources of data for Stanza, but we can always redo either the Stanza or the CoreNLP models if you have more data available.

A7med7x7 commented 2 months ago

@AngledLuffa Yes, you definitely don't have control, but at least providing general data that is not biased towards one kind of behaviour would help a lot, as all the samples I have seem to be of this type. Data is everywhere; what's more important is to check its validity and what it intuitively means, and I can help with that.