greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
MIT License
969 stars, 109 forks

Adding Punctuation in Devanagari Script (Hindi) reduces 'Confidence'. #68

Open abhishekkr opened 3 years ago

abhishekkr commented 3 years ago

I tested it with Hindi (in Devanagari script).

With a random sentence the confidence was 39.x%. So I tested it with the sentence "It is What Language.", which I guessed might be close to home for this library.

In Devanagari script, the full stop is written as '|'. When I tested the sentence without it, the confidence was 100%. But with it included, the confidence dropped a few points.

I'm a n00b when it comes to Rust, but I can try fixing it if you don't have time and can point me in the right direction.

Also, if you need any help training models and have a guide I can follow, I'm happy to help with that too.

PS: Thanks for this. I was looking for an interesting project to restart my Rust journey.

[Two screenshots attached: Screenshot_20210211-082028_Chrome.jpg, Screenshot_20210211-082019_Chrome.jpg]

greyblake commented 3 years ago

Hi @abhishekkr, thank you for the report! Right now whatlang is undergoing heavy refactoring and improvements. The old version is based on trigrams only. The new one will also take alphabets into consideration (see: https://github.com/greyblake/whatlang-rs/tree/alphabet/src/alphabets)

A likely reason for your results is the following: '|' is not recognized as punctuation, so it is used to build trigrams, which dilutes the confidence in the final result.
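To illustrate the effect, here is a minimal sketch in plain Rust. It is not whatlang's actual code: the function names `is_sentence_punctuation`, `normalize`, and `trigrams` are hypothetical, and treating both the ASCII pipe and the Devanagari danda (U+0964) and double danda (U+0965) as punctuation is an assumption made for the example. It only shows how a character that is not filtered out ends up inside extra trigrams.

```rust
/// Returns true for characters this sketch treats as punctuation.
/// Assumption: ASCII punctuation (which includes '|') plus the
/// Devanagari danda (U+0964) and double danda (U+0965).
fn is_sentence_punctuation(c: char) -> bool {
    c.is_ascii_punctuation() || c == '\u{0964}' || c == '\u{0965}'
}

/// Optionally replaces punctuation with spaces, then collapses
/// runs of whitespace and trims both ends.
fn normalize(text: &str, strip_punctuation: bool) -> String {
    let replaced: String = text
        .chars()
        .map(|c| {
            if strip_punctuation && is_sentence_punctuation(c) {
                ' '
            } else {
                c
            }
        })
        .collect();
    replaced.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Builds character trigrams (sliding windows of three `char`s)
/// over the normalized text.
fn trigrams(text: &str, strip_punctuation: bool) -> Vec<String> {
    let chars: Vec<char> = normalize(text, strip_punctuation).chars().collect();
    chars
        .windows(3)
        .map(|window| window.iter().collect::<String>())
        .collect()
}
```

With this sketch, `trigrams("abc |", false)` produces three trigrams, two of which contain the space or '|', while `trigrams("abc |", true)` produces only the trigram from the word itself. In a short sentence, those extra trigrams make up a noticeable share of the total, which is one plausible way the confidence could drop.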

Unfortunately, I have zero knowledge of Hindi, so if you're available to assist with Devanagari languages, that would be very helpful!

abhishekkr commented 3 years ago

Sure, I'd be glad to help with that (Hindi being my first language) and with any other feature you might not be able to pick up due to your schedule.

Just two things: I've only done minimal introductory Rust, back in its pre-stable releases, and I might not always be on time due to prior commitments. I won't need hand-holding with Rust or the project in general; that I'll manage (with feedback), but I can't get out of those commitments.