Language recognition fails for programming language code #195

Closed: Savan2708 closed this issue 7 months ago

Savan2708 commented 7 months ago

When I provide any code (C#, C++, C) as input, I mostly get Language.YORUBA, Language.ESPERANTO, or some other random language. It should be detected as Language.ENGLISH.

pemistahl commented 7 months ago

Lingua's purpose is to recognize natural languages only but not programming languages.

Savan2708 commented 7 months ago

> Lingua's purpose is to recognize natural languages only but not programming languages.

I know Lingua's purpose is to recognize natural languages only, but I have removed all punctuation and extra spaces and applied other text preprocessing as well. Lingua should then detect the result as English, because all variable names, class names, and function names are in English.

pemistahl commented 7 months ago

Please give me an example for a text after doing your pre-processing. Otherwise, I'm not able to help you.

Savan2708 commented 7 months ago

> Please give me an example for a text after doing your pre-processing. Otherwise, I'm not able to help you.


pemistahl commented 7 months ago

Well, it's obvious, isn't it? Your text contains lots of weird abbreviations such as hz or bd and compounds such as parsegeojson or registercoordinatesystem which have nothing to do with grammatically correct English. The statistical model is trained on grammatical English, that's why Lingua is not able to correctly identify the language of code.
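
One preprocessing change that may reduce compounds such as parsegeojson is to split identifiers at camelCase and underscore boundaries before lowercasing. The following is a minimal sketch under that assumption. `split_identifiers` is a hypothetical helper and not part of Lingua, and nothing in this thread confirms that Lingua will classify the resulting text as English.

```python
import re

# Hypothetical helper, not part of Lingua.
# Matches the empty position between a lowercase letter or digit and an
# uppercase letter ("parseGeo" -> "parse|Geo"), and between an acronym and
# a following capitalized word ("XMLHttp" -> "XML|Http").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

# Any run of characters that are not ASCII letters (punctuation, digits,
# underscores, whitespace).
_NON_LETTERS = re.compile(r"[^A-Za-z]+")


def split_identifiers(code: str) -> str:
    """Return lowercase words with identifiers split at case boundaries."""
    spaced = _CAMEL_BOUNDARY.sub(" ", code)   # split camelCase / PascalCase
    spaced = _NON_LETTERS.sub(" ", spaced)    # drop punctuation, digits, underscores
    return " ".join(spaced.lower().split())   # lowercase and collapse whitespace


if __name__ == "__main__":
    print(split_identifiers("parseGeoJSON(registerCoordinateSystem)"))
```

The splitting has to happen before the text is lowercased, because the case boundaries are the only signal that separates the words. Abbreviations such as hz or bd would still remain in the output.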

Savan2708 commented 7 months ago

> Well, it's obvious, isn't it? Your text contains lots of weird abbreviations such as hz or bd and compounds such as parsegeojson or registercoordinatesystem which have nothing to do with grammatically correct English. The statistical model is trained on grammatical English, that's why Lingua is not able to correctly identify the language of code.

OK, understood. Thanks for the help.

Savan2708 commented 7 months ago

> Well, it's obvious, isn't it? Your text contains lots of weird abbreviations such as hz or bd and compounds such as parsegeojson or registercoordinatesystem which have nothing to do with grammatically correct English. The statistical model is trained on grammatical English, that's why Lingua is not able to correctly identify the language of code.

So, is there any way to handle this kind of input if I encounter such files while processing large amounts of data?

pemistahl commented 7 months ago

If there exists a library for recognizing programming languages, then maybe yes. With Lingua alone, this is not possible.
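
Absent such a library, one rough option for large batches is to flag files that look like source code and skip them before language detection. The sketch below is a crude heuristic, not a programming-language recognizer: `looks_like_code`, its keyword list, and the 0.3 threshold are all assumptions chosen for illustration. It must run on the raw text, before punctuation is removed.

```python
import re

# Hypothetical heuristic, not part of Lingua. A line counts as "code-like"
# if it ends with ';', '{' or '}', or starts with a common keyword from
# C, C++, C#, Java or Python. The keyword list is illustrative only.
_CODE_LINE = re.compile(
    r"[;{}]\s*$"
    r"|^\s*(#include|using |import |def |class |public |private |return )"
)


def looks_like_code(text: str, threshold: float = 0.3) -> bool:
    """Return True if at least `threshold` of non-empty lines look like code."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if _CODE_LINE.search(ln))
    return hits / len(lines) >= threshold


if __name__ == "__main__":
    sample = 'using System;\nclass Program {\n    static void Main() {\n    }\n}\n'
    # Files flagged here would be skipped instead of being passed to Lingua.
    print(looks_like_code(sample))
```

A heuristic like this will misclassify some inputs, so the threshold would need tuning against a sample of the actual data.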