Open GoogleCodeExporter opened 9 years ago
So, all you need to know is whether the text is in "Latin" script or not? (E.g.
Spanish and French are written using the same character set as English.) If
so, I can paste some quick code here (in Java) to check the character set.
Could you clarify your needs?
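A character-set check of the kind offered above might look like the sketch below. It is not code from this thread: the class name `LatinCheck`, the method `isMostlyLatin` and the `minRatio` parameter are made up for illustration. It relies only on the standard `Character.UnicodeScript` API (Java 7+) and counts the share of letters that belong to the Latin script.

```java
final class LatinCheck {
    /**
     * Returns true if at least minRatio of the letters in text are Latin-script.
     * Digits, spaces and punctuation are ignored; text with no letters returns false.
     * (Hypothetical helper, not part of langdetect.)
     */
    static boolean isMostlyLatin(String text, double minRatio) {
        int letters = 0;
        int latin = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step by code point, not by char
            if (!Character.isLetter(cp)) {
                continue;
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN) {
                latin++;
            }
        }
        return letters > 0 && (double) latin / letters >= minRatio;
    }
}
```

Note that such a check cannot separate English from French or Spanish, since all three use Latin letters.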
Original comment by mawa...@live.com
on 23 Mar 2011 at 2:58
To be precise, I want to know whether the text is English or not (not whether
it is Latin or not). My understanding is that the langdetect module detects
languages based not just on character sets but also on n-gram distributions.
So I was wondering how we can customize the langdetect module to identify
whether a given text is English or not, so that French or Spanish text is
identified as non-English.
Original comment by zeeva...@gmail.com
on 23 Mar 2011 at 8:45
[deleted comment]
Understood. Two immediate options I can think of:
1- Remove all profiles other than English. This might have an effect when
normalising the probabilities (unsure, so I would need to test it).
2- Just control it within your code (if "en" then ... else ...). Reducing
the number of profiles may well increase accuracy. But by how much? 5%? Is it
really worth spending much more time testing and exploring?
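For option 2, a minimal sketch of the "if en then ... else ..." check follows. To stay self-contained it does not call langdetect: the `langProbs` map is a hypothetical stand-in for whatever per-language probabilities the detector reports, and `EnglishFilter`, `isEnglish` and `minProb` are illustrative names, not part of any library.

```java
import java.util.Map;

final class EnglishFilter {
    /**
     * Returns true only if "en" is the most probable language and its
     * probability is at least minProb. Everything else (French, Spanish,
     * no result at all) is treated as non-English.
     * langProbs is an assumed input: language code -> probability.
     */
    static boolean isEnglish(Map<String, Double> langProbs, double minProb) {
        String best = null;
        double bestProb = 0.0;
        for (Map.Entry<String, Double> e : langProbs.entrySet()) {
            if (e.getValue() > bestProb) {
                best = e.getKey();
                bestProb = e.getValue();
            }
        }
        return "en".equals(best) && bestProb >= minProb;
    }
}
```

Keeping all profiles loaded and filtering afterwards means French or Spanish text competes against its own profile instead of being forced onto the English one.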
You sound pretty technical. Do you have a programming background, or do you
need help with implementing the proposed options?
Original comment by mawa...@live.com
on 23 Mar 2011 at 9:56
Thanks for your response.
1) I am afraid there is a problem with the first proposed option, if I
understood it correctly. Don't we need at least one profile other than
English, so that the system can predict either English or not? If we keep only
the English profile, the detected language will always be English, won't it?
Correct me if my understanding is wrong. Actually, I think this option would
work if one more profile were created from all the other non-English texts.
Let me know if that is possible.
2) As of now, I am using the second option. I would surely love to have a
solution that works better than this.
Yes, I do have a programming background. I appreciate your response.
Original comment by zeeva...@gmail.com
on 24 Mar 2011 at 1:13
1) If you have a single profile, the language id will compare the n-grams
generated from your input text with that single profile (for example "en").
Depending on your max threshold, the language id will return either "en" or
nothing (or unknown).
Right, I will try this over the weekend and come back to you early next week
after conducting a few tests to measure the accuracy of this option.
Original comment by mawa...@live.com
on 24 Mar 2011 at 8:51
The profiles bundled with langdetect are not learned as "English or not";
there is one profile learned per language.
What you want to do requires re-learning the language profiles as "English or
not".
As Comment 4-1 says, if you keep only the English profile, langdetect will
always output "en" 100%...
So within the current langdetect package, I also think the best way is the
proposal in Comment 4-2.
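The "always en 100%" effect can be illustrated with a small sketch. It assumes that the detector turns raw per-language scores into probabilities by dividing each score by the sum over all loaded profiles; that is an assumption for illustration, not a copy of langdetect's internals, and `PosteriorDemo` and `normalize` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PosteriorDemo {
    /**
     * Divides each raw score by the total so the results sum to 1.
     * With a single profile the only score is divided by itself,
     * so its probability is 1.0 no matter how poorly the text matches.
     */
    static Map<String, Double> normalize(Map<String, Double> rawScores) {
        double sum = 0.0;
        for (double s : rawScores.values()) {
            sum += s;
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : rawScores.entrySet()) {
            out.put(e.getKey(), sum > 0 ? e.getValue() / sum : 0.0);
        }
        return out;
    }
}
```

Under that assumption, a tiny raw score for "en" alone still normalizes to 1.0, while the same score next to a larger "fr" score drops well below it. This is why a threshold on the normalized probability cannot reject non-English text when only one profile is loaded.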
Original comment by nakatani.shuyo
on 27 Mar 2011 at 2:19
I have tried a few options. The quickest and best way is to handle it as
Nakatani suggested (comment 4-2).
Original comment by mawa...@live.com
on 31 Mar 2011 at 9:45
Original issue reported on code.google.com by
zeeva...@gmail.com
on 22 Mar 2011 at 2:57