I have documents containing URLs and added the UrlTextFilter to remove them so that language detection works well. But on some test data the detected language was wrong, or at least had a very low probability.
The test document (German text) with the UrlTextFilter shows a probability of 0.15 for German and 0.7 for Dutch (nl).
The URLs are rather complex and contain some special characters (brackets and so on). After removing the URLs with a more complex regexp before sending the text to the language detector, the probability for the same text is 0.99 for German.
So I suggest you improve the regular expressions.
I'll try to provide a PR, but have to check this first...
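
For illustration, here is a minimal sketch of the kind of pre-filtering described above, written in Python for brevity. The pattern and the `strip_urls` helper are hypothetical: this is not the exact regexp used in the test and not the pattern UrlTextFilter uses internally. It assumes a URL starts with a scheme or `www.` and runs until the next whitespace, so brackets and other special characters are removed along with the rest of the URL.

```python
import re

# Hypothetical pattern (an assumption, not the library's own):
# a URL begins with http://, https://, ftp:// or "www." and continues
# up to the next whitespace, so (), [], {} and similar characters are included.
URL_PATTERN = re.compile(r"\b(?:https?://|ftp://|www\.)\S+", re.IGNORECASE)


def strip_urls(text: str) -> str:
    """Replace every URL with a space, then collapse runs of whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


sample = (
    "Mehr dazu unter http://example.com/a(b)[c]?x={1}&y=2 "
    "und auf www.example.org/seite heute."
)
print(strip_urls(sample))  # Mehr dazu unter und auf heute.
```

Matching up to the next whitespace is deliberately greedy: it may swallow trailing punctuation that belongs to the sentence, which should matter little for language detection but would be wrong for other uses.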