Open Solaner opened 8 years ago
It's hard. A naive implementation would be relatively easy but proper sentence identification is really difficult just in English, without even considering other languages. I did a little googling on this subject when I was writing the TextRange module and it was enough to put me off.
I may come back to this once I've caught up with all the other stuff I've been ignoring for the last year or so.
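To illustrate the gap between "naive" and "proper", here is a minimal sketch of a naive splitter on a plain string. The function name splitSentencesNaive is hypothetical and is not part of Rangy; the rule it assumes (a sentence ends at any run of ".", "!" or "?") is exactly the oversimplification that breaks on abbreviations and decimals.

```javascript
// Naive rule (an assumption for illustration): a sentence is a run of
// non-terminator characters followed by one or more of . ! ?
// A trailing fragment without a terminator is also kept.
function splitSentencesNaive(text) {
  const matches = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Works on simple input:
console.log(splitSentencesNaive("It works. Does it? Yes!"));
// → ["It works.", "Does it?", "Yes!"]

// Breaks on an abbreviation and a decimal number:
console.log(splitSentencesNaive("Dr. Smith paid $3.50 for it."));
// → ["Dr.", "Smith paid $3.", "50 for it."]
```

The second call shows why the simple rule is not enough even for English: one sentence is cut into three pieces.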
I’d be happy to see a naive implementation in action, to say if it’s good enough or not.
If such an implementation wouldn’t require modification of the Rangy library, perhaps I could do it with some hints on how to proceed.
I’ve been surfing a while and found some interesting stuff.
For instance, Blast.js seems to give good results.
However, it cannot handle formatting boundaries within a sentence.
But Rangy is perfect in handling these format changes.
So a combination of the two implementations would solve this problem.
Are there other approaches, e.g. with some magic regular expressions?
Something like this one, which works perfectly for words in almost any language.
Is it naive to think that there might be a regular expression that could simply be assigned to the wordRegex property, effectively turning it into a sentenceRegex?
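To make the idea concrete, here is a minimal sketch on a plain string rather than on a Rangy range. The sentenceRegex pattern and the sentenceBoundsAt helper are hypothetical names, not part of Rangy, and this sketch does not verify whether Rangy's word tokenizer would accept such a pattern in place of wordRegex. It only shows that the same "a global regex picks the tokens" approach can locate the sentence around a character offset.

```javascript
// Hypothetical sentence pattern, playing the role wordRegex plays for words:
// a non-space start, then anything up to a run of . ! ? (plus closing quotes
// or brackets), or up to the end of the text if no terminator follows.
const sentenceRegex = /[^.!?\s][^.!?]*(?:[.!?]+["')\]]*|$)/g;

// Return the {start, end} offsets of the sentence containing `offset`,
// or null if the offset does not fall inside any match.
function sentenceBoundsAt(text, offset) {
  for (const match of text.matchAll(sentenceRegex)) {
    const start = match.index;
    const end = start + match[0].length;
    if (offset >= start && offset <= end) {
      return { start, end };
    }
  }
  return null;
}

const sample = "Hello world. How are you? Fine";
const bounds = sentenceBoundsAt(sample, 15);
console.log(sample.slice(bounds.start, bounds.end)); // → "How are you?"
```

Like any single-regex approach, this pattern still treats the period in an abbreviation such as "Dr." as a sentence end.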
I have found material about regular expressions for sentence boundaries.
But with my limited knowledge of the English language and my limited understanding of sophisticated regular expressions, I’m not always sure whether they describe available solutions or just “would be good to have” solutions.
Here are some links about regular expression approaches:
If you can give me some hints on which direction(s) to take, I might be able to implement it myself: range.expand("sentence"), or range.expand("word") calls combined with some check functions, or
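The "word expansion plus check functions" option could look roughly like the following sketch. It works on a plain string, with a whitespace tokenizer standing in for repeated range.expand("word") calls. All names here (ABBREVIATIONS, wordsWithOffsets, endsSentence, expandToSentence) are hypothetical and not part of Rangy, and the abbreviation list is a tiny English-only assumption.

```javascript
// Hypothetical, deliberately short abbreviation list (English only).
const ABBREVIATIONS = new Set(["dr", "mr", "mrs", "ms", "prof", "e.g", "i.e"]);

// Split text into whitespace-separated tokens with their offsets.
// This stands in for walking a range word by word.
function wordsWithOffsets(text) {
  return Array.from(text.matchAll(/\S+/g), (m) => ({
    text: m[0],
    start: m.index,
    end: m.index + m[0].length,
  }));
}

// Check function: does this token close a sentence?
function endsSentence(token) {
  // Token must end in . ! or ?, optionally followed by closing quotes/brackets.
  const m = /^(.*?)([.!?]+)["')\]]*$/.exec(token.text);
  if (!m) return false;
  // A single period after a known abbreviation is not a sentence end.
  if (m[2] === "." && ABBREVIATIONS.has(m[1].toLowerCase())) return false;
  return true;
}

// Grow outwards from the word at `offset` until a sentence end is met
// on each side. Returns {start, end} offsets, or null for empty text.
function expandToSentence(text, offset) {
  const words = wordsWithOffsets(text);
  if (words.length === 0) return null;
  let i = words.findIndex((w) => w.end >= offset);
  if (i === -1) i = words.length - 1;
  let first = i;
  while (first > 0 && !endsSentence(words[first - 1])) first--;
  let last = i;
  while (last < words.length - 1 && !endsSentence(words[last])) last++;
  return { start: words[first].start, end: words[last].end };
}

const text = "Dr. Smith paid $3.50 for it. Then he left.";
const s = expandToSentence(text, 10);
console.log(text.slice(s.start, s.end)); // → "Dr. Smith paid $3.50 for it."
```

The check function is where the real difficulty lives: it would still mis-handle an abbreviation that genuinely ends a sentence, and the list would have to be maintained per language. Mapping the resulting offsets back onto a Rangy range is a separate step not covered here.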
According to the documentation, the unit must be one of "word" and "character". Is there, or was there, an implementation of range functions for the unit "sentence"? In the test files directory there is the file words.html, which seems to have been intended for this. But I could not get it working for the unit "sentence". Is words.html just a remnant of an unfinished implementation? If not, what's the trick to make it work with the unit "sentence"? If so, is the code for this feature still available, or is there a workaround to mimic the missing feature?