Ported Text Segmentation and Unicode Regex Parser

twitter / twitter-cldr-js

JavaScript implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more. Based on twitter-cldr-rb.

Apache License 2.0

346 stars 55 forks source link

Ported Text Segmentation and Unicode Regex Parser #47

Closed arnavk closed 10 years ago

arnavk commented 10 years ago

Note:

Separating CodePoint data is still WIP. I'm sending in a PR before completing it so that the review process can start while I figure it out.
All the tests are based on their counterparts in twitter-cldr-rb.
Since Range is not a data type in JavaScript (yet), I made a light implementation of Range in TwitterCldr.Range. Is this the best way?

Special Note:

Before this gets merged into the master, the data separation for CodePoint needs to be completed as injecting the CodePoint resources here makes the total bundle size ~3.4MB, which is too much. @KL-7 and I discussed this over email and agreed that Text Segmentation should be an opt-in feature that requires a spearate file (with all the CodePoint resources ~2.5MB).

arnavk commented 10 years ago

I haven't updated all the resources yet either. I'll do it once I've efficiently separated the data.

camertron commented 10 years ago

Hey @radzinzki thanks so much for working so hard and submitting this pull request. I'll take a deep look at it later this week. Let me address your questions as well:

Defining your own implementation of Range sounds great. It's probably part of some JavaScript library, but who knows :)
I'm surprised the codepoint data is so large... can we compress it in some way? I just stumbled across the way ICU does this - they use a data structure called a hash array mapped trie (HAMT). I think this is their implementation. Until we can compress the data, I think it's perfectly fine to make segmentation an opt-in feature.

arnavk commented 10 years ago

Hi @cameron, thanks for the link. I'll take a look and see if this saves us any space. I'm exploring other options as well. Let's see how this goes!

Also, the implementation of Range is fairly minimal with only whatever is required for the feature!

camertron commented 10 years ago

This looks quite good, @radzinzki! Here are a couple of general things I'd like you to fix across all files:

Please wrap all function calls in parens.
Please use 2-space indentation everywhere (there are a number of spec files that contain large indents).
Remove trailing whitespace.

Any luck compressing the codepoint data?

camertron commented 10 years ago

I'm also having trouble running an example. Here's my code:

var fs = require('fs');
eval(fs.readFileSync('./segmentation_data.js'));
var tc = require('./en.js')
var brk = new tc.BreakIterator('en');
brk.each_sentence("Mrs. foo went to. The store.");

I get a message that says "Couldn't find property sentence_break containing property value CR", which indicates to me that the codepoint data hasn't been loaded. Any ideas?

arnavk commented 10 years ago

Let me try this and find out. It works in browsers. I'll try it via Node as well. (I suppose that's how you were trying it, yes?)

arnavk commented 10 years ago

regarding the tab size and trailing whitespace: Sorry about that. I'll fix it all.

camertron commented 10 years ago

Yes, I was trying to run my example in node. What's the best way to load the codepoint data?

arnavk commented 10 years ago

@camertron can you try this in Node?

var fs = require('fs');
var TwitterCldr = require('./en.js')
eval(fs.readFileSync('./segmentation_data.js', utf-8));
var brk = new TwitterCldr.BreakIterator('en');
brk.each_sentence("Mrs. foo went to. The store.");

It should give the expected result:

[ 'Mrs. foo went to.', ' The store.' ]

(I'd been testing it directly in a browser so far.)

camertron commented 10 years ago

Sweet, it works! Thanks @radzinzki.

arnavk commented 10 years ago

@camertron No issues. With the separation, the main implementation bundle needs to be loaded first. segmentation_data.js overwrites some variables in TwitterCldr

devongovett commented 10 years ago

Hey guys. Just stumbled on this discussion. I wrote an implementation of ICU's trie data structure that they use for looking up codepoint data a few months ago here. It includes both the builder code and runtime code. I used it in my grapheme-breaker module if you want to see an example of it in use. Just thought I'd mention this since it might save you some effort. Cheers!

arnavk commented 10 years ago

@devongovett Thanks, buddy! I'll take a look at it and see if and how much of it we can use. :)

arnavk commented 10 years ago

Segmentation is not opt-in anymore. Its implementation and data is included in the main bundle.
- Cleaned the build process to not separate the segmentation data.
Removed trailing spaces from some files.
Updated the internal build

KL-7 commented 10 years ago

I see you've made some great progress here. Well done guys! I'm back from vacation and will try to review this PR in the next few days. If it already looks good to you, @camertron, feel free to merge it earlier.

arnavk commented 10 years ago

@KL-7 @camertron I am still working on the docs. It should update them in a commit probably tomorrow or the day after. Can we hold off the merge till then?

camertron commented 10 years ago

Yes, let's wait until the docs are ready. I'd also like @KL-7 to have a chance to look at the PR before it gets merged.

KL-7 commented 10 years ago

After looking through most of the changes all I can say is that you've done enormous amount of work on this one, Arnav. Thank you for that. I'm not very familiar with the Ruby implementation of this feature, so it's a bit hard to grasp all of it at once. For now I left only a few minor style comments. For deeper feedback I completely trust Cameron since he wrote the original, so it should be easier for him to tell if the CoffeeScrip implementation looks good in general.

arnavk commented 10 years ago

Forgot to post an update earlier, but I've added the docs.

arnavk commented 10 years ago

@camertron, removed the code that isn't being used. So, I think it's good to go.