Closed markbrown closed 8 years ago
I don't quite like exposing the concept of dictionary via a dedicated type ThDict. Thai word segmentation can be implemented with several approaches, including rule-based, dictionary-based, trigram-based, or other statistical ones. That's why I used a generic name like ThBrk in the discussion. The dictionary should be hidden, or at least the API should not emphasize it.
With this, the free list in BrkEnv can also be moved into ThBrk instance, to promote more resource reuse.
In src/libthai.map:
+LIBTHAI_0.1.24svn { ... +} LIBTHAI_0.1.19;
Please remove the "svn" suffix from symbol versioning.
Thanks for your work!
On Mon, May 16, 2016 at 5:18 PM, thep notifications@github.com wrote:
I don't quite like exposing the concept of dictionary via a dedicated type ThDict. Thai word segmentation can be implemented with several approaches, including rule-based, dictionary-based, trigram-based, or other statistical ones. That's why I used a generic name like ThBrk in the discussion. The dictionary should be hidden, or at least the API should not emphasize it.
Ah, I see. I have changed the names to what you originally suggested.
With this, the free list in BrkEnv can also be moved into ThBrk instance, to promote more resource reuse.
Isn't this still needed to be separate for thread safety? I expect applications will create a single ThBrk for use by all threads, but each thread will have to create its own BrkEnv including its own free list. Same reason why BrkEnv was created in the first place, if I understand correctly.
In src/libthai.map:
+LIBTHAI_0.1.24svn { ... +} LIBTHAI_0.1.19;
Please remove the "svn" suffix from symbol versioning.
Fixed.
Thanks for your work!
I'm happy to contribute :)
On Wed, May 18, 2016 at 6:18 AM, markbrown notifications@github.com wrote:
On Mon, May 16, 2016 at 5:18 PM, thep notifications@github.com wrote:
With this, the free list in BrkEnv can also be moved into ThBrk instance, to promote more resource reuse.
Isn't this still needed to be separate for thread safety? I expect applications will create a single ThBrk for use by all threads, but each thread will have to create its own BrkEnv including its own free list. Same reason why BrkEnv was created in the first place, if I understand correctly.
Ah, you're right. Let's leave it separated, then.
In src/libthai.map:
+LIBTHAI_0.1.24svn { ... +} LIBTHAI_0.1.19;
Please remove the "svn" suffix from symbol versioning.
Fixed.
Just want to make sure that you have also bumped the version to 0.1.25.
Theppitak Karoonboonyanan http://linux.thai.net/~thep/
On Wed, May 18, 2016 at 3:05 PM, Theppitak Karoonboonyanan < notifications@github.com> wrote:
Just want to make sure that you have also bumped the version to 0.1.25.
Fixed now.
I sent that rather quickly. Did you mean bump the version in configure.ac also?
On Thu, May 19, 2016 at 3:20 AM, markbrown notifications@github.com wrote:
I sent that rather quickly. Did you mean bump the version in configure.ac also?
No. The version in configure.ac is only bumped on new releases.
Theppitak Karoonboonyanan http://linux.thai.net/~thep/
--- a/src/libthai.c +++ b/src/libthai.c @@ -53,8 +53,14 @@
- th_wcisldvowel(), th_wcisflvowel(), th_wcisupvowel(), th_wcisblvowel(),
- th_wcchlevel(), th_wciscombchar() *
- * @subsection ThBrk Functions for word segmentation
- * @subsection ThDictBrk Functions for word segmentation
s/ThDictBrk/ThBrk/
- * @subsection ThBrk Functions for word segmentation with the shared dictionary
- *
- * th_brk_get_shared(), th_brk_free_shared(),
- th_brk(), th_brk_line(), th_wbrk(), th_wbrk_line()
I think the shared dictionary should be hidden from user. Its purpose is just to retain old function behaviors internally.
So, please rename th_brk_get_shared() and th_brk_free_shared() to brk_get_shared_dict() and brk_free_shared_dict() and make them private.
--- a/src/thbrk/thbrk.c +++ b/src/thbrk/thbrk.c @@ -33,12 +33,46 @@
include "thbrk-utils.h"
include "brk-ctype.h"
include "brk-maximal.h"
+#include "brk-common.h"
define MAX_ACRONYM_FRAG_LEN 3
/**
- * @brief Create a word break dictionary instance
- *
- * @param dictpath : the dictionary path, or NULL for default
- *
- * @return the created instance, or NULL on failure
- *
- * Loads the dictionary from the given file and returns the created word
- * break dictionary instance. If @a dictpath is NULL, first searches in
- * the directory given by the LIBTHAI_DICTDIR environment variable,
- * then in the library installation directory. Returns NULL if the
- * dictionary file is not found or cannot be loaded.
- / +ThBrk +th_brk_new (const char *dictpath)
In the documentation, please describe ThBrk as "word breaker" instead of "word break dictionary". This includes the 'brk' argument in thbrk* functions.
As we have discussed, ThBrk represents a generic idea of word breaking engines.
As an exception, the documentation for th_brk_new() may describe it as "dictionary-based word breaker" for now.
Imagine when other approaches are added, we can create other constructors and reimplement the th_brk_brk() et. al. with polymorphism. So, the document should not mention the dictionary.
Thanks!
I've addressed your latest comments.
I've also removed the "const" from ThBrk in the places where I had added it. The reason I put it there in the first place was to help users avoid accidentally deleting the shared dictionary (instead of using the function for that purpose, which also clears the static variable), but we don't need to worry about that if the shared dictionary is not exposed.
Thanks for the feedback!
I feel a bit uneasy to see ThBrk-related stuffs taken apart in two files, thbrk.c and brk-common.c.
Previously, brk-common.c held 2 functionalities regarding the dictionary, loading default and keeping the shared instance. Trying to keep both in the old place makes us put struct _ThBrk details in brk-common, while the more natural place for it should be thbrk.c.
Let's implement ThBrk in thbrk.c, including keeping the shared instance. brk-common.c should only hold the brk_load_default_dict() function and it shouldn't know anything about ThBrk.
Thanks!
That's a good idea. brk-common.c won't need to depend on thbrk at all, but there are three other dependencies to deal with:
The last of these I have circumvented by allowing callers to pass NULL to th_brk_brk to get the shared dictionary, although this is undocumented. To deal with the dependencies properly I'll need to declare non-public functions for thbrk in a header file. Should I use thbrk-utils.h for this purpose, or create a new header file such as thbrk-priv.h?
Thanks for the reviews.
Sorry that I somehow missed the notification, until I log into GitHub to check it.
thbrk-utils.h is meant for thbrk-module-wide utilities, while the ThBrk structure is not. Let's add a new header file thbrk-priv.h to hold the non-public declarations.
Done. How does it look to you?
Looks great!
BTW, I just realize that the names for brk_{get,free}_shared_dict()
should be brk_{get,free}_shared_brk()
instead, to reflect the ThBrk object they manipulate.
Otherwise, I think it's ready for merging.
Fixed.
Merged. Thank you very much for your contribution!
On Wed, Jun 22, 2016 at 6:26 PM, Theppitak Karoonboonyanan < notifications@github.com> wrote:
Merged. Thank you very much for your contribution!
You're welcome!
Looking forward to the next release. Do you have an expected time for that?
Mark
It should be released soon, after I finish updating other parts related to the API addition.
Before they get public, I decide to rename the ThBrk methods to be more meaningful:
The new API is as discussed on the libthai forum.
I've changed the names from what was first suggested, however, because I found the suggested names very confusing (th_brk_brk, th_brk_wbrk, etc).
I've also exported functions to access the shared dictionary used by the backwards compatible part of the implementation. These can be used to force loading as th_ensure_dict_loaded() did in an earlier proposal, and to safely free up resources if the application knows they are not needed. The shared dictionary has a const type to prevent it being explicitly deleted.
With this API you can use the word breaking functions in two ways:
In other words, if you want a non-default dictionary you have to manage the sharing yourself.