FrankYFTang opened this issue 3 years ago
In this document, under "input_name", I explain the relationship between the names of the models and their hyperparameters. For `pick_lstm_model` it's actually much simpler: `embedding` should be the embedding that appears in the name of the model. For example, if you have `codepoints` in the name of the model we need `embedding="codepoints"`, and if we have `graphclust` in the name of the model we need `embedding="grapheme_clusters_tf"`. The choice of `train_data` and `eval_data` shouldn't matter if you are segmenting arbitrary lines (by calling the `segment_arbitrary_line` function), which is what I see in your code. However, if you want to train and evaluate using the BEST data or the my.txt file, you need to set `train_data` and `eval_data` to the appropriate values that I explained in the link above.

In fact, it would be possible to get rid of the `embedding` variable for `pick_lstm_model` if it were guaranteed that any model trained in the future follows the naming convention I explained in this link (https://github.com/unicode-org/lstm_word_segmentation/blob/master/Models%20Specifications.md), but I left it there because I wasn't sure if that's the case.
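In other words, the convention is simple enough to express in a couple of lines (a sketch only; this helper does not exist in the repository):

```python
def embedding_for(model_name):
    """Map a model name to the `embedding` argument of pick_lstm_model,
    following the naming convention described above."""
    if "codepoints" in model_name:
        return "codepoints"
    if "graphclust" in model_name:
        return "grapheme_clusters_tf"
    raise ValueError(f"no known embedding in model name: {model_name}")

# embedding_for("Thai_graphclust_model5_heavy") -> "grapheme_clusters_tf"
```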
So when I use Thai_graphclust_model5_heavy, the embedding should be "grapheme_clusters_tf", right?

I think there is a bug somewhere for grapheme_clusters_tf. The cluster IDs don't make sense to me for some clusters. For example, for the input พิธีส we should get 121, 234, 22 as the cluster IDs, but right now in Python we get 121, 235, 22:

```
~/lstm_word_segmentation$ jq . Models/Thai_graphclust_model5_heavy/weights.json | egrep " (22|121|234|235),"
    "ส": 22,
    "พิ": 121,
    "ธี": 234,
    "ป่": 235,
```

My C++ code gives me 121, 234, 22, which does not match the Python output. This is before feeding into the LSTM.
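For reference, the same lookup can be reproduced in Python (a diagnostic sketch; I'm assuming the cluster dictionary lives under the "dic" key of weights.json, which is what the jq .dic commands later in this thread rely on):

```python
import json

# Diagnostic sketch: print the cluster IDs that the model's weights.json
# assigns to the three grapheme clusters of the input "พิธีส".
with open("Models/Thai_graphclust_model5_heavy/weights.json") as f:
    dic = json.load(f)["dic"]

for cluster in ["พิ", "ธี", "ส"]:
    print(cluster, dic.get(cluster))
# expected: พิ 121, ธี 234, ส 22
```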
OK, I think I know what is going on. I am using the data from the JSON file, while the Python code is using the .npy files inside the Data directory, and somehow they do not match. Neither Thai_graph_clust_ratio.npy nor Thai_exclusive_graph_clust_ratio.npy has "ธี" as 234, but somehow Models/Thai_graphclust_model5_heavy/weights.json has "ธี" as 234.
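Here is roughly how to check the .npy side (a sketch; that the cluster IDs follow the stored dictionary's key order is my assumption):

```python
import numpy as np

# Load the pickled cluster-ratio dictionary saved with np.save and see
# which index "ธี" ends up at, assuming IDs follow the dict's key order.
ratios = np.load("Data/Thai_graph_clust_ratio.npy", allow_pickle=True).item()
ids = {cluster: i for i, cluster in enumerate(ratios)}
print(ids.get("ธี"))  # not 234 with the current files in Data
```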
I am pretty sure the Models/Thai_graphclust_model*/weights.json files were not generated from the current versions of Thai_exclusive_graph_clust_ratio.npy or Thai_graph_clust_ratio.npy in the Data directory, and I am not sure what the output quality from the Python code will be now with the current versions of these two files in the Data directory. Somehow most of the ordering is the same, but about 5-10% is different.

Check:

```
~/lstm_word_segmentation$ ls Models/Thai_graphclust_model*/weights.json | xargs jq .dic | egrep ": 234,"
    "ธี": 234,
    "ธี": 234,
    "ธี": 234,
```

You will see that all these Models/Thai_graphclust_model*/weights.json files were generated with "ธี" as item 234 in the grapheme cluster dictionary, but that is not the case in either Thai_exclusive_graph_clust_ratio.npy or Thai_graph_clust_ratio.npy.

There are other cases, for example 29:

```
~/lstm_word_segmentation$ ls Models/Thai_graphclust_model*/weights.json | xargs jq .dic | egrep ": 29,"
    "ม่": 29,
    "ม่": 29,
    "ม่": 29,
```

but in Thai_graph_clust_ratio.npy we have '"': 29 and 'ม่': 30, and in Thai_exclusive_graph_clust_ratio.npy we have 'ว่': 29 and 'ม่': 28.
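A quick way to quantify the mismatch (again a sketch under the same key-order assumption, not repository code):

```python
import json
import numpy as np

# Compare the cluster-ID mapping baked into a model's weights.json with
# the mapping implied by a Data/*.npy ratio file.
with open("Models/Thai_graphclust_model5_heavy/weights.json") as f:
    model_ids = json.load(f)["dic"]

ratios = np.load("Data/Thai_graph_clust_ratio.npy", allow_pickle=True).item()
data_ids = {cluster: i for i, cluster in enumerate(ratios)}

diff = [c for c in model_ids if data_ids.get(c) != model_ids[c]]
print(f"{len(diff)} of {len(model_ids)} cluster IDs differ")
```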
Yes, apparently an old version of the dictionaries was on our shared Google Drive and I didn't notice it. Sorry if it wasted some of your time. I updated the *.ratio files on our drive. I checked the updated file "Thai_graphclust_ratio.npy" and it seems to give the same numbers that you mentioned above.

So the Python code that you ran was flawed (and I guess you got lower accuracy there), but whatever we had in the JSON files was up to date.

@sffc this should not affect our model performance in Rust; that's probably why we didn't spot it sooner.
Hmm... how about this: could you submit a PR to change https://github.com/unicode-org/lstm_word_segmentation/tree/master/Data to what it should be?
I made a commit (https://github.com/unicode-org/lstm_word_segmentation/commit/4bb9e074e25c7dba03ff24310e6dc25cb168ea28) that does this and left a comment for you there. I forgot to submit a PR, but I basically just changed the files and the lines of code that read/write the dictionaries. Please see my commit.

I also updated our Google Drive accordingly.
OK, thanks. Let me try.
I have the following simple program to see how to run all the different models under https://github.com/unicode-org/lstm_word_segmentation/tree/master/Models

It currently works for Thai_codepoints_exclusive_model4_heavy, but I am having trouble figuring out what values need to be passed in for the other models. Could you specify what values should be used for `embedding`, `train_data`, and `eval_data` for these models?

```
Burmese_codepoints_exclusive_model4_heavy
Burmese_codepoints_exclusive_model5_heavy
Burmese_codepoints_exclusive_model7_heavy
Burmese_genvec1235_model4_heavy
Burmese_graphclust_model4_heavy
Burmese_graphclust_model5_heavy
Burmese_graphclust_model7_heavy
Thai_codepoints_exclusive_model4_heavy
Thai_codepoints_exclusive_model5_heavy
Thai_codepoints_exclusive_model7_heavy
Thai_genvec123_model5_heavy
Thai_graphclust_model4_heavy
Thai_graphclust_model5_heavy
Thai_graphclust_model7_heavy
```

Or is there a simple way we could just have a function `get_lstm_model(model_name)` on top of `pick_lstm_model()` that fills in the necessary parameters and calls `pick_lstm_model()`? Something like the sketch below.
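To illustrate what I mean (a sketch only, not working repository code: the wrapper name is from my question above, and I am assuming `pick_lstm_model` accepts the model name plus the three keyword arguments discussed in this thread):

```python
def get_lstm_model(model_name, train_data=None, eval_data=None):
    """Hypothetical wrapper: derive `embedding` from the model name and
    forward everything to pick_lstm_model (assumed signature)."""
    if "codepoints" in model_name:
        embedding = "codepoints"
    elif "graphclust" in model_name:
        embedding = "grapheme_clusters_tf"
    else:
        # e.g. the genvec models; the right embedding value for these is
        # exactly what this question is asking about.
        raise ValueError(f"embedding unknown for model: {model_name}")
    # train_data/eval_data only matter when training or evaluating on the
    # BEST / my.txt data, so they are left to the caller here.
    return pick_lstm_model(model_name, embedding=embedding,
                           train_data=train_data, eval_data=eval_data)

# e.g. model = get_lstm_model("Thai_graphclust_model5_heavy")
```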