unicode-org / lstm_word_segmentation

Python code for training an LSTM model for word segmentation in Thai, Burmese, and similar languages.

pick_lstm_model parameters are too complicated to call #10

Open FrankYFTang opened 3 years ago

FrankYFTang commented 3 years ago

I have the following simple program to try running all the different models under

https://github.com/unicode-org/lstm_word_segmentation/tree/master/Models

It currently works for Thai_codepoints_exclusive_model4_heavy, but I am having trouble figuring out what values need to be passed in for the other models.

# Lint as: python3
from lstm_word_segmentation.word_segmenter import pick_lstm_model
import sys, getopt

"""
Read a file and output segmented results
"""

def main(argv):
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('test.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is', inputfile)
    print('Output file is', outputfile)

    # Read the input file; the context manager closes it when done
    with open(inputfile, 'r') as file1:
        lines = file1.readlines()

    word_segmenter = pick_lstm_model(model_name="Thai_codepoints_exclusive_model4_heavy",
                                     embedding="codepoints",
                                     train_data="exclusive BEST",
                                     eval_data="exclusive BEST")

    # Strip the newline character and segment each line
    for line in lines:
        line = line.strip()
        print(line)
        print(word_segmenter.segment_arbitrary_line(line))

if __name__ == "__main__":
    main(sys.argv[1:])

Could you specify what values should be used for embedding, train_data and eval_data for the other models?

Burmese_codepoints_exclusive_model4_heavy
Burmese_codepoints_exclusive_model5_heavy
Burmese_codepoints_exclusive_model7_heavy
Burmese_genvec1235_model4_heavy
Burmese_graphclust_model4_heavy
Burmese_graphclust_model5_heavy
Burmese_graphclust_model7_heavy
Thai_codepoints_exclusive_model4_heavy
Thai_codepoints_exclusive_model5_heavy
Thai_codepoints_exclusive_model7_heavy
Thai_genvec123_model5_heavy
Thai_graphclust_model4_heavy
Thai_graphclust_model5_heavy
Thai_graphclust_model7_heavy

Or is there a simple way we could have a wrapper function

get_lstm_model(model_name)

on top of pick_lstm_model() that just fills in the necessary parameters and calls pick_lstm_model()?

SahandFarhoodi commented 3 years ago

In this document, under "input_name", I explain the relationship between the names of the models and their hyperparameters. For pick_lstm_model it's actually much simpler: embedding should match the embedding that appears in the name of the model, e.g. if the model name contains codepoints you need embedding="codepoints", and if it contains graphclust you need embedding="grapheme_clusters_tf". The choice of train_data and eval_data shouldn't matter if you are segmenting arbitrary lines (by calling the segment_arbitrary_line function), which is what I see in your code. However, if you want to train and evaluate using the BEST data or the my.txt file, you need to set train_data and eval_data to the appropriate values explained in the link above.
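
For reference, a minimal sketch of the get_lstm_model wrapper proposed above, built only from the two naming rules stated in this comment. get_lstm_model is a hypothetical helper, not part of the repository, and the train_data/eval_data placeholders rely on the point above that they do not matter when only segment_arbitrary_line is called.

from lstm_word_segmentation.word_segmenter import pick_lstm_model

def get_lstm_model(model_name):
    # Hypothetical wrapper: infer the embedding from the model name using
    # the naming rules described in the comment above.
    if "codepoints" in model_name:
        embedding = "codepoints"
    elif "graphclust" in model_name:
        embedding = "grapheme_clusters_tf"
    else:
        # genvec models are not covered by the two rules stated here
        raise ValueError("cannot infer embedding from model name: " + model_name)
    # Placeholders: per the comment above, train_data/eval_data are not
    # used when only segment_arbitrary_line is called.
    return pick_lstm_model(model_name=model_name,
                           embedding=embedding,
                           train_data="exclusive BEST",
                           eval_data="exclusive BEST")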

SahandFarhoodi commented 3 years ago

In fact, it would be possible to get rid of the embedding parameter of pick_lstm_model if it were guaranteed that any model trained in the future follows the naming convention I explained in this link: https://github.com/unicode-org/lstm_word_segmentation/blob/master/Models%20Specifications.md. I just left it there because I wasn't sure that would be the case.

FrankYFTang commented 3 years ago

So when I use Thai_graphclust_model5_heavy, the embedding should be "grapheme_clusters_tf", right?

I think there is a bug somewhere for grapheme_clusters_tf.

The cluster IDs it produces do not make sense to me in some cases. For example, for the input

พิธีส

we should get 121, 234, 22 as the cluster IDs, but right now the Python code gives 121, 235, 22:

~/lstm_word_segmentation$ jq . Models/Thai_graphclust_model5_heavy/weights.json | egrep " (22|121|234|235),"
  "ส": 22,
  "พิ": 121,
  "ธี": 234,
  "ป่": 235,

My C++ code gives me 121, 234, 22, which does not match the Python output; this is before feeding anything into the LSTM.
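
For anyone reproducing this, a small sketch that reads the "dic" mapping straight out of weights.json and looks up the three grapheme clusters by hand. It shows the IDs the JSON file assigns, independent of the package's own dictionary loading; the clusters are split manually for this example.

import json

# Load the cluster-id dictionary stored in the model's weights file
with open("Models/Thai_graphclust_model5_heavy/weights.json") as f:
    dic = json.load(f)["dic"]

# Grapheme clusters of "พิธีส", split by hand for this example
for cluster in ["พิ", "ธี", "ส"]:
    print(cluster, dic.get(cluster))
# The weights.json above gives: 121, 234, 22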


FrankYFTang commented 3 years ago


OK, I think I know what is going on. I am using the data from the JSON file, and the Python code is using the .npy files inside the Data directory. Somehow they do not match.

Neither Thai_graph_clust_ratio.npy nor Thai_exclusive_graph_clust_ratio.npy has "ธี" as 234, yet Models/Thai_graphclust_model5_heavy/weights.json does.


FrankYFTang commented 3 years ago

I am pretty sure the Models/Thai_graphclust_model*/weights.json files were not generated from the current version of either Thai_exclusive_graph_clust_ratio.npy or Thai_graph_clust_ratio.npy in the Data directory, and I am not sure what output quality the Python code produces now with the current versions of these two files.

Most of the ordering is the same, but about 5-10% of the entries are different.

Check:

~/lstm_word_segmentation$ ls Models/Thai_graphclust_model*/weights.json | xargs jq .dic | egrep ": 234,"
  "ธี": 234,
  "ธี": 234,
  "ธี": 234,

You will see that all these Models/Thai_graphclust_model*/weights.json files were generated with "ธี" as item 234 in the grapheme cluster dictionary, but that is not the case in either Thai_exclusive_graph_clust_ratio.npy or Thai_graph_clust_ratio.npy.

There are other cases, for example 29:

~/lstm_word_segmentation$ ls Models/Thai_graphclust_model*/weights.json | xargs jq .dic | egrep ": 29,"
  "ม่": 29,
  "ม่": 29,
  "ม่": 29,

but in Thai_graph_clust_ratio.npy we have '"': 29 and 'ม่': 30, and in Thai_exclusive_graph_clust_ratio.npy we have 'ว่': 29 and 'ม่': 28.
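
A quick sketch for quantifying this mismatch, comparing the IDs in one model's weights.json against the order of the dictionary in the Data directory. It assumes the .npy file stores a pickled Python dict (hence allow_pickle=True) and that a cluster's ID is its position in that dict's insertion order, which matches how the numbers above were read.

import json
import numpy as np

# Cluster ids as stored in the trained model
with open("Models/Thai_graphclust_model5_heavy/weights.json") as f:
    json_dic = json.load(f)["dic"]

# Cluster ids implied by the ratio dictionary's insertion order
# (assumption: the .npy file holds a pickled dict of cluster -> ratio)
ratios = np.load("Data/Thai_graph_clust_ratio.npy", allow_pickle=True).item()
npy_ids = {cluster: i for i, cluster in enumerate(ratios)}

# Count the entries whose ids disagree between the two sources
diff = [c for c in json_dic if npy_ids.get(c) != json_dic[c]]
print("%d of %d cluster ids differ (%.1f%%)"
      % (len(diff), len(json_dic), 100.0 * len(diff) / len(json_dic)))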


SahandFarhoodi commented 3 years ago

Yes, apparently an old version of the dictionaries was on our shared Google Drive and I didn't notice it. Sorry if it wasted some of your time. I updated the *_ratio.npy files on our drive. I checked the updated "Thai_graphclust_ratio.npy" and it seems to give the same numbers that you mentioned above.

So the Python code that you ran was using flawed dictionaries (and I guess you got lower accuracy there), but what we had in the JSON files was up to date.

@sffc this should not affect our model performance in Rust, that's probably why we didn't spot it sooner.

FrankYFTang commented 3 years ago

Hmm... how about this: could you submit a PR to change https://github.com/unicode-org/lstm_word_segmentation/tree/master/Data to what it should be?


SahandFarhoodi commented 3 years ago

I made a commit (https://github.com/unicode-org/lstm_word_segmentation/commit/4bb9e074e25c7dba03ff24310e6dc25cb168ea28) that does this and left a comment for you there. I forgot to submit a PR, but I basically just changed the files and the lines of code that read/write the dictionaries. Please see the commit.

I also updated our Google drive accordingly.

FrankYFTang commented 3 years ago

OK, thanks. Let me try.
