unicode-org / lstm_word_segmentation

Python code for training an LSTM model for word segmentation in Thai, Burmese, and similar languages.

Missing files in Data referenced by constants.py #5

Open FrankYFTang opened 3 years ago

FrankYFTang commented 3 years ago

I tried to run the basic tests, but they mostly failed.

It seems some data paths are hard-coded in constants.py, but the corresponding files do not exist.

```
ftang@ftang4:~/lstm_word_segmentation$ python3 test/test_helpers.py
Traceback (most recent call last):
  File "test/test_helpers.py", line 3, in <module>
    from lstm_word_segmentation.helpers import is_ascii, diff_strings, sigmoid
  File "/usr/local/google/home/ftang/lstm_word_segmentation/lstm_word_segmentation/helpers.py", line 2, in <module>
    from . import constants
  File "/usr/local/google/home/ftang/lstm_word_segmentation/lstm_word_segmentation/constants.py", line 7, in <module>
    THAI_GRAPH_CLUST_RATIO = np.load(str(path), allow_pickle=True).item()
  File "/usr/lib/python3/dist-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/google/home/ftang/lstm_word_segmentation/Data/Thai_graph_clust_ratio.npy'
```

sffc commented 3 years ago

I believe the data files are generated by study_languages.py. They shouldn't be necessary for evaluating a model, though; they're used for training. That said, it looks like the Python code won't run at all unless those files are present, because the import of constants.py fails without them. @SahandFarhoodi ?
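For reference, the failing line appears to be an import-time np.load in constants.py. The sketch below is reconstructed from the traceback above rather than copied from the repo, and the deferred-loading helper is only an illustration of one way to remove the hard dependency, not the current code:

```python
from pathlib import Path
import numpy as np

# Path reconstructed from the traceback above; the real constants.py may
# resolve it differently. The current code effectively does the load at
# import time:
#
#     THAI_GRAPH_CLUST_RATIO = np.load(str(path), allow_pickle=True).item()
#
# so a missing Data/*.npy file breaks every "from . import constants".
path = Path("Data") / "Thai_graph_clust_ratio.npy"

# Illustration only: deferring the load until the dictionary is actually
# requested would let the rest of the package import cleanly.
_thai_ratio_cache = None

def get_thai_graph_clust_ratio():
    global _thai_ratio_cache
    if _thai_ratio_cache is None:
        _thai_ratio_cache = np.load(str(path), allow_pickle=True).item()
    return _thai_ratio_cache
```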

SahandFarhoodi commented 3 years ago

There are data files needed to train and test models in the current version of my Python code. Some of these are the corpora used to train/test models (my.txt, the BEST data, etc.), and some are files generated by my code that are also used at evaluation time, such as THAI_GRAPH_CLUST_RATIO, a Python dictionary that contains the frequent grapheme clusters in Thai. You can use the functions in study_languages.py to generate this dictionary yourself, but you will still need the BEST data files.
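To make that concrete, the .npy file is just a plain Python dict saved with NumPy; here is a minimal sketch of the round trip, with made-up contents (the real dictionary is much larger and its values come from counting the training corpus):

```python
import numpy as np

# Keys are grapheme clusters, values their relative frequencies; the
# entries below are purely illustrative.
thai_graph_clust_ratio = {"กา": 0.012, "ไม่": 0.010, "ที่": 0.009}

# np.save wraps the dict in a 0-d object array ...
np.save("Thai_graph_clust_ratio.npy", thai_graph_clust_ratio)

# ... and .item() unwraps it again, as constants.py does in the traceback.
loaded = np.load("Thai_graph_clust_ratio.npy", allow_pickle=True).item()
assert loaded == thai_graph_clust_ratio
```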

At the end of my internship, I shared a Google Drive folder with Shane (called "Dictionary Segmentation") that has all of these files. I just shared the same folder with Frank.

FrankYFTang commented 3 years ago

> There are data files needed to train and test models in the current version of my Python code.

Should we at least check into GitHub all the files needed to TEST / evaluate the segmentation? I think we should not check in all the data used to train the model, but for anything that is needed to run AFTER training, should we check it into GitHub?

sffc commented 3 years ago

We shouldn't check in data files that are strongly coupled with the training data. Instead, it would be better design if the code didn't need those files to exist at all. Ideally the code should be able to pull what it needs directly from the model files.

SahandFarhoodi commented 3 years ago

I think the main data file we need for evaluation is the dictionary of grapheme clusters (e.g. THAI_GRAPH_CLUST_RATIO). This dictionary already exists in the model files as well (that's how we use it in Rust), but my Python code reads it directly from the data file rather than from the model file, because that made it much easier to develop and change the algorithm. In addition, these dictionaries are almost independent of the training data: we only use the training data to count the different grapheme clusters, and any Thai text (even unsegmented) can be used for this purpose and will produce a similar dictionary (I tried this).
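As a rough illustration of that point, building such a dictionary only needs a grapheme-cluster segmenter and a counter over some Thai text. The sketch below uses the third-party regex module's \X matcher rather than whatever study_languages.py actually uses, and the helper name is made up:

```python
from collections import Counter

import regex  # third-party package; \X matches one extended grapheme cluster


def graph_clust_ratio(text, top_n=500):
    """Return the top_n most frequent grapheme clusters and their ratios."""
    clusters = [c for c in regex.findall(r"\X", text) if not c.isspace()]
    counts = Counter(clusters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.most_common(top_n)}


# Any Thai text works as input, segmented or not, e.g.:
ratios = graph_clust_ratio("ภาษาไทยเขียนติดกันโดยไม่มีการเว้นวรรคระหว่างคำ")
```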

sffc commented 3 years ago

OK, so I think we should probably just check the ratio files into the repo. Otherwise, someone who downloads the repo won't be able to run the code. Does that sound okay to you, @SahandFarhoodi?

SahandFarhoodi commented 3 years ago

Yes, I think that's the best solution.