FrankYFTang opened this issue 3 years ago
I believe the data files are generated from study_languages.py. However, I don't think they should be necessary for evaluating a model; they're only used for training. It looks like the Python code won't run at all unless those files are present, because importing constants.py fails when they are missing. @SahandFarhoodi?
There are data files needed to train and test models in the current version of my Python code. Some of these are data used to train/test the models (my.txt, the BEST data, etc.), and some are data files generated by my code that are also used at evaluation time, such as THAI_GRAPH_CLUST_RATIO, which is a Python dictionary that contains the frequent grapheme clusters in Thai. You can use the functions I have in study_languages.py to generate this dictionary yourself, but you will still need the BEST data files.
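For illustration, here is a minimal sketch of how such a ratio dictionary could be rebuilt from any Thai text and saved in the format that constants.py loads. This is not the actual code in study_languages.py: the `regex` module's \X matcher stands in for the ICU-based grapheme segmentation used in the repo, and the file paths are placeholders.

```python
# Sketch only: build a {grapheme cluster: relative frequency} dictionary from
# raw Thai text and save it the way constants.py expects to load it.
from collections import Counter
from pathlib import Path

import numpy as np
import regex  # third-party `regex` module; \X matches one grapheme cluster


def build_graph_clust_ratio(text: str) -> dict:
    """Count grapheme clusters in `text` and return their relative frequencies."""
    clusters = regex.findall(r"\X", text)
    counts = Counter(c for c in clusters if not c.isspace())
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


if __name__ == "__main__":
    # "some_thai_text.txt" is a placeholder; any Thai corpus will do.
    corpus = Path("some_thai_text.txt").read_text(encoding="utf-8")
    ratios = build_graph_clust_ratio(corpus)
    # constants.py reads this back with np.load(path, allow_pickle=True).item()
    np.save("Data/Thai_graph_clust_ratio.npy", ratios, allow_pickle=True)
```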
By the end of my internship, I shared a Google Drive folder with Shane (called "Dictionary Segmentation") that has all these files. I just shared the same folder with Frank.
Should we at least check into GitHub all the files needed to test/evaluate the segmentation? I think we should not check in all the data used to train the model, but for anything that is needed to run after training, should we check it into GitHub?
We shouldn't check in data files that are strongly coupled with the training data. Instead, it would be better design if the code didn't need those files to exist at all. Ideally the code should be able to pull what it needs directly from the model files.
I think the main data file that we need for the evaluation is the dictionary that has grapheme clusters in it (e.g. THAI_GRAPH_CLUST_RATIO). This dictionary already exists in the model files as well (that's how we use it in Rust), but my Python code reads that file directly (not from the model file) because that made it much easier to develop and change the algorithm. In addition, these dictionaries are almost independent of the training data, because we only use the training data to count grapheme clusters, and any Thai text (even unsegmented) can be used for this purpose and will result in a similar dictionary (I tried this).
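As a hypothetical illustration of the model-file route, here is a sketch of pulling the grapheme-cluster dictionary out of an exported model instead of the standalone .npy file; the "Models/Thai_model.json" path and the "dic" key are assumptions made for this example, not the repo's documented layout.

```python
# Hypothetical sketch: read the grapheme-cluster dictionary embedded in a
# model file so the evaluation code does not depend on a separate .npy file.
import json
from pathlib import Path


def load_graph_clust_dict_from_model(model_path: str) -> dict:
    """Return the grapheme-cluster dictionary stored in a model JSON file."""
    with Path(model_path).open(encoding="utf-8") as f:
        model = json.load(f)
    return model["dic"]  # key name is an assumption for illustration


# Example (path is a placeholder):
# clusters = load_graph_clust_dict_from_model("Models/Thai_model.json")
```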
OK, so I think we should probably just check the ratio files into the repo then. Otherwise, someone who downloads the repo won't be able to run the code. Does that sound okay to you @SahandFarhoodi ?
Yes, I think that's the best solution.
I tried to run the basic test but it mostly failed. It seems some data paths are hard-coded in constants.py, but those files do not exist.
ftang@ftang4:~/lstm_word_segmentation$ python3 test/test_helpers.py
Traceback (most recent call last):
  File "test/test_helpers.py", line 3, in <module>
    from lstm_word_segmentation.helpers import is_ascii, diff_strings, sigmoid
  File "/usr/local/google/home/ftang/lstm_word_segmentation/lstm_word_segmentation/helpers.py", line 2, in <module>
    from . import constants
  File "/usr/local/google/home/ftang/lstm_word_segmentation/lstm_word_segmentation/constants.py", line 7, in <module>
    THAI_GRAPH_CLUST_RATIO = np.load(str(path), allow_pickle=True).item()
  File "/usr/lib/python3/dist-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/google/home/ftang/lstm_word_segmentation/Data/Thai_graph_clust_ratio.npy'
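One way to avoid this import-time crash (just a sketch of the idea, not a change that exists in the repo) is to load the ratio file lazily and raise an actionable error only when the dictionary is actually needed:

```python
# Sketch only: defer loading Thai_graph_clust_ratio.npy until first use so that
# importing the package does not fail when the Data directory is absent.
from pathlib import Path

import numpy as np

_DATA_DIR = Path(__file__).resolve().parent.parent / "Data"  # assumed layout
_thai_graph_clust_ratio = None


def get_thai_graph_clust_ratio() -> dict:
    """Return the Thai grapheme-cluster ratio dictionary, loading it on demand."""
    global _thai_graph_clust_ratio
    if _thai_graph_clust_ratio is None:
        path = _DATA_DIR / "Thai_graph_clust_ratio.npy"
        if not path.exists():
            raise FileNotFoundError(
                f"{path} is missing; generate it with study_languages.py "
                "or copy it from the shared data folder."
            )
        _thai_graph_clust_ratio = np.load(str(path), allow_pickle=True).item()
    return _thai_graph_clust_ratio
```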