rkcosmos / deepcut

A Thai word tokenization library using Deep Neural Network
MIT License

Semi-supervised training #10

Closed sura1997 closed 7 years ago

sura1997 commented 7 years ago

Curious as to what your plan is to further enhance this with semi-supervised training. Is there anything specific you have in mind already? Any help that might be needed?

rkcosmos commented 7 years ago

The steps are as follows:

  1. Prepare a new, large unlabeled dataset, preferably with diverse content. This can be done by web crawling, or by simply copying and pasting any Thai text into .txt files.
  2. Tokenize this dataset with our program and treat the result as pseudo-labeled (see the sketch below).
  3. Combine the BEST corpus with the pseudo-labeled dataset and train the model again. This time, the neural network will learn something new from the pseudo-labeled data.

If anyone wants to help, contributing to the first step would be appreciated.
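
A minimal sketch of steps 1 and 2, assuming the raw text has been collected into a folder of .txt files (the folder names raw_text/ and pseudo_labeled/ below are just illustrative):

import glob
import os

import deepcut

os.makedirs('pseudo_labeled', exist_ok=True)
for path in glob.glob('raw_text/*.txt'):
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Tokenize with the current model and write a pipe-delimited,
    # BEST-style pseudo-labeled file.
    tokens = deepcut.tokenize(text)
    out_path = os.path.join('pseudo_labeled', os.path.basename(path))
    with open(out_path, 'w', encoding='utf-8') as out:
        out.write('|'.join(tokens))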

sura1997 commented 7 years ago

On step one: have you done that with Thai Wikipedia yet? I downloaded it and tokenized it for something else in my project. I was using another tokenizer, which was not really accurate. I can try again with deepcut if you haven't done so.

Also, for steps 2 and 3, I'm not clear on how that process will help, since deepcut has only about 90% similarity in tokenization results with BEST, right? Unless we re-label the mistakes made during step 2, we could feed the 10% mistakes back into the model as the desired outcome, which could tilt the model toward lower accuracy.

Maybe I just don't understand the whole thing. It would be great if you could help elaborate and clarify this process (or provide some references so that I can learn more). Thank you.

rkcosmos commented 7 years ago

I would love to get the Thai Wikipedia data. How large is the file? If it's not too large, pls send it to r.kittinaradorn (at) gmail.com.

The reason behind steps 2 and 3 is that the hidden layers of a neural network learn not only to match inputs with labels but also other things about the data. By feeding the network a sufficient amount of diverse data, even with imperfect labels, it can build a better internal representation (hidden layers) and thus achieve better overall performance.

sura1997 commented 7 years ago

Let me run it again using deepcut and see how big the data is. All I remember is that it took many hours to process. What format should I generate the file in? Using a pipe as the word delimiter? What about a sentence delimiter? A space for that, like "| |"? If you can give me a sample file, I can make sure my output can be easily used in your training.

rkcosmos commented 7 years ago

article_00001.txt

This is a sample training file from the BEST corpus. Please ignore any tags like <NE>...</NE> or <AB>...</AB>, since deepcut cannot generate these tags yet.
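
If it helps, here is a small sketch of stripping those tags from a BEST-format file before training (the helper name and regex are just one way to do it, not part of deepcut):

import re

def strip_best_tags(text):
    # Drop the <NE>/<AB> markers but keep the enclosed words and '|' boundaries.
    return re.sub(r'</?(?:NE|AB)>', '', text)

with open('article_00001.txt', encoding='utf-8') as f:
    cleaned = strip_best_tags(f.read())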

titipata commented 7 years ago

There is a Thai Wikipedia link in this repo: https://github.com/kobkrit/nlp_thai_resources. The unzipped size of the file is 206 MB. There are a few more tokenized datasets from wannaphongcom, but they are released under a Creative Commons license; I'm not sure if we can use them freely.

sura1997 commented 7 years ago

For Thai Wikipedia, it's a single 1.5 GB XML file that contains metadata in XML and the text in wiki markup. If you already have code to clean all that up and get to the actual corpus, please let me know so I don't need to run mine. If you don't, I can help generate a BEST-like format text file from the wiki XML.

On the licensing concerns, I am also unclear about the licensing of this project, which could be a bigger issue. From the code perspective, it's clear: the MIT license is perfect for being freely open source (I assume the goal of this project is a freely open-source library that anyone can use for any purpose). The source code part is fine, and we can keep it under the MIT license.

From the data perspective, because the model's weights are trained on the BEST corpus, NECTEC could later argue that the weights are derivative works of BEST and should be bound by the same CC-BY-NC-SA license as BEST. That would mean the data (and arguably the weights) is prohibited from commercial use due to the "NC" condition NECTEC imposes in their license, which would run counter to this project's goal of being freely open source.

In terms of what kind of licensing is widely accepted for this kind of open-source project, I would say CC-BY-SA is the common denominator for data licensing. The same license is used by Wikipedia, so if we use Thai Wikipedia as a corpus, our model might be bound by CC-BY-SA anyway.

This means that, if we want to do it right, the license for the current best_cnn3.h5 might need to be CC-BY-NC-SA with attribution to BEST from NECTEC.

We may need to create another set of weights that are not trained on BEST in order to license the weights under CC-BY-SA. We could use another free tokenization method (like a dictionary-based one) with lower accuracy to generate a small seed corpus from Thai Wikipedia and use it for training. I think we could still evaluate our pre-trained model on the BEST corpus to measure accuracy against the BEST benchmark without worrying about their license, since we wouldn't use it for training (only to check our model's results). We could improve accuracy further by acquiring more CC-BY-SA corpora.

Any thoughts on this?

titipata commented 7 years ago

@sura1997, I vote that we keep the MIT license and not be bound by other licenses. I guess @rkcosmos can decide on that. Hopefully he's on board with MIT too :P

I can explore the Wikipedia dataset and update you here later on. Note that if we use another method to tokenize it, we still have to check the quality of the resulting corpus.

sura1997 commented 7 years ago

Ideally, yes, we want only the MIT license for both code and data in this project. If that's possible, it is the most open license we can give people. However, I'm not sure whether that is technically feasible if we still use any of the CC-BY-SA corpora. For example, if we use the Wikipedia corpus, their license says that any derived work has to be licensed under the same CC-BY-SA license (under the "SA" share-alike clause).

I have seen another open-source project that creates word vectors from CC-BY-SA corpora. Their project even combines inputs from many sources (some with completely free licenses, some CC-BY-SA) to produce the final model. The project has two licenses, one for the code and one for the model data (the final word vectors). They kept the code very open, like MIT, and the word-vector license is bound by the requirements of the licenses of the original training corpora, which is CC-BY-SA. I'm just thinking that situation is similar to what we have here.

korakot commented 7 years ago

I think we can use the MIT license for the data. This falls under 'Fair Use' of the corpora. http://fairuse.stanford.edu/overview/fair-use/four-factors/

The 4 criteria of Fair Use are:

  1. The purpose and character of your use

    • Transformation, adding new meaning
    • Value added, new insight and understanding
    • Same as with parody and literary criticism
  2. The nature of the copyrighted work

    • The corpora are 'factual', not fiction/novels
    • They have already been published freely
  3. The amount and substantiality of the portion taken

    • We take very little from the works, only a calculation result used to build a model
  4. The effect of the use upon the potential market

    • The work is not on sale. Our model doesn't affect the profits from the work.
    • Our model is not commercial; we don't gain profit from it.

So, I'm sure we can win in court. ^_^ It's up to us whether we want to take this small risk.

sura1997 commented 7 years ago

I don't think we will need to go to court anytime soon :) I just see that deepcut has great potential even for commercial use, which would help accelerate the whole Thai NLP movement on both the academic and commercial sides (I think it is currently lacking, as major research has been locked up under the "not for commercial use" terms of the NECTEC license). If we rely on Fair Use, then we (or the companies who use our library to build their products) may not be able to claim #4.

For example, Facebook uses the Wikipedia corpus to feed their fastText model. The fastText code itself is free under the BSD license (https://github.com/facebookresearch/fastText/blob/master/LICENSE), but they distribute the pre-trained word vectors separately under CC-BY-SA to follow Wikipedia's share-alike requirement (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md), even though they transformed millions of pages into 300-dimensional vectors for hundreds of thousands of words.

korakot commented 7 years ago

#4 considers the commercial impact on the copyrighted work (e.g. the corpora).

Even if a company uses deepcut in its commercial activity, it won't impact the 'sales' of the corpora at all. The corpora aren't even on sale, and the new product is totally unrelated to the corpora business.

I looked into fastText: they decided to stay on the safe side and consider their model a derivative work of CC-BY-SA data. So, as good manners and to create goodwill, they distribute it under the same license. We can do the same, if we want to be polite.

I only suggest that we could also do it the impolite, but legal, way. ^_^

titipata commented 7 years ago

@sura1997 actually brings up a good point. It has real potential for commercial use. I think we're definitely fine with the MIT license for the repository. However, I'm not sure whether NECTEC could claim anything if someone brings deepcut into commercial use.

But yeah, I agree with @korakot that the sales would not come from the dataset itself but from a model derived from the dataset. Therefore, that should be fine, I think.

PS. If someone can ask a representative from NECTEC to clarify, that would be great too. I'm sure you guys know someone there?

rkcosmos commented 7 years ago

The code will remain under the MIT license, as everyone suggests. As for the model data, I would love it to be MIT, or at the very least CC-BY-SA, so that everyone on both the academic and commercial sides can participate and accelerate the whole research effort. I think we can politely ask NECTEC to join the project, as they have already contributed their corpus. This would create a win-win-win situation where everyone should be happy. Since @korakot has a good relationship with NECTEC, can you try to initiate this collaboration? Hopefully, when we are all on the same team, NECTEC may release other corpora for our model.

Regarding the Thai Wikipedia data, I don't have it in BEST format yet, so it would be nice if @sura1997 could clean it up and publish it somewhere everyone can use. :)

korakot commented 7 years ago

Sure, I will start the talks with them. I don't know how fast/responsive they will be.

sura1997 commented 7 years ago

I thought @titipata was going to look into re-formatting Thai Wikipedia into BEST format. If that hasn't started yet, no worries, I already have the code to do that; I just need to add deepcut tokenization to it and insert the pipes. I can do that later this week if we're not in a hurry. I just have another project deadline early this week :)

titipata commented 7 years ago

@korakot, that sounds great. It's about time to put this NLP development in industry hands, not just the academic world :)

@sura1997, that's perfect! Just to make sure: you can run the deepcut model on Wikipedia, but you can't directly add the output as a training set, because the model won't improve by seeing text it has just tokenized itself. I think we should start from the errors we make on the test set of the BEST corpus, and then find examples from Wikipedia to add on top of that, to mitigate the errors we make on the BEST corpus.

korakot commented 7 years ago

In addition to Wikipedia, there's this HSE corpus. http://web-corpora.net/ThaiCorpus/search/

It includes both Wikipedia and other texts not restricted by copyright (mostly news sources). You can try it and have a look.

korakot commented 7 years ago

I have just asked Dr. Chai, who is the director of the IT research department.

He is happy to cooperate with us, but he confirmed that we don't need to ask NECTEC for permission. The model we train belongs to us; we can release it under whatever license we choose.

titipata commented 7 years ago

@korakot, that's great news! I'm glad this worked out so easily!

rkcosmos commented 7 years ago

My best explanation of how semi-supervised learning works without manual labeling:

semi-supervised.pdf semi-supervised2.pdf

titipata commented 7 years ago

So I did some error analysis on the test set to see what errors deepcut makes. I put the snippet and the results I found below.

import os
import deepcut
import numpy as np
import pandas as pd
from train import *  # provides prepare_feature()

x_test_char, x_test_type, y_test = prepare_feature('cleaned_data', option='test')
y_predict = deepcut.deepcut.model.predict([x_test_char[:100000], x_test_type[:100000]])
y_predict = (y_predict.ravel() > 0.5).astype(int)
error_position = np.where(y_predict != y_test[:100000])[0] # see position where the model predicts wrong

best_processed_path = 'cleaned_data'
article_types = ['article', 'encyclopedia', 'news', 'novel']
option = 'test'
df = []
for article_type in article_types:
    df.append(pd.read_csv(os.path.join(best_processed_path, option, 'df_best_{}_{}.csv'.format(article_type, option))))
df = pd.concat(df)

# build a BEST-vs-deepcut tokenization pair for a window of characters
def compare_tokenize_pair(df):
    text = ''.join(df.char)
    word_end = list(df.target.astype(int))[1:] + [1]

    # tokenize from BEST dataset
    tokens = []
    word = ''
    for char, w_e in zip(text, word_end):
        word += char
        if w_e:
            tokens.append(word)
            word = ''

    tokenized_best = '|'.join(tokens)
    tokenized_deepcut = '|'.join(deepcut.tokenize(text))
    return {'best': tokenized_best, 'deepcut': tokenized_deepcut}

error_list = [df.iloc[e - 20: e + 20] for e in error_position]
error_pairs = [compare_tokenize_pair(edf) for edf in error_list] # error made from deepcut, 981 pairs from 100,000 samples

Here are sample texts that the model tokenizes incorrectly:

'ผู้ประพันธ์เป็นร่างทรงของงานเขียน' # ร่างทรง in the test set >> deepcut gets ร่าง|ทรง
'ที่มนุษย์นำมาใช้สื่อสารซึ่งกันและกันเบื้องต้นระหว่างคนในชุมชนก็คือ "การพูด" (speaking)ปิแอร์ บูดิเออร์ (Pierre Bourdieu)' # `)ปิแอร์` is wrong, but this looks like a preprocessing problem
'จะไม่สามารถมองได้ทีเดียวครบทั้งสามมิติ' # deepcut gets ทีเดียว but the test set tokenizes it as ที|เดียว
'ก็เป็นสิ่งที่ถูกรู้แบบ "active"' # deepcut gets รู้แบบ instead of รู้|แบบ
'การมองแบบขาดๆ หายๆ' # the test set has ขาดๆ หายๆ as one word >> deepcut gets ขาด|ๆ|หาย|ๆ
'วิเคราะห์) ของ ฌากส์ ลากอง' # deepcut gets ฌากส์| |ลากอง but the test set has ฌากส์ ลากอง as one word; named entity containing a space
'จาคอบ ดี. เบเคนสไตน์' # named-entity problem again
'1 มิ.ย.-ก.ย. 2540' # a new model trained with the `NE`, `AB` tags removed should fix this issue
'ซึ่งแนวคิดนี้เลียวทาร์ได้หยิบยืมมาจาก' # deepcut gets เลียว|ทาร์ได้

It seems most of the errors come from named entities that contain spaces. Once the model makes an error on a named entity, it tends to make follow-on errors too, so performance decreases when we remove the <NE> tags.

rkcosmos commented 7 years ago

The )ปิแอร์ case is very strange; I would expect the neural net to be able to learn a simple rule like starting a new word every time it sees ')'. I will create another weight with named entities and abbreviations removed.

titipata commented 7 years ago

Yeah, that's what I thought the model would learn too. Most of the time, the model does predict ( and space as the start of a new word; ปิแอร์ might just be an exception in this particular case. Note that named entities always come with spaces, so it might be hard to get that prediction right.

I'll spend some time tomorrow writing code to check these errors systematically and see what the current model makes mistakes on. If they are mostly named entities, we can figure out ways to deal with them.

sura1997 commented 7 years ago

My understanding of what you're saying is that the model learned from a corpus of properly written text, which rarely has ')' without a space after it. We should do preprocessing on the input, right?

If we can do that, we might want to add preprocessing for a few other cases like the ones below, maybe? I'm not sure whether we want to support that or whether it's the app's responsibility to do basic cleaning. If we don't want to spend time on cleaning inside the tokenizer, maybe we can offer a separate function to help app developers do preprocessing? Maybe I'm too picky :)

At least I found these: 1) LF or CR. The model knows to break on other symbols like -, /, ๆ, etc., but not on LF/CR. E.g.
กินเยอะๆ หายไวๆ comes out as ['กิน', 'เยอะๆ\n', 'หาย', 'ไว', 'ๆ']

2) A mixture of Thai and English without spaces. E.g. คนนั่งโต๊ะโน้นหน้าเหมือนjohnny deppเลยหวะ comes out as ['คน', 'นั่ง', 'โต๊ะ', 'โน้น', 'หน้า', 'เหมือ', 'น', 'johnny', ' ', 'depp', 'เลย', 'หวะ']

Well, we could also keep 'johnny depp' together as a named entity, but I guess we don't need to worry much about segmenting English words, as that's out of scope. At least preprocessing that adds spaces at Thai/English boundaries would help the Thai tokenization come out as 'เหมือน' instead of 'เหมือ', 'น'. A rough preprocessing sketch for both cases is below.
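
A rough sketch of such a preprocessing helper for the two cases above (the function and regexes are illustrative, not part of deepcut):

import re

import deepcut

def preprocess(text):
    # Turn LF/CR into plain spaces so they behave like other separators.
    text = re.sub(r'[\r\n]+', ' ', text)
    # Insert a space at Thai/Latin boundaries, e.g. 'เหมือนjohnny' -> 'เหมือน johnny'.
    text = re.sub(r'([\u0E00-\u0E7F])([A-Za-z])', r'\1 \2', text)
    text = re.sub(r'([A-Za-z])([\u0E00-\u0E7F])', r'\1 \2', text)
    return text

tokens = deepcut.tokenize(preprocess('คนนั่งโต๊ะโน้นหน้าเหมือนjohnny deppเลยหวะ'))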

You know, when we have to nitpick small details like this, it means deepcut is pretty good already :)

sura1997 commented 7 years ago

@rkcosmos thanks for the visualization of your proposed semi-supervised learning process.

I'm not sure whether I understand your explanation correctly, but I will try my best to explain my understanding here.

Let's say we have a labeled corpus (e.g. BEST), and it happens to be clustered with squares mostly in quadrants I and IV, and triangles mostly in quadrants II and III. Then our model is trained, so the decision boundary is a vertical line. We get 98% right, with about 2% misclassified.

Now, let's say we have another, unlabeled corpus (e.g. Wikipedia). The nature of this new corpus happens to be that squares and triangles cluster along a +45-degree boundary (most squares in quadrant I, some in II and IV; most triangles in quadrant III, some in II and IV; all of them along a +45-degree boundary line). The hope is that the model will start to adjust its weights to fit the better boundary in figure 1. Since the data are unlabeled, we don't know which points are actually squares or triangles, so we automatically label them based on our current model (the vertical line). Some amount of data will therefore be mislabeled, possibly much more than 2%, assuming the nature of the new data differs enough from the previously trained corpus.

Now we randomly introduce some of the Wikipedia auto-labeled data to the model. If the random selection is even, the ratio of errors will be the same as in the initial auto-labeling step.

After training, I don't think the model would move the boundary away from the original vertical line, because the newly introduced auto-labeled data will just reinforce the existing vertical boundary with more inputs. That is because there are no new squares from the Wikipedia data in quadrant II, or triangles in quadrant IV, since those were mislabeled in the auto-labeling step.

Maybe I missed some parts. I really appreciate your patience in trying to explain these steps.

titipata commented 7 years ago

@sura1997 Yeah, deepcut tries to cut named entities that contain a space into separate tokens, e.g. จอห์นนี่ เดปป์ into [จอห์นนี่, เดปป์], where these could be Thai words too. That's why the testing performance is lower in this notebook.

What I'm trying to say is that we should first look at the errors we make on the test set that are not named-entity mistakes. It might be hard to capture จอห์นนี่ เดปป์ as one word because the model tries to split on the space. However, for things like `ทีเดียว`, `รู้แบบ` and so on, we can find extra training data to mitigate them.

korakot commented 7 years ago

For an extra training dataset, the Thai National Corpus can be used to look at a specific phrase. It's a resource aimed at linguists, so there's no download or API access, but we can search through its online query interface, or via direct URL access like this:

www.arts.chula.ac.th/~ling/TNCII/x3.php?p=รู้&w2=แบบ&wl=0&wr=1

It will look for the two words "รู้" and "แบบ", where the second word appears in a window of left=0, right=1. That means "แบบ" immediately follows "รู้", but we can search over a longer-distance context if we wish.

Or you could just use Google to find "รู้แบบ", but TNC is designed to be "balanced", which should give us a more representative sample.
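
For scripted lookups, one could build that query URL programmatically. A small sketch, assuming the page can be fetched with a plain GET (the parameter names are the ones shown in the URL above; the result is an HTML page meant for browsing, not an API response):

from urllib.parse import urlencode

import requests

params = {'p': 'รู้', 'w2': 'แบบ', 'wl': 0, 'wr': 1}
url = 'http://www.arts.chula.ac.th/~ling/TNCII/x3.php?' + urlencode(params)
html = requests.get(url).text  # HTML result page to inspect or scrape manually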

rkcosmos commented 7 years ago

http://rinuboney.github.io/2016/01/19/ladder-network.html

This link offers some explanations similar to what I drew.

rkcosmos commented 7 years ago

I just uploaded a new weight. @titipata, please run the error analysis again with this new weight to see if there's any improvement on named entities.

titipata commented 7 years ago

It works a lot better! I ran the script above for error analysis on the first 100,000 rows. Errors dropped from 981 cases to 571. Here is the error file from the current weight and the error file from the new weight; I uploaded files with 2 columns, one from BEST and the other from deepcut.

And now these cases work!

deepcut.tokenize('วิเคราะห์) ของ ฌากส์ ลากอง') >> ['วิเคราะห์', ')', ' ', 'ของ', ' ', 'ฌากส์ ลากอง']
deepcut.tokenize('1 มิ.ย.-ก.ย. 2540') >> ['1', ' ', 'มิ.ย.', '-', 'ก.ย.', ' ', '2540']

This case fails with the current weight (tokenizing ม.มหิดล fails), and the new weight fixes it pretty well.

deepcut.tokenize('ละวัฒนธรรมเพื่อพัฒนาชนบท ม.มหิดล. หน้า 7') 

From my eyeballing, we should definitely go with the new weight.

sura1997 commented 7 years ago

I'm sorry for the delay. My project is taking a bit longer. The new deadline is by the end of August. After that I will have time to contribute to this project.

rkcosmos commented 7 years ago

@sura1997 It's OK. I have already done the preprocessing part, and it's training now. I'll share the results with everyone once it's done.

rkcosmos commented 7 years ago

After a long procrastination, a new weight has been trained on around 2.2 GB of unlabeled text using semi-supervised learning; the performance is a little better on my local validation set. I also added custom_dictionary support and uploaded everything to PyPI.
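
For anyone trying the new release, usage might look like the sketch below; the keyword name custom_dict and the one-word-per-line dictionary file are assumptions on my part, so please check the README for the exact interface.

# Hypothetical usage of the new custom-dictionary option; the keyword name
# and file format are assumptions, not confirmed in this thread.
import deepcut

tokens = deepcut.tokenize('ตัดคำภาษาไทยด้วย deepcut', custom_dict='my_words.txt')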