kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License

seq2seq based segmentation #85

Open sebastian-nehrdich opened 6 years ago

sebastian-nehrdich commented 6 years ago

Namaste! I am Sebastian, an indologist/buddhologist from Germany who mainly works with Sanskrit and Tibetan bilingual texts (translations of Buddhist material). I am currently thinking about ways to automate sentence matching between Sanskrit and Tibetan, because this is a highly time-consuming task that is frequently necessary for our work. I want to say thank you for the work that has been done here; this project has already produced some very exciting results!

I recently stumbled across this repository and want to link it here, because it seems to be a very interesting approach to the problem: https://github.com/cvikasreddy/skt I have not yet finished training on the data, but it would be interesting to see how that segmenter performs compared to the "traditional" domain-specific approach. As soon as I have some figures I will post them here!

avinashvarna commented 6 years ago

Namaste Sebastian,

Thanks for bringing this interesting project to our attention. Seq2Seq based parsing was on our roadmap, but glad to see that someone has already implemented it and shared the source code so that others can build off of it. Do keep us posted on your progress, and we will try to investigate it as well.

sebastian-nehrdich commented 6 years ago

Namaste, in the next days (or rather weeks) I will throw the reverse-engineered data of the DCS at the Transformer (code taken from https://github.com/OpenNMT/OpenNMT-tf), which is a state-of-the-art implementation of seq2seq learning. I will train it on a single Titan X for ~4 days (that's what I calculated given the amount of data) and upload the results here. If anybody here is interested in helping me, I am still thinking about how to convert the grammatical tags of the reverse-engineered DCS into unique tokens (numbers, for example). That will be my contribution to this problem in this OSS repository.

To keep you up to date: at the moment Dr. Hellwig and I are working on a new segmenter that will incorporate RNNs and is based on the DCS data; the paper might be out somewhere around November if everything goes as expected.

Another thing that came to my mind: sandhi/compound split data for training can also be found in existing etexts. In the past days I found about 6 MB of sandhi- and compound-separated etexts in my etext folder (largely taken from GRETIL and other related collections). If one digs through these sources more thoroughly, I am sure another 5 or 10 MB could surface. That's already something to start with. :)

avinashvarna commented 6 years ago

Namaste, if you are referring to the reverse-engineered DCS data from here or here, I have spent some time looking at the tags, thinking about the same questions you are asking, and have collected some preliminary data in the past. There are only about 300 unique tags in those datasets, so mapping them to unique numbers can be implemented with a simple lookup table (a dict in Python, for example). I will be more than happy to share my local scripts if you need help.
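A minimal sketch of such a lookup table, assuming a hypothetical `dcs_tags.txt` file with one tag per line (the tag string looked up at the end is also just illustrative):

```python
# Minimal sketch: map DCS morphological tags to integer IDs via a dict.
# Assumes a hypothetical file "dcs_tags.txt" with one tag per line.
from collections import Counter

with open("dcs_tags.txt", encoding="utf-8") as f:
    tags = [line.strip() for line in f if line.strip()]

tag_counts = Counter(tags)
# Assign IDs in order of decreasing frequency for a stable, compact mapping.
tag_to_id = {tag: idx for idx, (tag, _) in enumerate(tag_counts.most_common())}
id_to_tag = {idx: tag for tag, idx in tag_to_id.items()}

print(len(tag_to_id), "unique tags")
print(tag_to_id.get("3. Sg. Pres."))  # hypothetical tag string
```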

However, the handling of noun forms derived from verbs in the DCS tags is not ideal (at least in my opinion). They are just marked as past participle, etc., without any information about the linga, vibhakti, or vachana. Are you planning to handle these differently (e.g. perform additional reverse engineering)?

Do you happen to have the links for the sources of the sandhi+compound-separated etexts? If so, please share them, and we will look into incorporating them in the future.

Are you planning to share the results of your research (the RNN segmenter you referred to) eventually? I think it would definitely be useful to the broader community.

sebastian-nehrdich commented 6 years ago

It would be great if you could share the scripts; the same work does not need to be done twice! One source for sandhi-split data is here: http://sanskrit.uohyd.ac.in/Corpus/ Another way is to download the GRETIL archive (http://gretil.sub.uni-goettingen.de/gret_utf.htm#Index), go through it manually, and see whether sandhi/compound splitting has been done or not. That's what I am doing right now; once I am done with that work I can upload the files here.

It should not be a problem to resandhi them. I wrote a small library in emacs-lisp some years ago to reapply sandhi, and I think it works acceptably at this point, so generating the sandhied versions of the text files is not a huge problem in my eyes. Regarding the release of the new segmenter: if the results are good there will certainly be a paper about it, so let's see!
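For anyone unfamiliar with what "reapplying sandhi" involves, here is a toy Python sketch of a few external vowel sandhi rules. This is not the emacs-lisp library mentioned above, and a real implementation needs the full rule set (visarga and consonant sandhi, exceptions, etc.):

```python
# Toy sketch: rejoin a sandhi-split word sequence by applying a handful of
# external vowel sandhi rules (IAST transliteration). Only illustrates the idea.
VOWEL_RULES = {
    ("a", "a"): "ā", ("a", "ā"): "ā", ("ā", "a"): "ā", ("ā", "ā"): "ā",
    ("a", "i"): "e", ("a", "u"): "o",
}

def join_with_sandhi(words):
    out = words[0]
    for word in words[1:]:
        rule = VOWEL_RULES.get((out[-1], word[0]))
        if rule:
            out = out[:-1] + rule + word[1:]   # merge across the word boundary
        else:
            out += " " + word                  # no rule known, keep the space
    return out

print(join_with_sandhi(["na", "asti"]))   # -> nāsti
print(join_with_sandhi(["rāma", "iti"]))  # -> rāmeti
```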

avinashvarna commented 6 years ago

It is just a few lines of code to grab all the tags, but I've uploaded it [here](https://github.com/avinashvarna/dcs_wrapper/blob/master/examples/dcs_all_tags.py). Building a vocabulary from the tags should be fairly easy (e.g. using tools that are already part of OpenNMT-tf).
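Purely as an illustration of the format (OpenNMT-tf ships its own vocabulary-building tools, which should be preferred; the file names here are hypothetical), the collected tags could be written out as a plain one-token-per-line vocabulary file:

```python
# Sketch: dump the collected tags as a one-token-per-line vocabulary file.
# OpenNMT-tf has its own vocabulary tooling; this only shows the idea.
with open("dcs_tags.txt", encoding="utf-8") as f:
    tags = sorted({line.strip() for line in f if line.strip()})

with open("tags.vocab", "w", encoding="utf-8") as f:
    f.write("\n".join(tags) + "\n")
```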

We are already aware of the UoHyd corpus and are using it in our tests, but thanks for mentioning it. If you think of any other data that could be useful, please do let us know.

In addition to the paper, it would be good if you could share the code and data if possible.

avinashvarna commented 6 years ago

I've trained a seq2seq model using openNMT-tf on the training data in the dataset mentioned in @dhamma-basti's original post. This was also on our roadmap (as mentioned in the README).
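For context, segmentation is framed here as a translation problem: the source side is the unsegmented (sandhied) sentence and the target side is the segmented one, usually tokenized at the character level. A minimal preprocessing sketch, assuming a hypothetical tab-separated file of sentence pairs (not necessarily the actual format of the dataset above):

```python
# Minimal sketch: turn (sandhied, split) sentence pairs into character-level
# source/target files for a seq2seq toolkit such as OpenNMT-tf.
# "pairs.tsv" and its tab-separated layout are assumptions for illustration.
def to_char_tokens(sentence):
    # Make the original spaces explicit so they survive tokenization.
    return " ".join("_" if ch == " " else ch for ch in sentence)

with open("pairs.tsv", encoding="utf-8") as f_in, \
     open("train.src", "w", encoding="utf-8") as f_src, \
     open("train.tgt", "w", encoding="utf-8") as f_tgt:
    for line in f_in:
        sandhied, split = line.rstrip("\n").split("\t")
        f_src.write(to_char_tokens(sandhied) + "\n")
        f_tgt.write(to_char_tokens(split) + "\n")
```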

I've already managed to get a BLEU score of 80.85 and a chrF of 0.94 on the test data. This compares very favorably with the scores obtained by the current implementation of the lexical analyzer (https://github.com/kmadathil/sanskrit_parser/issues/93#issuecomment-396025499).

I will upload the code and model soon. The model files are ~120 MB though, and so far my attempts to figure out how to reduce the size have not been fruitful. I need to figure out a good way to share them. I think GitHub allows that size, but it is close to the limit (150 MB).

In the meantime, I will try changing the network and/or vocabulary size and see if I can retain the accuracy with a lower parameter count.

sebastian-nehrdich commented 6 years ago

Namaste,

What kind of model did you use? I worked a lot with the Transformer implementation of OpenNMT. However, it stopped working at some point (I assume there is some miscommunication between TensorFlow and the code from OpenNMT). I couldn't fix that, so I moved the code base to tensor2tensor (which is the reference implementation of the Transformer). That gave by far the best results.

I really would like to rerun that script with the stemming data from the DCS that is publicly available, but so far I didn't have the time to prepare a train/test dataset. In case you have that lying around, I could immediately start training on it. Tensor2tensor is tedious to set up (especially when training with your own data), but I am totally willing to share my code if that helps anybody. Training time on a Maxwell Titan X GPU is about 55 hours for best results (on a dataset of the size of the DCS).

I have seen huge increases in performance by reducing the vocab size, but I couldn't go below a vocab size of 5000. This is due to a bug in tensor2tensor: their code doesn't allow smaller vocab sizes with the internal subword segmenter. However, with an external one like SentencePiece there should be no problem going even lower, but that requires rewiring the configuration of t2t, which I haven't tried out yet, and this whole thing is also quite time-consuming, not to mention that training a Transformer blocks the GPU from doing anything else for days!
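For anyone who wants to try the external subword route, here is a minimal SentencePiece training sketch with a small vocabulary (the input file name and parameter values are illustrative; wiring the resulting model into t2t or OpenNMT is a separate step):

```python
# Sketch: train an external SentencePiece subword model with a small vocabulary
# instead of the tensor2tensor-internal subword segmenter.
# "train.txt" and the parameter values are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=train.txt --model_prefix=skt_sp "
    "--vocab_size=2000 --character_coverage=1.0"
)

sp = spm.SentencePieceProcessor()
sp.Load("skt_sp.model")
print(sp.EncodeAsPieces("dharmakṣetre kurukṣetre"))
```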

codito commented 6 years ago

> I need to figure out a good way to share them. I think GitHub allows that size, but it is close to the limit (150 MB).

I've seen a few tools use GitHub releases for this. We get versioning with releases. E.g. create a repository for the models and attach the models to a new release (for every version). A script could download them locally for use as needed.

See https://github.com/explosion/spacy-models for example.
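A minimal sketch of what such a download script could look like (the repository name and release URL below are hypothetical):

```python
# Sketch: download a versioned model archive attached to a GitHub release,
# similar to how spacy-models distributes its models. The URL is hypothetical.
import requests

MODEL_URL = (
    "https://github.com/example-user/sanskrit_parser_models/"
    "releases/download/v0.1.0/seq2seq_model.zip"
)

def download_model(url, dest):
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)

download_model(MODEL_URL, "seq2seq_model.zip")
```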

avinashvarna commented 6 years ago

I trained a model with a smaller vocabulary, without a significant impact on accuracy (it actually went up by a minuscule amount: BLEU = 83.74 and chrF = 0.96). The exported model is quite small now (~36 MB), so I've just put it in a GitHub repo for now: https://github.com/avinashvarna/sanskrit_nmt/tree/master/sandhi_split/transformer_small_vocab. I've also posted all the scripts and detailed instructions on how to run them, if anyone is interested.

@codito Thanks for the feedback, I will look into GitHub releases if the models get any bigger. Btw, have you used spaCy before? Would it be of use to our project?

@dhamma-basti I initially ran into some problems when I had an old version (1.5.0) of TensorFlow that I had previously installed on the system. Upgrading to the latest release (1.8.0) resolved those problems and I have not encountered any since. The model I trained took < 1 hr to train on a GCE instance with one K80 GPU, but this may depend on the specific GPU. I have not tried training anything with the DCS data yet, but that is on my roadmap. Feel free to use the scripts in the above repo for your experiments if you want.

sebastian-nehrdich commented 6 years ago

Namaste,

I just want to point out that we have now finished our paper on this and decided to make the code and data publicly available: https://github.com/OliverHellwig/sanskrit/tree/master/papers/2018emnlp This is about the best we can achieve for the time being. I think writing an efficient reimplementation of this with the Transformer as a workhorse could bring some improvements (we mention our attempts at this in the paper, but I didn't follow it to its full potential).

I think the current performance is good enough to apply our model even without a GPU. With a Titan X on a Xeon I am able to infer ~200 MB/hour, which is not bad at all. Precision clocks in at about 85% on the sentence level and 96% on the word level. However, this one only does sandhi splitting, no stemming!