machinalis / yalign

A sentence aligner for comparable corpora

Problem with yalign-train #3

Closed kamwolk closed 9 years ago

kamwolk commented 10 years ago

Hello,

I am writing to ask for your help. It is very important for me to gain some additional data for my MT systems. Your tool seems great, but it does not work for me.

I installed following the instructions at http://yalign.readthedocs.org/en/latest/installation.html#installing-from-pypi.

With the files from the tutorial everything works, but not with other data.

I took a dictionary from the phrase table of my MT system and used the OpenSubtitles 2012 corpora from the OPUS project that you recommended in the tutorial.
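Roughly, the extraction looked like this (just a sketch; the phrase-table column order and the exact dictionary format yalign expects are assumptions here, please check the docs):

```python
# Sketch: build a word-translation dictionary from a Moses phrase table.
# Assumptions: the phrase table has " ||| "-separated fields
# (source ||| target ||| scores ...), and yalign accepts a pickled
# {source_word: {target_word: probability}} mapping.
import pickle
from collections import defaultdict

dictionary = defaultdict(dict)
with open("phrase-table.txt", encoding="utf-8") as table:
    for line in table:
        fields = line.split(" ||| ")
        source, target = fields[0].strip(), fields[1].strip()
        score = float(fields[2].split()[0])  # first score column (assumed)
        # Keep only single-word entries: the dictionary is word-to-word.
        if " " not in source and " " not in target:
            dictionary[source][target] = max(
                score, dictionary[source].get(target, 0.0))

with open("dictionary.pickle", "wb") as out:
    pickle.dump(dict(dictionary), out)
```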

yalign-train runs for some time (2-3 minutes), then it gets Killed. I have no idea why. Is there any way to display what causes the error?

Really hope you can help me out.

rafacarrascosa commented 10 years ago

Are you still having this problem? In case you do, could you share step-by-step instructions to replicate the issue?

kamwolk commented 10 years ago

Hi,

I found the problem: some strange symbols in my language were interrupting tokenization.
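For anyone hitting the same issue, a minimal pre-cleaning step along these lines should help (a sketch only; which characters count as "strange" depends on the corpus, so adjust the filter):

```python
# Sketch: strip control characters that can break tokenizers,
# keeping normal whitespace.
import unicodedata

def clean_line(line):
    return "".join(
        ch for ch in line
        if ch in "\t\n" or unicodedata.category(ch)[0] != "C"
    )

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus.clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(clean_line(line))
```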

Now I have some other problems. First, if I use more than 50MB as training data, it takes a very long time. Is that normal?

Secondly, do you have any experience with extraction from Wikipedia?

Best regards


rafacarrascosa commented 10 years ago

If you have a patch for the tokenization fix, I would be happy to accept it into Yalign.

About the time: Yes, it's normal, training takes a lot of time.

About extraction from Wikipedia: kind of, yeah; we evaluated Yalign against Wikipedia while we were developing it...

kamwolk commented 10 years ago

Instead, I used this pipeline of scripts for tokenization:

https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/pre-tokenizer.perl
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl

Those are very easy to use and multilingual.
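For example, to run one file through the tokenizer from Python (a sketch; it assumes tokenizer.perl reads stdin, writes stdout, and takes a -l language flag, which is how the Moses scripts are normally invoked, but check your checkout):

```python
# Sketch: run a text file through the Moses tokenizer.
# Assumes mosesdecoder is checked out next to this script.
import subprocess

SCRIPTS = "mosesdecoder/scripts/tokenizer"

def tokenize(in_path, out_path, lang):
    with open(in_path) as src, open(out_path, "w") as dst:
        subprocess.run(
            ["perl", f"{SCRIPTS}/tokenizer.perl", "-l", lang],
            stdin=src, stdout=dst, check=True,
        )

tokenize("corpus.pl.txt", "corpus.pl.tok", "pl")
```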

I created a script that automatically downloads parallel Wiki articles in any language, does some initial cleaning, and runs alignment in parallel over files loaded from a folder, for much faster processing. If you wish, I can share it.
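The parallel part is essentially a process pool over the yalign-align command, something like this (a sketch; the exact yalign-align arguments and output behavior are assumptions, see yalign-align --help):

```python
# Sketch: align many document pairs in parallel by running the
# yalign-align command in a process pool.
import subprocess
from concurrent.futures import ProcessPoolExecutor

MODEL = "en-pl-model"  # hypothetical model folder

def align(pair):
    doc_a, doc_b, out_path = pair
    with open(out_path, "w") as out:
        subprocess.run(["yalign-align", MODEL, doc_a, doc_b],
                       stdout=out, check=True)

pairs = [("wiki/en/0001.txt", "wiki/pl/0001.txt", "out/0001.txt"),
         ("wiki/en/0002.txt", "wiki/pl/0002.txt", "out/0002.txt")]

with ProcessPoolExecutor() as pool:
    list(pool.map(align, pairs))
```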

Regarding Wikipedia, for me it does not work too well. Maybe it is a problem with the data I am training on? I tried TED Lectures. Maybe some other corpora would be better for this task? Can you recommend any? Did you get any good results on that data, or can you recommend any other good source? I was thinking of something like BootCaT (downloading data by keywords in parallel), but I am not sure it would work in this scenario. What is your opinion?


rafacarrascosa commented 10 years ago

Regarding Wikipedia, for me it does not work too well

You mean the quality of the alignments? There's a tradeoff (precision vs recall) that you can tweak by hand if you want. In the model folder there is a file called metadata.json that has two configuration variables controlling this tradeoff.

Both of these parameters are selected automatically during training, but if the tradeoff you are seeing is not what you were looking for, you can simply edit metadata.json and change these values at will.
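For example, something like this (a sketch; "threshold" is an assumed key name for illustration, use the two variables you actually find in your metadata.json):

```python
# Sketch: nudge the model's precision/recall tradeoff by editing
# metadata.json. The "threshold" key is assumed for illustration.
import json

with open("en-pl-model/metadata.json") as f:
    metadata = json.load(f)

metadata["threshold"] *= 0.9  # e.g. accept more alignments (more recall)

with open("en-pl-model/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```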

Did you get any good results on that data, or can you recommend any other good source?

Our tests did OK with Wikipedia: we were getting between 80% and 95% precision depending on the article (we didn't measure recall because it's expensive). And at first sight, aligning movie subtitles was even better (but we didn't evaluate that quantitatively).

I was thinking of something like BootCaT

Nice tool, I didn't know about it.