Open bittlingmayer opened 8 years ago
Hi Adam, I'm afraid not. For Spanish to English you will need a file with the inverse probabilities. Regards,
Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield
Date: Tue, 3 May 2016 02:53:27 -0700 From: notifications@github.com To: questplusplus@noreply.github.com CC: Subject: [ghpaetzold/questplusplus] Are the Giza resource files bi-directional? (#30)
Are the probabilities in a file like lang_resources/giza/lex.e2s meant to be used in both directions (English to Spanish AND Spanish to English)?
Just to be clear, can I derive my lex.s2e file from the lex.e2s file (i.e. swap columns 0 and 1, re-order...)? Do the probabilities still make sense that way?
Or would I need to generate it from the same parallel corpora by running the Giza++ pipeline?
And would the project welcome such a file?
I am not entirely familiar with the nature of the probabilities produced by GIZA, but my guess is that they will not be the same in both directions. I suppose you could use the reversed probabilities for Spanish-English, but I cannot guarantee that this is the best way to maximize performance.
Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield
(I haven't touched Giza in 10 years!) Is there any guidance on how that file was generated? And what the file extension would have been?
I know how you feel. :) You can find a tutorial I wrote here: http://www.opentag.com/okapi/wiki/index.php?title=GIZA%2B%2B_Installation_and_Running_Tutorial It teaches you how to create an Alignment Probability File, which is exactly what you need. Hope it helps!
Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield
Hi,
The full procedure to get the lex files is given at the following link: http://www.statmt.org/moses/?n=FactoredTraining.HomePage
These files are produced at step 4 (i.e. both the source-to-target and the target-to-source lex files).
Best, Kashif
Hi Adam,
You cannot derive lex.s2e from lex.e2s; you need to re-run GIZA in the inverse direction.
Lucia
Lucia www.dcs.shef.ac.uk/~lucia/
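To illustrate why swapping the columns does not give the inverse table, here is a minimal sketch. It assumes the probabilities are relative-frequency estimates from aligned word-pair counts; GIZA's own tables come out of EM training, so this is only an analogy. The counts and the names `conditional`, `row_sums` and `renormalize` are made up for the example.

```python
from collections import defaultdict

def conditional(counts, given_index):
    """Relative-frequency estimate p(other | given) from word-pair counts.

    The result is keyed (given_word, other_word).
    """
    totals = defaultdict(float)
    for pair, c in counts.items():
        totals[pair[given_index]] += c
    table = {}
    for pair, c in counts.items():
        given, other = pair[given_index], pair[1 - given_index]
        table[(given, other)] = c / totals[given]
    return table

def row_sums(table):
    """Sum the probabilities for each conditioning word."""
    sums = defaultdict(float)
    for (given, _), p in table.items():
        sums[given] += p
    return dict(sums)

def renormalize(table):
    """Rescale each conditioning word's entries so that they sum to 1."""
    sums = row_sums(table)
    return {(g, o): p / sums[g] for (g, o), p in table.items()}

# Hypothetical aligned-pair counts, keyed (english, spanish).
counts = {("house", "casa"): 8, ("home", "casa"): 2, ("home", "hogar"): 2}

e2s = conditional(counts, 0)  # p(spanish | english), keyed (english, spanish)
s2e = conditional(counts, 1)  # p(english | spanish), keyed (spanish, english)

# "Reverse columns 0 and 1": same numbers, keys flipped.
swapped = {(es, en): p for (en, es), p in e2s.items()}

print("true s2e:      ", s2e)
print("swapped e2s:   ", swapped)
print("swapped sums:  ", row_sums(swapped))
print("renormalized:  ", renormalize(swapped))
```

In this toy case the swapped table gives "casa" a total mass of 1.5 and "hogar" only 0.5, and even after renormalizing, p(house | casa) comes out as about 0.67 instead of 0.8. The marginal counts needed to get from one direction to the other are simply not in the file.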
Thanks all very much for all the guidance.
@ghpaetzold In Step 6, I think -T and -S need to be swapped, no?
I also found it necessary to create a co-occurrence file:
run snt2cooc.out [source].vcb [target].vcb [source_target].snt > cooc.cooc
and then add to Step 6:
-CoocurrenceFile cooc.cooc
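For anyone scripting this for both directions, a rough sketch of how the two commands could be assembled. The file naming (`<lang>.vcb`, `<src>_<tgt>.snt`, `<src>_<tgt>.cooc`) and the helper name `giza_commands` are assumptions for illustration, and `-C` for the sentence file is how I remember the GIZA++ option, so check it against your build. Nothing is executed here; the script only prints the command lines.

```python
import shlex

def giza_commands(src, tgt):
    """Build argv lists for one direction (src -> tgt).

    Returns (cooc_cmd, cooc_path, giza_cmd). The stdout of cooc_cmd is
    meant to be redirected into cooc_path before giza_cmd is run.
    File names are hypothetical; adjust them to your own corpus files.
    """
    snt = f"{src}_{tgt}.snt"
    cooc_path = f"{src}_{tgt}.cooc"
    cooc_cmd = ["snt2cooc.out", f"{src}.vcb", f"{tgt}.vcb", snt]
    giza_cmd = [
        "GIZA++",
        "-S", f"{src}.vcb",              # source vocabulary
        "-T", f"{tgt}.vcb",              # target vocabulary
        "-C", snt,                       # sentence-pair file (assumed flag)
        "-CoocurrenceFile", cooc_path,   # spelling as used by GIZA++
    ]
    return cooc_cmd, cooc_path, giza_cmd

# One run per direction: the inverse table needs its own GIZA++ run.
for src, tgt in [("english", "spanish"), ("spanish", "english")]:
    cooc_cmd, cooc_path, giza_cmd = giza_commands(src, tgt)
    print(shlex.join(cooc_cmd), ">", cooc_path)
    print(shlex.join(giza_cmd))
```

To actually run it, each list could be passed to `subprocess.run`, with the first command's stdout written to `cooc_path`.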
Thanks Mathias, we will be revising our tutorial. :)
Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield
In case others want to skip repeating the Giza step altogether, I did (re-)find:
http://www.quest.dcs.shef.ac.uk/quest_files/lex.s2e
http://www.quest.dcs.shef.ac.uk/quest_files/lex.e2s
There is also German in the dir: http://www.quest.dcs.shef.ac.uk/quest_files/
For most EU languages there is http://metashare.tilde.com/repository/search/?q=Probabilistic+bilingual+dictionaries
(English-Spanish happens not to be working there, though; I've contacted them.)
On the WMT16 webpage you can also find more resources (including Giza tables) for technical-domain data: http://www.statmt.org/wmt16/quality-estimation-task.html
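If you pick up a ready-made lex file and are unsure which direction it encodes, one quick sanity check is to see which word column the probabilities sum to roughly 1 over. A minimal sketch, assuming each line has three whitespace-separated fields (word, word, probability); that layout is an assumption here, so check it against the actual file. The function names and the sample lines are made up.

```python
import io
from collections import defaultdict

def column_sums(lines):
    """Sum the probability field grouped by column 0 and by column 1."""
    by_col0 = defaultdict(float)
    by_col1 = defaultdict(float)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        w0, w1, prob = parts[0], parts[1], float(parts[2])
        by_col0[w0] += prob
        by_col1[w1] += prob
    return dict(by_col0), dict(by_col1)

def mean_abs_dev_from_one(sums):
    """Average distance of the per-word sums from 1.0."""
    return sum(abs(s - 1.0) for s in sums.values()) / len(sums)

# Made-up sample standing in for a real file; with a real one, use
# open("lex.s2e", encoding="utf-8") instead of io.StringIO.
sample = io.StringIO("casa house 0.8\ncasa home 0.2\nhogar home 1.0\n")
by0, by1 = column_sums(sample)

dev0 = mean_abs_dev_from_one(by0)
dev1 = mean_abs_dev_from_one(by1)
print("deviation when conditioning on column 0:", dev0)
print("deviation when conditioning on column 1:", dev1)
print("conditioning column is probably", 0 if dev0 <= dev1 else 1)
```

Real tables may be pruned or contain NULL entries, so expect the sums to be only approximately 1 even for the correct column.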