ghpaetzold / questplusplus

Pipelined quality estimation.

Are the Giza resource files bi-directional? #30

Open bittlingmayer opened 8 years ago

bittlingmayer commented 8 years ago

Are the probabilities in a file like lang_resources/giza/lex.e2s meant to be used in both directions (English to Spanish AND Spanish to English)?

ghpaetzold commented 8 years ago

Hi Adam, I'm afraid not. For Spanish to English you will need a file with the inverse probabilities. Regards,


Gustavo Henrique Paetzold, Ph.D. Candidate in Computer Science, University of Sheffield

bittlingmayer commented 8 years ago

Just to be clear, can I derive my lex.s2e file from the lex.e2s file (i.e. reverse columns 0 and 1, re-order...)? Do the probabilities still make sense that way?

Or would I need to generate it from the same parallel corpora by running the Giza++ pipeline?

And would the project welcome such a file?

ghpaetzold commented 8 years ago

I am not entirely familiar with the nature of probabilities produced by GIZA, but my guess is that the probabilities will not end up being the same for both directions. I suppose you could use the reverse probabilities for Spanish - English, but I cannot ensure that this will be the most prudent course of action to take in order to maximize performance.


bittlingmayer commented 8 years ago

(I haven't touched Giza in 10 years!) Is there any guidance on how that file was generated? And what the file extension would have been?

ghpaetzold commented 8 years ago

I know how you feel. :) You can find a tutorial I wrote here: http://www.opentag.com/okapi/wiki/index.php?title=GIZA%2B%2B_Installation_and_Running_Tutorial. It teaches you how to create an Alignment Probability File, which is exactly what you need. Hope it helps!


kashifshah commented 8 years ago

Hi,

The full procedure for producing the lex files is given at the following link: http://www.statmt.org/moses/?n=FactoredTraining.HomePage

Both lex files (source-to-target and target-to-source) are produced at step 4.

Best, Kashif


lspecia commented 8 years ago

Hi Adam,

You cannot derive lex.s2e from lex.e2s; you need to re-run GIZA in the inverse direction.

Lucia

Lucia www.dcs.shef.ac.uk/~lucia/
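
One way to see why swapping the columns is not enough: assuming a lex file holds conditional probabilities (each row giving the probability of one word given the other), the swapped table is no longer normalized, and a proper inversion via Bayes' rule needs word priors that the file does not contain. The sketch below uses an invented toy table; the words, values, priors, and function names are all hypothetical and are not taken from any real lex.e2s file or from GIZA++ itself.

```python
from collections import defaultdict

def row_sums(table):
    """Sum the probabilities for each conditioning word (first key element)."""
    sums = defaultdict(float)
    for (given, _outcome), prob in table.items():
        sums[given] += prob
    return dict(sums)

def swap_columns(table):
    """Naive 'inversion': swap the two word columns, keep the numbers as-is."""
    return {(outcome, given): prob for (given, outcome), prob in table.items()}

def invert_with_prior(table, prior):
    """Bayes' rule: p(given | outcome) is proportional to p(outcome | given) * p(given)."""
    joint = {key: prob * prior[key[0]] for key, prob in table.items()}
    marginal = defaultdict(float)
    for (_given, outcome), mass in joint.items():
        marginal[outcome] += mass
    return {(outcome, given): mass / marginal[outcome]
            for (given, outcome), mass in joint.items()}

# Invented toy table: keys are (conditioning word, outcome word).
forward = {
    ("house", "casa"): 0.9,
    ("house", "hogar"): 0.1,
    ("home", "casa"): 0.4,
    ("home", "hogar"): 0.6,
}

swapped = swap_columns(forward)
inverse_uniform = invert_with_prior(forward, {"house": 0.5, "home": 0.5})
inverse_skewed = invert_with_prior(forward, {"house": 0.9, "home": 0.1})

print("forward row sums:", row_sums(forward))   # each row sums to about 1
print("swapped row sums:", row_sums(swapped))   # rows no longer sum to 1
print("Bayes, uniform prior:", inverse_uniform[("casa", "house")])
print("Bayes, skewed prior: ", inverse_skewed[("casa", "house")])
```

The two Bayes results differ, which illustrates that the forward table alone does not determine the inverse one. Re-running the alignment in the other direction, as suggested above, estimates the inverse table directly from the parallel corpus.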

bittlingmayer commented 8 years ago

Thanks, all, very much for the guidance.

@ghpaetzold In Step 6, I think -T and -S need to be swapped, no?

I also found it necessary to create a co-occurrence file:

run snt2cooc.out [source].vcb [target].vcb [source_target].snt > cooc.cooc

and then add to Step 6:

-CoocurrenceFile cooc.cooc

ghpaetzold commented 8 years ago

Thanks Mathias, we will be revising our tutorial. :)



bittlingmayer commented 8 years ago

In case others want to skip repeating the Giza step altogether, I did (re-)find:

http://www.quest.dcs.shef.ac.uk/quest_files/lex.s2e
http://www.quest.dcs.shef.ac.uk/quest_files/lex.e2s

There is also German in the same directory: http://www.quest.dcs.shef.ac.uk/quest_files/

For most EU languages there is http://metashare.tilde.com/repository/search/?q=Probabilistic+bilingual+dictionaries

(English-Spanish happens not to be working there, though; I've contacted them.)

carolscarton commented 8 years ago

On the WMT16 webpage you can also find more resources (including Giza tables) for technical domain data: http://www.statmt.org/wmt16/quality-estimation-task.html