groverjeenu / Bilingual-Word-Embeddings-with-Bucketed-CNN-for-Parallel-Sentence-Extraction

Code for our paper in ACL 2017
MIT License

Transferring code to a different problem #2

Open · sebastian-nehrdich opened this issue 6 years ago

sebastian-nehrdich commented 6 years ago

Hello there,

My name is Sebastian, and I am currently working on a system that can find parallel sentences across Sanskrit and Tibetan. I already have segmented texts for both languages and have created word vectors for the two languages that are aligned to each other. Apart from the aligned word vectors and some larger dictionaries, I do not have any parallel data. Do you think your algorithm can be applied to my problem? I have already looked at the code, but it looks rather complicated to me. With best wishes, and thank you for any suggestions,

Sebastian

groverjeenu commented 6 years ago

Hi Sebastian,

Since mine is a supervised approach, you would need to have some aligned data for training the CNNs.

I hope that helps.

Regards,

Jeenu Grover
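
For concreteness, here is a minimal sketch of what "aligned data for training the CNNs" amounts to: labeled (source sentence, target sentence, parallel/non-parallel) pairs fed into a sentence-pair classifier. This is an illustrative Keras sketch, not the repository's actual code; the dimensions, layer sizes, and input names are assumptions:

```python
# Illustrative sketch (not the repository's code) of a supervised
# sentence-pair classifier: each example is (src sentence, tgt sentence,
# label), with label = 1 for parallel pairs and 0 for non-parallel ones.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

EMB_DIM = 100  # dimensionality of the pre-aligned bilingual embeddings
MAX_LEN = 50   # sentences padded/truncated to this many tokens

def build_pair_classifier():
    # Two inputs: sequences of embedding vectors for each language.
    src = keras.Input(shape=(MAX_LEN, EMB_DIM), name="sanskrit")
    tgt = keras.Input(shape=(MAX_LEN, EMB_DIM), name="tibetan")

    # A shared 1-D convolution + max-pooling turns each sentence into a
    # fixed-size vector; sharing is reasonable here because the word
    # embeddings of both languages already live in one aligned space.
    conv = layers.Conv1D(128, kernel_size=3, activation="relu")
    pool = layers.GlobalMaxPooling1D()
    src_vec = pool(conv(src))
    tgt_vec = pool(conv(tgt))

    # Classify the concatenated pair representation.
    merged = layers.Concatenate()([src_vec, tgt_vec])
    hidden = layers.Dense(64, activation="relu")(merged)
    label = layers.Dense(1, activation="sigmoid")(hidden)

    model = keras.Model([src, tgt], label)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Dummy data just to show the expected shapes: 1000 pairs, half parallel.
X_src = np.random.randn(1000, MAX_LEN, EMB_DIM).astype("float32")
X_tgt = np.random.randn(1000, MAX_LEN, EMB_DIM).astype("float32")
y = np.concatenate([np.ones(500), np.zeros(500)]).astype("float32")

model = build_pair_classifier()
model.fit([X_src, X_tgt], y, batch_size=32, epochs=1)
```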


sebastian-nehrdich commented 6 years ago

Hello Jeenu,

Thank you for your reply! Parallel resources are scarce; the best we can get is a few thousand aligned sentences, probably still fewer than 20k, I am afraid. Do you think something can be done with such a small amount, or is it rather hopeless? With best wishes,

Sebastian


groverjeenu commented 6 years ago

20k is a rather good size for training a CNN, I believe. You can try using it.
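
For scale: with a few negative samples per positive pair, even ~20k aligned sentences expand into a reasonably sized training set. A rough sketch of that recipe (a common construction for this kind of classifier, not necessarily the repository's exact sampling scheme):

```python
import random

def make_training_pairs(src_sents, tgt_sents, neg_per_pos=5, seed=0):
    """Build (src, tgt, label) triples from aligned sentence lists.

    src_sents[i] and tgt_sents[i] are assumed to be translations of each
    other. Each positive pair is complemented by `neg_per_pos` negatives
    made by pairing the source sentence with a wrong target sentence.
    """
    assert len(src_sents) == len(tgt_sents)
    rng = random.Random(seed)
    n = len(src_sents)
    examples = []
    for i in range(n):
        examples.append((src_sents[i], tgt_sents[i], 1))  # positive
        for _ in range(neg_per_pos):
            j = rng.randrange(n)
            while j == i:  # avoid accidentally sampling the true pair
                j = rng.randrange(n)
            examples.append((src_sents[i], tgt_sents[j], 0))  # negative
    rng.shuffle(examples)
    return examples

# 20k positives with 5 negatives each yields 120k training examples.
```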


sebastian-nehrdich commented 6 years ago

That sounds great. I am currently in the process of scraping the data. Do you have any idea how well the CNN can deal with noise in this application? Do you think it is better to go with a smaller dataset and check it very carefully, or to have a big dataset and tolerate that here and there something will not be perfect? After revising the data carefully, I found that only about 9,000 sentence pairs are immediately useful; the rest have to be revised. Poetry is especially problematic, because the Tibetan translations tend to be very liberal there. I also wonder whether the code has to be tuned to the characteristics of Sanskrit and Tibetan, since they are quite different from English/German after all.
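
One cheap heuristic for triaging noisy scraped pairs before training — using only the already-aligned word vectors, nothing specific to this repository — is to score each candidate pair by the cosine similarity of its averaged sentence embeddings and hand-check only the low scorers. A minimal sketch (the embedding dictionaries `sa_emb` and `bo_emb` are hypothetical names):

```python
import numpy as np

def sent_vec(tokens, emb):
    """Average the aligned word vectors of the tokens found in `emb`."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return np.mean(vecs, axis=0)

def pair_score(src_tokens, tgt_tokens, src_emb, tgt_emb):
    """Cosine similarity of averaged embeddings; None if all tokens OOV."""
    s = sent_vec(src_tokens, src_emb)
    t = sent_vec(tgt_tokens, tgt_emb)
    if s is None or t is None:
        return None
    return float(np.dot(s, t) / (np.linalg.norm(s) * np.linalg.norm(t)))

# Rank the scraped pairs and send the lowest-scoring ones for review:
# scores = [pair_score(s, t, sa_emb, bo_emb) for s, t in pairs]
```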

groverjeenu commented 6 years ago

Hi Sebastian,

I can't say much about that; only the experiments will be able to tell.

Regards, Jeenu Grover
