pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.51k stars, 810 forks

Hard negatives for Translation ranking task #686

Closed thak123 closed 4 years ago

thak123 commented 4 years ago

I am trying to implement the bi-text retrieval setup from Guo et al. (2018) for the translation ranking task.

I want to sample hard negatives for the translation pairs.

I was wondering whether this is easily achievable using torchtext.

Any pointers would be very helpful.

zhangguanheng66 commented 4 years ago

I'm not familiar with the topic. Could you explain it in more detail?

It should be noted that the translation datasets in torchtext still use the old abstraction. We plan to rewrite them in the next release. Without the new abstraction, you probably don't have a lot of flexibility here. But we could still try to help.

thak123 commented 4 years ago

Okay, here it goes. Assume Batch.src and Batch.trg as provided by the torchtext example with Multi30k as the dataset.

From source to target there is a one-to-one correspondence within a given sample: [x_1, ..., x_n] <-> [y_1, ..., y_m]

where x and y are the source and target sentences fed into the ML system.

The aim is to convert the dataset into [ [[x_1, ..., x_n], [y_1, ..., y_m], 1], [[x_1, ..., x_j], [y_1, ..., y_m], 0] ], where the label (last column) denotes whether the two strings are translations of each other. 1 means they are, and 0 means it is a negative sample. Hard versus soft negatives is a separate question, which can be skipped for now.

So, given a batch of a few sentences, how can I create this type of representation?

zhangguanheng66 commented 4 years ago

@thak123 Thanks for the clarification.

For the translation pairs (label "1"), you get them directly from the dataset. For the non-translation pairs (label "0"), you could save the original samples in a list and shuffle them so that sources end up paired with random targets. Then you could label those pairs with "0".
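A minimal sketch of that shuffling idea in plain Python, assuming the batch has already been converted into two parallel lists of tokenized sentences. The function name `make_ranking_examples` and its `seed` parameter are hypothetical and not part of torchtext. It yields random (soft) negatives only; hard-negative mining is not covered.

```python
import random


def make_ranking_examples(src_sentences, trg_sentences, seed=0):
    """Build [src, trg, label] triples from parallel sentence lists.

    Hypothetical helper, not a torchtext API. Label 1 marks a true
    translation pair; label 0 marks a randomly mismatched pair.
    """
    if len(src_sentences) != len(trg_sentences):
        raise ValueError("source and target lists must have the same length")

    n = len(src_sentences)
    # Positive pairs come straight from the aligned data.
    examples = [[s, t, 1] for s, t in zip(src_sentences, trg_sentences)]
    if n < 2:
        return examples  # a mismatched pair needs at least two sentences

    rng = random.Random(seed)
    # Shuffle the target indices, re-drawing until no index stays in
    # place, so a source is never paired with its own target position.
    perm = list(range(n))
    while any(i == j for i, j in enumerate(perm)):
        rng.shuffle(perm)

    # Negative pairs: each source with the target from a different position.
    examples += [[src_sentences[i], trg_sentences[j], 0]
                 for i, j in enumerate(perm)]
    return examples


if __name__ == "__main__":
    src = [["ein", "hund"], ["eine", "katze"], ["ein", "vogel"]]
    trg = [["a", "dog"], ["a", "cat"], ["a", "bird"]]
    for example in make_ranking_examples(src, trg):
        print(example)
```

This only guards against pairing a source with the target at its own position; if the data contains duplicate sentences, a "0" pair could still be a valid translation, so deduplicating first may be worthwhile.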

Not sure if it helps or not. :)

thak123 commented 4 years ago

@zhangguanheng66 Thanks for the advice. I am planning to implement a custom solution for this.