naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

A new fork to implement dataset triplets with doc ids instead of doc texts #40

Closed monilouise closed 5 months ago

cadurosar commented 1 year ago

Hello, this seems pretty cool, thanks for the contribution! We are doing a major refactoring of the code (https://github.com/naver/splade/tree/hf/splade/hf) that seems to go in the same direction as what you are doing here. It will allow multiple ways of loading negatives/triplets (TREC file, JSON, pickled dictionaries...) and also allows training with more negatives per query in the same batch, which has been seen to improve performance. It also shifts the training logic entirely to Hugging Face, so this PR would probably not integrate very well with it.

Could you please take a look and see whether it matches your needs? If it doesn't, I will look into merging this PR.

monilouise commented 1 year ago

Hi,

As far as I understand, the new hf branch still uses the same string-based schema for triplet files. But we need the triplets to be represented by IDs, due to memory constraints. So it seems we still need the merge.

Thanks.


thibault-formal commented 5 months ago

Hey, sorry for the long delay. The new HF training code allows you to use triplet files in this ID-based format, so I am closing this PR. Thibault
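
The memory argument behind ID-based triplets can be illustrated with a minimal sketch. This is not the code from the PR or from the hf branch: the tab-separated file layout, the function names `load_collection` and `load_id_triplets`, the class `IdTripletDataset`, and the sample texts are all assumptions made for illustration. The idea is that each query and document text is stored once in a lookup table, while every triplet holds only three short IDs and the texts are resolved when an item is requested.

```python
import io
from typing import Dict, List, Tuple


def load_collection(tsv_file) -> Dict[str, str]:
    """Read `id<TAB>text` lines into a dict, so each text is kept in memory once."""
    collection = {}
    for line in tsv_file:
        line = line.rstrip("\n")
        if not line:
            continue
        item_id, text = line.split("\t", 1)
        collection[item_id] = text
    return collection


def load_id_triplets(tsv_file) -> List[Tuple[str, str, str]]:
    """Read `query_id<TAB>positive_id<TAB>negative_id` lines as ID tuples."""
    triplets = []
    for line in tsv_file:
        line = line.rstrip("\n")
        if not line:
            continue
        query_id, pos_id, neg_id = line.split("\t")
        triplets.append((query_id, pos_id, neg_id))
    return triplets


class IdTripletDataset:
    """Map-style dataset (hypothetical) that resolves IDs to texts on access."""

    def __init__(self, triplets, queries, docs):
        self.triplets = triplets
        self.queries = queries
        self.docs = docs

    def __len__(self) -> int:
        return len(self.triplets)

    def __getitem__(self, idx: int) -> Tuple[str, str, str]:
        query_id, pos_id, neg_id = self.triplets[idx]
        return self.queries[query_id], self.docs[pos_id], self.docs[neg_id]


# Tiny in-memory stand-ins for the three files (placeholder data).
queries = load_collection(io.StringIO("q1\twhat is splade\n"))
docs = load_collection(io.StringIO(
    "d1\tSPLADE is a sparse neural retriever\n"
    "d2\tUnrelated passage\n"
    "d3\tAnother unrelated passage\n"
))
triplets = load_id_triplets(io.StringIO("q1\td1\td2\nq1\td1\td3\n"))

dataset = IdTripletDataset(triplets, queries, docs)
query_text, positive_text, negative_text = dataset[1]
print(query_text, "|", positive_text, "|", negative_text)
```

In a string-based triplet file the positive passage for `q1` would be repeated in every triplet that uses it. Here it appears once in `docs`, which is where the memory saving comes from when a query has many negatives.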