shon-otmazgin / fastcoref

MIT License

overlapping sentences with long texts exceeding max_token_per_batch #37

Open · davidberenstein1957 opened this issue 1 year ago

davidberenstein1957 commented 1 year ago

Hi,

I have worked a lot with coreference on longer texts, and I think overlapping sentences would be a nice addition to make the model more robust for long texts. I would also like to work on this myself.

Regards, David

shon-otmazgin commented 1 year ago

Hello @davidberenstein1957,

Do you mean overlapping sentences so that there is attention across segments? If so, recent work (I think the paper applying BERT to coreference) showed it is not necessary, and it also costs more computation time.

davidberenstein1957 commented 1 year ago

No, overlap as in your entire text might not fit into (GPU) memory.

shon-otmazgin commented 1 year ago

Can you share more details? If you set max_tokens_in_batch to the length of your longest doc in the dataset, is it still OOM?
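
Something along these lines (a rough sketch; `long_document` is just a placeholder, and the right `max_tokens_in_batch` value depends on your data and GPU memory):

```python
from fastcoref import FCoref

model = FCoref(device='cuda:0')

# long_document is a placeholder for your longest text.
# Raise max_tokens_in_batch until a single document fits into one batch.
preds = model.predict(texts=[long_document], max_tokens_in_batch=15000)
print(preds[0].get_clusters())
```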

davidberenstein1957 commented 1 year ago

Similarly, when a text exceeds the transformer's max_tokens length, it could still be useful to take the last x sentences of the current chunk and prepend them to the next chunk. That way the next batch can infer some context from the previous one, and the clusters can later be merged if they contain the same spans.
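
Roughly what I have in mind, as a sketch (the helper below is hypothetical, not part of fastcoref; it only shows the chunking with a sentence overlap):

```python
def chunk_with_overlap(sentences, max_sentences_per_chunk, overlap_sentences):
    """Split a list of sentences into chunks, where each chunk starts with the
    last `overlap_sentences` sentences of the previous chunk as prepended context."""
    step = max_sentences_per_chunk - overlap_sentences
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append((start, sentences[start:start + max_sentences_per_chunk]))
        if start + max_sentences_per_chunk >= len(sentences):
            break
    return chunks

# Each chunk can then be passed to the model separately; the `start` index lets you
# map predicted spans back to positions in the full document before merging clusters.
```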

shon-otmazgin commented 1 year ago

If I understand correctly, you want to overlap between batches? I don't see the benefit of it.

davidberenstein1957 commented 1 year ago

Let's say you have a text of length 3x and the maximum number of tokens in a single pass is 2x. Then it might make sense to pass this text as segments 1:2 and segments 2:3. Afterwards, you could re-align/merge the coref clusters based on the overlapping sentences in segment 2.
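
As a sketch of that merge step (hypothetical helper; clusters are represented as sets of (start, end) spans in full-document offsets, so a mention in the overlapping segment 2 looks the same in both passes):

```python
def merge_overlapping_clusters(clusters_a, clusters_b):
    """Union clusters from two passes when they share at least one span,
    i.e. a mention in the overlapping segment found by both passes."""
    merged = [set(c) for c in clusters_a]
    for cluster in clusters_b:
        cluster = set(cluster)
        hits = [m for m in merged if m & cluster]
        for m in hits:
            cluster |= m
            merged.remove(m)
        merged.append(cluster)
    return merged

# Toy example for the 3x-long text: pass A covers segments 1:2, pass B covers 2:3.
pass_a = [{(0, 5), (40, 42)}]   # e.g. "David" ... "he" (mention in segment 2)
pass_b = [{(40, 42), (80, 83)}] # e.g. "he" (same segment 2 mention) ... "him"
print(merge_overlapping_clusters(pass_a, pass_b))
# [{(0, 5), (40, 42), (80, 83)}]
```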