shon-otmazgin / fastcoref

MIT License
142 stars 25 forks

LingMessCoref cannot handle long texts #47

Open teowz46 opened 9 months ago

teowz46 commented 9 months ago

Seems like LingMessCoref comes with a max_doc_len of 4096. Is there any way to circumvent this so that it works on documents of any length?

Edit: I am mainly trying to get the character spans of clusters. I tried to overcome this limitation myself by manually chunking the document and piecing together the detected clusters. Each chunk has about 4000 tokens and overlaps the next by 2000 tokens. To piece together the overall clusters, I get the coreference pairs for each chunk and build a graph where these pairs are edges, so that each connected component is a cluster. However, this approach does not really work well because:
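
For reference, the chunk-and-merge idea described in this comment can be sketched roughly as below. This is only an illustration, not the poster's actual code and not part of fastcoref: `chunk_offsets`, `merge_clusters`, and the `predict_clusters` callback are hypothetical names, and tokens are assumed to be whitespace-delimited words rather than model tokens. `predict_clusters` is assumed to take a chunk string and return clusters as lists of chunk-relative `(start, end)` character spans. Linking every mention to the first mention of its cluster yields the same connected components as adding every coreference pair as an edge.

```python
import re
from collections import defaultdict


def chunk_offsets(text, chunk_tokens=4000, overlap_tokens=2000):
    """Yield (start_char, end_char) windows over whitespace-delimited tokens."""
    step = chunk_tokens - overlap_tokens
    if step <= 0:
        raise ValueError("overlap_tokens must be smaller than chunk_tokens")
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    for i in range(0, len(tokens), step):
        window = tokens[i:i + chunk_tokens]
        yield window[0][0], window[-1][1]
        if i + chunk_tokens >= len(tokens):
            break  # the last window already reaches the end of the text


def merge_clusters(text, predict_clusters, chunk_tokens=4000, overlap_tokens=2000):
    """Merge per-chunk clusters into document-level clusters of char spans."""
    parent = {}  # union-find forest keyed by global (start, end) spans

    def find(span):
        parent.setdefault(span, span)
        while parent[span] != span:
            parent[span] = parent[parent[span]]  # path halving
            span = parent[span]
        return span

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for start, end in chunk_offsets(text, chunk_tokens, overlap_tokens):
        for cluster in predict_clusters(text[start:end]):
            # Shift chunk-relative spans to document-level offsets.
            spans = [(start + s, start + e) for s, e in cluster]
            if spans:
                find(spans[0])  # register the first mention as a graph node
            for span in spans[1:]:
                union(spans[0], span)

    groups = defaultdict(list)
    for span in parent:
        groups[find(span)].append(span)
    return sorted(sorted(group) for group in groups.values())
```

With fastcoref, `predict_clusters` could presumably wrap something like `model.predict(texts=[chunk])[0].get_clusters(as_strings=False)`, which returns character spans. Note that two per-chunk clusters are only joined when they share an identical span inside an overlap region.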

v5out commented 4 months ago

Also having this issue, had to fall back to FCoref.