shon-otmazgin / fastcoref

MIT License

LingMessCoref cannot handle long texts #47

Open teowz46 opened 11 months ago

teowz46 commented 11 months ago

It seems LingMessCoref comes with a max_doc_len of 4096. Is there any way to work around this so that it works on documents of any length?

Edit: I am mainly trying to get the character spans of clusters. I tried to work around this limitation myself by manually chunking the document and piecing together the detected clusters. Each chunk has about 4000 tokens and overlaps the next one by 2000 tokens. To build the overall clusters, I get the coreference pairs for each chunk and build a graph in which these pairs are edges; each connected component is then a cluster. However, this approach does not really work well because:
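
For readers following along, the chunk-and-merge idea described in this comment could be sketched roughly as below. This is only an illustration, not code from fastcoref: `chunk_offsets` and `merge_clusters` are hypothetical helper names, and the sketch assumes each chunk's clusters are already available as lists of chunk-relative `(start, end)` character spans together with the chunk's character offset in the full document. Mentions are linked across chunks only when their global spans match exactly.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def chunk_offsets(n_tokens: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token windows of length `size` sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + size, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return windows


def merge_clusters(chunk_clusters: List[Tuple[int, List[List[Span]]]]) -> List[List[Span]]:
    """Merge per-chunk clusters into document-level clusters.

    `chunk_clusters` holds (char_offset_of_chunk, clusters) pairs, where the
    spans inside `clusters` are relative to the chunk (an assumption of this
    sketch). Spans are treated as graph nodes, co-clustered spans as edges,
    and each connected component becomes one merged cluster.
    """
    parent: Dict[Span, Span] = {}

    def find(x: Span) -> Span:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: Span, b: Span) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for offset, clusters in chunk_clusters:
        for cluster in clusters:
            # Shift chunk-relative spans to document-level offsets.
            spans = [(s + offset, e + offset) for s, e in cluster]
            for span in spans:
                find(span)  # register the node
            for other in spans[1:]:
                union(spans[0], other)

    groups = defaultdict(list)
    for span in list(parent):
        groups[find(span)].append(span)
    return sorted(sorted(group) for group in groups.values())
```

With 4000-token chunks and a 2000-token overlap, `chunk_offsets(10000, 4000, 2000)` yields four windows. Because connected components merge transitively, a single incorrect pair in any chunk joins two otherwise separate clusters, so this kind of merging is sensitive to per-chunk errors.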

v5out commented 7 months ago

I'm also having this issue and had to fall back to FCoref.