ropensci / textreuse

Detect text reuse and document similarity
https://docs.ropensci.org/textreuse

Inconsistent skipping behavior in TextReuseCorpus #90

Open tylerandrewscott opened 4 years ago

tylerandrewscott commented 4 years ago

I am encountering an issue with the TextReuseCorpus function. When I feed in a vector of texts (using the "text = " option in the function), I: (1) receive warnings that texts were skipped for insufficient length, even though the character strings should be long enough; and (2) get a different number of skip warnings each time. I am reading in a large vector (>300,000) of texts, ranging from 155 to 9,900 characters, and usually 30k to 150k are skipped for being too short. If I take these same skipped strings and run TextReuseCorpus on them again, they are fine the second time around. Perhaps I'm simply doing something wrong?
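One thing that may be worth ruling out first: as far as the package documentation describes it, short-document skipping is based on word count rather than character count (a document needs enough words for at least two n-grams, with n = 3 assumed when none is given). The base-R sketch below flags texts that would fall under that limit. It assumes that splitting on whitespace roughly approximates the package's tokenizer; `count_words` and `too_short` are hypothetical helper names, not part of textreuse.

```r
# Hypothetical helpers (not part of textreuse).
# Approximate word counts by splitting on runs of whitespace.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Flag texts with fewer than n + 1 words, i.e. too few for two n-grams.
# n = 3 is assumed here as the default n-gram size.
too_short <- function(x, n = 3L) {
  count_words(x) < n + 1L
}

# Toy inputs standing in for the real vector of texts.
texts <- c(a = "one two three four five", b = "tiny", c = "")
too_short(texts)
```

If none of the real texts are flagged by a check like this, the skip warnings are unlikely to reflect the actual length of the texts.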

lmullen commented 4 years ago

Can you please provide a reproducible example?

tylerandrewscott commented 4 years ago

Hello,

I've been working on that, but that seems to be part of the issue: the problem is consistent, but I can't find reproducible cases. I'll keep toying with it and see if I can home in on it.

Tyler


tylerandrewscott commented 4 years ago

Following up -- I can't seem to generate a reproducible example, as the behavior is different every time, but I suspect that might point to an issue outside the package? The behavior occurs when the number of texts is above a certain threshold. For instance, I consistently get skip notices when n = 50k, but never when n = 25k.

[Screenshot: Screen Shot 2020-01-16 at 10 08 35 PM]

However, I can run the same code twice at 50k and get different sets of skipped values:

[Screenshot: Screen Shot 2020-01-16 at 10 27 36 PM]
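A comparison of two runs like this can also be scripted, which makes the differing sets easier to share than a screenshot. The sketch below is only an illustration under stated assumptions: that textreuse's `skipped()` accessor returns the IDs of skipped documents, that names on the input vector become document IDs, and that setting `options(mc.cores = 1)` keeps the package from using parallel workers. `compare_skips` is a hypothetical helper, and the toy texts stand in for the real data.

```r
# Hypothetical helper: IDs that were skipped in only one of two runs.
compare_skips <- function(run1, run2) {
  list(only_first  = setdiff(run1, run2),
       only_second = setdiff(run2, run1))
}

if (requireNamespace("textreuse", quietly = TRUE)) {
  library(textreuse)

  # Toy stand-in for the real vector; names are assumed to become document IDs.
  texts <- c(doc_a = "the quick brown fox jumps over the lazy dog",
             doc_b = "a second document that is also long enough here")

  # Assumption: a single core rules out parallel workers as a factor.
  options(mc.cores = 1L)

  run1 <- skipped(TextReuseCorpus(text = texts, progress = FALSE))
  run2 <- skipped(TextReuseCorpus(text = texts, progress = FALSE))

  # Two empty vectors would mean both runs skipped exactly the same documents.
  print(compare_skips(run1, run2))
}
```

Running the same comparison on the real 50k sample, once with a single core and once with the usual setting, would show whether the differences depend on parallel processing.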

Here is the session info:

[Screenshot: Screen Shot 2020-01-16 at 10 08 16 PM]