Open tylerandrewscott opened 4 years ago
Can you please provide a reproducible example?
Hello,
I've been working on that, but that seems to be part of the problem: the issue occurs consistently, yet I can't pin down a case that reproduces it. I'll keep experimenting and see if I can home in on it.
Tyler
On Thu, Jan 16, 2020, 6:05 PM Lincoln Mullen notifications@github.com wrote:
Can you please provide a reproducible example?
Following up -- I can't seem to generate a reproducible example because the behavior differs on every run, which I suspect might point to an issue outside the package. The behavior appears once the number of texts passes a certain threshold. For instance, I consistently get skip notices with n = 50k texts, but never with n = 25k.
However, I can run the same code twice at 50k and get different sets of skipped values:
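The original output is not preserved in this thread. As a minimal sketch of how such a comparison could be run, the snippet below builds the corpus twice from the same named vector and compares the IDs reported by `skipped()`. The synthetic `texts` vector, the helper name `run_once`, and the choice of `n = 3` are illustrative assumptions, not the data or code from this report.

```r
library(textreuse)

# Hypothetical stand-in data: named texts, each 40 words long,
# so none should be too short for 3-grams.
set.seed(1)
words <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta")
texts <- vapply(
  seq_len(1000),
  function(i) paste(sample(words, 40, replace = TRUE), collapse = " "),
  character(1)
)
names(texts) <- paste0("doc", seq_along(texts))

# Build the corpus and return the IDs that were skipped as too short.
# run_once is a hypothetical helper, not part of textreuse.
run_once <- function(x) {
  corpus <- suppressWarnings(
    TextReuseCorpus(text = x, tokenizer = tokenize_ngrams, n = 3,
                    progress = FALSE)
  )
  skipped(corpus)
}

skipped_a <- run_once(texts)
skipped_b <- run_once(texts)

# IDs skipped in only one of the two runs. A non-empty result means the
# skipping is not deterministic for identical input.
only_a <- setdiff(skipped_a, skipped_b)
only_b <- setdiff(skipped_b, skipped_a)
```

Scaling `seq_len(1000)` up toward the sizes where the skip notices start appearing would show whether `only_a` and `only_b` stay empty.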
Here is the session info:
I am encountering an issue with the TextReuseCorpus function when I pass in a vector of texts (using the "text = " argument): (1) I receive warnings that texts were skipped for insufficient length, even though the character strings should be long enough; and (2) I get a different number of skip warnings on each run. I am reading in a large vector (>300,000) of texts, ranging from 155 to 9,900 characters, and usually 30k to 150k of them are skipped for being too short. If I take these same skipped strings and run TextReuseCorpus on them again, they are processed without any problem. Perhaps I'm simply doing something wrong?
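One way to check claim (1) independently of the package is to count words in base R. As I understand the textreuse documentation, a document counts as too short when it has too few words to form at least two n-grams, that is, fewer than n + 1 words, with n = 3 assumed when no value is given. The sketch below applies that rule with a rough whitespace-based count; `count_words`, the example `texts`, and the document names are hypothetical.

```r
# Rough whitespace-based word count; an approximation, not the
# tokenizer that textreuse itself uses.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "[[:space:]]+"))
}

# Hypothetical example texts; replace with the real input vector.
texts <- c(
  doc1 = "a reasonably long sentence with more than enough words in it",
  doc2 = "too short",
  doc3 = ""
)

n <- 3  # assumed n-gram size
word_counts <- count_words(texts)

# Texts with fewer than n + 1 words cannot yield two n-grams of size n.
too_short <- names(texts)[word_counts < n + 1]
```

Any ID that the package reports as skipped but that does not appear in `too_short` is a candidate for the unexpected behavior described above.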