nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License
812 stars 84 forks

Specific string causes segment function to return empty array #126

Open NiftyliuS opened 6 months ago

NiftyliuS commented 6 months ago

Describe the bug The segment function returns an empty array for the specific string below.

To Reproduce input_str = """This is part 3 of MAMI-san's hair timelineThe previous hair timelines can be found hereOkay then, we'll be continuing from last time and starting off with MAMI-san's orange era, which was at this time2014/8 - Continuing the orange bob2014/09 - Used to seeing the orangeOrange fades easily, so we did a lot of maintenence2014/10 - Her hair grew long with seal extensionsA new song was released while being orange2014/11 - Completely orange☝︎A new album was also released while being orange2014/12 - The ends and bangs were cut straight acrossHeading into 2015JanuaryMAMI-san is super stylish, isn't sheFebruary2015/03 - The extensions got a little shorterMAMI had been orange for exactly one yearIn April, before their world tour, we took out the orange and put in a turquoise blue gradient2015/05 - While on break from tour, we took out the ash color and did maintenence2015/062015/072015/08 - Returning from tour and cutting it into a bobReleasing a new song while having two-toned hair2015/09 - We put in ash with a transparent feeling to it2015/11 - Her first image change in eight months2015/12 - An easy-going color so that the pudding color isn't conspicuousHer black hair has been missed by the fans, hasn't itWhen will the dyed-black MAMI-san be woken up by another desire to bleach her hairAlso, it's not that her lovely hair and color were worn out; at RISEL we have a 『hybrid bleach』 of an original super bleach dream, so please don't worryI will make her hair colors beautiful enough so that everyone will be surprised by any color, and I will support her on behalf of everyone so that MAMI-san can give her best performanceLooking back, even though we changed intense colors so many times, everything definitely looks good on MAMI-san, doesn't itWell then, please look forward to MAMI-san's hair timeline again next yearRISEL.xoxo.KAZU"""

import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)
segments = segmenter.segment(input_str)

Expected behavior An array of one or more sentences.

Additional context The text originates from the openwebtext dataset. I also found cases where it removes spaces from sentences, or adds spaces that were not in the original strings.

Actual behavior Empty array
NiftyliuS commented 6 months ago

After digging further, I found that the "☝" character breaks something, although no errors show up in the console.
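For anyone who wants to narrow down a similar failure, here is a rough sketch of how a single offending character could be isolated. The helper name find_breaking_chars is made up for illustration and is not part of pySBD; it takes the segmenter as a plain callable, and it assumes the trigger is a non-ASCII character, so it only tries removing those.

```python
from typing import Callable, List


def find_breaking_chars(text: str, segment_fn: Callable[[str], List[str]]) -> List[str]:
    """Return non-ASCII characters whose removal makes segment_fn produce output again.

    Hypothetical helper for debugging; not part of pySBD.
    """
    if segment_fn(text):
        return []  # segmentation already returns something; nothing to isolate

    culprits = []
    # Assumption: the trigger is a non-ASCII character, so only those are tried.
    for ch in sorted({c for c in text if ord(c) > 127}):
        # Remove every occurrence of this one character and retry.
        if segment_fn(text.replace(ch, "")):
            culprits.append(ch)
    return culprits
```

With the reproduction above, it could be called as find_breaking_chars(input_str, segmenter.segment). If removing any single character is not enough (for example, when two different characters each trigger the problem), the returned list will be empty and a different bisection is needed.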

As far as additional strings go:

demo_text = """For years Stephen Harper and his Cons have been slowly killing our Parliament.They have have debased it, they have rendered it impotent.They have reduced it to a scripted horror show, where every question is answered with an attack on the opposition.But yesterday with their ghastly leader out of the country they practically finished it off.For this is what happened when Tom Mulcair rose to ask this question about Canada’s mysterious mission in Iraq:Instead of any kind of answer, he got Paul Calandra, Stephen Harper’s Parliamentary Secretary.You know the Con clown . . .And this:Which as Aaron Wherry points out, was absurd enough. But what happened next, when Mulcair continued his questioning, and appealed to the Speaker Andrew Scheer to ask Calandra to follow the rules of democratic decency, turned our Parliament into a cheap FARCE . . .Yes, believe it or not, rather than ask the clown Calandra to answer the questions, or to stop turning Question Period into some kind of Con cabaret, Scheer punished Mulcair.Removing his two remaining questions, and moving on to Justin Trudeau.Even though Mulcair had every right to keep trying to get a serious answer about a very serious issue. He was merely asking Scheer to stop allowing Calandra to turn the House he looks down upon from his throne, into a bad joke or a fascist circus.Or the death of our democracy. Wherry:This is not quite rocket science. These are merely the hopey changey principles on which we aim to govern ourselves.And who can blame Mulcair for questioning Sheer’s neutrality? When he has made so many dubious decisions. And has from the moment he sat on his throne acted and sounded like a Con robot…Or his master’s voice.And all I can say is, before we have to hold a mirror up to the cold blue lips of our democracy, to see if it’s still alive.When we fire his maniac master . . ."""

This one adds a space at the end of the last sentence.

demo_text = """> > Field experience shows it successfully delivers new features to end users > without a global software upgrade. > The global upgrade is required for all full nodes in both types. If a full node doesn't upgrade then it no longer does what it was designed to do; if the user is OK with that, they should just run an SPV wallet or use blockchain.info or some other mechanism that consumes way fewer resources. But if you want the software you installed to achieve its stated goal, you *must* upgrade. There is no way around that. Jorge has said soft forks always lead to network convergence. No, they don't. You get constant mini divergences until everyone has upgraded, as opposed to a single divergence with a hard fork (until everyone has upgraded). The quantity of invalid blocks mined, on the other hand, is identical in both types. Adam has said "there is actually consensus", although I just said there isn't. Feel free to say what you really mean here Adam - there's consensus if you ignore people who don't agree, i.e. the concept of "developer consensus" doesn't actually mean anything. This would contradict your prior statements about how Bitcoin Core makes decisions, but alright .... Finally John, I fully agree with what you wrote. Debates that never end are bad news all round. Bitcoin Core has told the world it uses "developer consensus" to make decisions. I don't agree that's a good way to do things, but if Core wants to stick with it then there is no choice - as I am a developer, and I do not agree with the change, there is no consensus and the debate is over. Hey, I have an idea. Maybe we should organise a conference about soft vs hard forks. Let's have it down the road from where I live, a couple of weeks from now. Please submit your talk titles to me so I can vet them to ensure nobody does an offtopic talk ;) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: <http://lists.linuxfoundation.org/pipermail/bitcoin-dev/attachments/20150930/5e0bab14/attachment.html>"""

And this one removes the space after "....": "...but alright .... Finally John..." => [" ...but alright.", "...", "Finally..."]. The last segment should start with " Finally", not "Finally".
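A quick way to spot these added or removed spaces is to join the segments back together and compare against the input. The sketch below assumes that, with clean=False, the concatenated segments are supposed to equal the original string exactly; the helper name first_mismatch is hypothetical and not part of pySBD.

```python
from typing import List, Optional


def first_mismatch(original: str, segments: List[str]) -> Optional[int]:
    """Return the index of the first character where the joined segments
    differ from the original text, or None if they are identical.

    Hypothetical helper for debugging; not part of pySBD.
    """
    joined = "".join(segments)
    for i, (a, b) in enumerate(zip(original, joined)):
        if a != b:
            return i
    # All shared positions match, so any difference is extra or missing trailing text.
    if len(original) != len(joined):
        return min(len(original), len(joined))
    return None
```

For example, first_mismatch(demo_text, segmenter.segment(demo_text)) would point at the position where a space was dropped or inserted, which makes it easier to report the exact spot.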