vanderlee / php-sentence

Simple text sentence splitting and counting. Supports at least English, German and Dutch, possibly more. If you find it works well enough for your language, please let me know!
MIT License

floatNumberClean tokenization issue #18

Closed: lisfox1 closed this issue 2 years ago

lisfox1 commented 2 years ago

Sentence::floatNumberClean doesn't work as intended. When multiple numbers need to be tokenized, it also replaces numbers inside the tokens it has already inserted. This corrupts the tokens in the text, so they cannot be parsed back to numbers in Sentence::floatNumberRevert. The issue can also cause an "Error: Allowed memory size of 536870912 bytes exhausted" exception due to the size of the tokens.

Test:

$sentence = new Sentence();
$this->assertSame(["He got £2.","He lost £2.","He had £2."], $sentence->split("He got £2. He lost £2. He had £2."));

Expected:

[
  0 => 'He got £2.'
  1 => 'He lost £2.'
  2 => 'He had £2.'
]

Actual:

[
  0 => 'He got £c81e7c81e728d9d4c2f636f067f89cc14862c8d9d4cc81e728d9d4c2f636f067f89cc14862cf636f067f89cc1486c81e728d9d4c2f636f067f89cc14862cc.'
  1 => ' He lost £c81e7c81e728d9d4c2f636f067f89cc14862c8d9d4cc81e728d9d4c2f636f067f89cc14862cf636f067f89cc1486c81e728d9d4c2f636f067f89cc14862cc.'
  2 => ' He had £c81e7c81e728d9d4c2f636f067f89cc14862c8d9d4cc81e728d9d4c2f636f067f89cc14862cf636f067f89cc1486c81e728d9d4c2f636f067f89cc14862cc.'
]
pointedhat32167 commented 2 years ago

Same issue here:

$text = 'If at 8:00 pm, do something, there is a good chance that by 8:45 pm we do something else. This is another sentence.';
$Sentence = new Sentence();
$Sentence->split($text);

Array
(
    [0] => If at c9f0f895fb98ab9159f51fd0297e236d:00 pm, do something, there is a good chance that by c9f0f895fb98ab9159f51fd0297e236d:45 pm we do something else.
    [1] => This is another sentence.
)

vanderlee commented 2 years ago

Looking into this. It appears that the digits inside the hash tokens are being recursively replaced. Looking for a clean way to fix this.
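
A minimal sketch of how this kind of recursive replacement can arise. It assumes the tokens are MD5 hashes of the matched numbers (the tokens in the reports above are consistent with that) and that each number is substituted with a whole-string str_replace. The function name naiveClean and the simplified \d+ pattern are hypothetical; this is not the library's actual code.

```php
<?php
// Hypothetical reproduction of the failure mode, not the library's code.
function naiveClean(string $text): string
{
    // Collect every run of digits in the ORIGINAL text.
    preg_match_all('/\d+/', $text, $matches);

    foreach (array_unique($matches[0]) as $number) {
        // str_replace scans the whole current string, so it also rewrites
        // digits inside hash tokens inserted by earlier iterations.
        $text = str_replace($number, md5($number), $text);
    }

    return $text;
}

$cleaned = naiveClean('It comes from 11 to 12 years of age.');

// The token for "11" contains the digits "12", so the later pass for "12"
// rewrites part of it and the token no longer appears intact.
$tokenSurvived = strpos($cleaned, md5('11')) !== false;
```

Once a token has been rewritten like this, a revert step that looks up whole tokens cannot find it anymore, which matches the unreverted hashes in the outputs above.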

vanderlee commented 2 years ago

Should be fixed in the code. I have yet to bump the package version.

pointedhat32167 commented 2 years ago

Thanks for fixing it! Updating the code fixed these texts, but not this one: "It comes from 11 to 12 years of age. It lasts from age 11 to about 15."

Array
(
    [0] => It comes from 6512bd43d9caa6e02c990b0a82652dca to 12 years of age.
    [1] => It lasts from age 6512bd43d9caa6e02c990b0a82652dca to about 15.
)

vanderlee commented 2 years ago

It was still doing some recursive replacing because "12" appears in the hash for "11". The current code should do single replacements only, which prevents this. I also generalized the code a bit for some possible future improvements.
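
One way to get single replacements only, sketched under assumptions: the function names cleanNumbers and revertNumbers, the MD5 tokens, and the simplified \d+ pattern are all hypothetical and may differ from what the library actually does. The idea is that both directions walk the string once, so inserted text is never scanned again.

```php
<?php
// Sketch only: names and pattern are illustrative, not the library's API.
function cleanNumbers(string $text, array &$map): string
{
    // preg_replace_callback matches against the original subject once,
    // left to right; the tokens it inserts are not rescanned.
    return preg_replace_callback('/\d+/', function (array $m) use (&$map): string {
        $token = md5($m[0]);
        $map[$token] = $m[0]; // remember how to restore this token
        return $token;
    }, $text);
}

function revertNumbers(string $text, array $map): string
{
    // strtr with an array of pairs also works in a single pass and does
    // not re-examine text it has already substituted.
    return strtr($text, $map);
}

$map = [];
$original = 'It comes from 11 to 12 years of age. It lasts from age 11 to about 15.';
$cleaned  = cleanNumbers($original, $map);
$restored = revertNumbers($cleaned, $map);
```

With this approach the token for "11" stays intact even though it contains "12", so reverting gives back the original text.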

pointedhat32167 commented 2 years ago

Thanks!