Open haakym opened 7 years ago
To be honest, my understanding of the regex is pretty limited (as this was a method described here: http://php.net/manual/en/function.str-word-count.php#107363), but I have been having persistent problems with accuracy along these lines - I'm trialing various improvements but thank you for the heads up on this issue. I'll have more of a chance to play around with these tomorrow.
Great stuff.
Regarding regex, I'm in the same boat as you. I've always found this website helpful, and I hope it helps you too: http://regexr.com.
Another point I forgot to mention about how the word count is calculated: the preg_split() call is, I assume, splitting the text wherever it finds a space, new line, paragraph break, etc. into an array, and the array count is then returned as the word count. When a document being read begins or ends with a space or new line, this may add one extra element to the array.
Here's a simple example illustrating this using explode...
>>> $str = ',a,b,c,';
=> ",a,b,c,"
>>> $array = explode(',', $str);
=> [
"",
"a",
"b",
"c",
"",
]
>>> count($array)
=> 5
I think this might be why the PDF word count comes out 2 words too high and why the resulting array starts and ends with "".
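If that guess is right, one possible fix is to ask preg_split() to drop empty elements via its PREG_SPLIT_NO_EMPTY flag. The sketch below is only an illustration: the helper name countWords and the '/\s+/' pattern are assumptions for the example, not the package's actual method or regex.

```php
<?php
// Hypothetical helper: the name and the whitespace pattern are assumptions,
// not the code this package actually uses.
function countWords(string $text): int
{
    // PREG_SPLIT_NO_EMPTY discards the empty strings that leading or
    // trailing delimiters would otherwise produce.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    return $words === false ? 0 : count($words);
}

// Sample text that begins and ends with a new line.
$text = "\nfirst second third\n";

// Naive split keeps the empty first and last elements.
$naive = count(preg_split('/\s+/', $text));

// Split with PREG_SPLIT_NO_EMPTY counts only the real words.
$fixed = countWords($text);

// The same idea applied to the explode() example, filtering out empty strings.
$parts = array_filter(explode(',', ',a,b,c,'), fn ($s) => $s !== '');
$partCount = count($parts);
```

Here $naive includes the two empty elements while $fixed and $partCount do not, which would account for a count that is off by two when the text starts and ends with whitespace.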
That makes sense to me. In fact, that's incredibly useful.
I created a .doc and a .pdf file with the same text contents to test this package, and I got the following differing word counts:
I believe this is happening because of the regex used to split the text for the .doc file, which appears to ignore the string <br />, and because there are 5 paragraphs in the text there are 5x <br />, which results in five more words. The regex in question is used in the following method:
Here is the text content inside the documents:
When running the following code for the .doc file
the value of $docWordCount is as follows:
Please note the <br />. You might want to copy and paste into a code editor to inspect properly.
When running the following code for the .pdf file
the value of $docWordCount is as follows: