plazi / GoldenGATE-Imagine

A GUI Tool For Freeing Text and Data from PDF Documents

Token count #47

Closed millerjeremya closed 11 months ago

millerjeremya commented 11 months ago

I'm trying to work out how to interpret "Number of tokens" values acquired from SRS. I thought this would be a count of the number of tokens (~words) in a treatment. For this example, I get three different values: 7, 28, and 349. (For comparison, MS Word calculates 336 words in this treatment text.) [screenshot]

gsautter commented 11 months ago

> I thought this would be a count of the number of tokens (~words) in a treatment

For individual treatments (aggregate "show individual values"), it is exactly that, even though a token can be a whole word, a number, or a punctuation mark (which might explain the discrepancy with the count Word comes up with).
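To illustrate why a token count can exceed a word count, here is a minimal sketch. It assumes a token is either a run of letters/digits or a single punctuation mark; the actual GoldenGATE tokenization rules may well differ, and the function names and sample string are made up for illustration.

```python
import re


def count_words(text: str) -> int:
    # Whitespace-delimited chunks, roughly what a word processor counts.
    return len(text.split())


def count_tokens(text: str) -> int:
    # Assumption (not the actual GoldenGATE rule): a token is a run of
    # word characters, or any single non-space punctuation character.
    return len(re.findall(r"\w+|[^\w\s]", text))


# Made-up sample text using the placeholder name "Aus bus".
sample = "Aus bus Smith, 1900 (Fig. 3)."

print(count_words(sample), count_tokens(sample))
```

Under these assumptions every comma, period, and bracket adds to the token count but not to the word count, so the two figures diverge quickly in citation-heavy text.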

However, after double-keying the treatment UUID from your screenshot (please include it in the text next time), I get this: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+doc.numTokens+bib.author+bib.year+tax.name+tax.rank&groupingFields=doc.uuid+doc.numTokens+bib.author+bib.year+tax.name+tax.rank&FP-doc.uuid=03CE87F5FF86725DFF13DDAC88A2FD3A&format=HTML

Specifically, there is only one token count for the treatment as a whole ... looks like what you selected was "Number of Tokens" in the "Section Data" field group, showing the number of tokens per subSubSection, not for the treatment as a whole ... the field for the latter count is in the "Treatment & User Data" field group, and it gives you the above.
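For anyone who wants to build the same stats query for a different treatment, the URL above can be assembled along these lines. This is only a sketch: it assumes the "+" signs in the URL are URL-encoded spaces separating field names, and `build_stats_url` is a hypothetical helper, not part of any Plazi tooling. The parameter names and values are copied from the link above.

```python
from urllib.parse import urlencode

BASE_URL = "https://tb.plazi.org/GgServer/srsStats/stats"

# Treatment-level fields, as they appear in the URL above.
FIELDS = ["doc.uuid", "doc.numTokens", "bib.author",
          "bib.year", "tax.name", "tax.rank"]


def build_stats_url(treatment_uuid: str, fmt: str = "HTML") -> str:
    # Hypothetical helper: joins field names with spaces, which
    # urlencode() turns into "+" (assumed to match the server's format).
    params = {
        "outputFields": " ".join(FIELDS),
        "groupingFields": " ".join(FIELDS),
        "FP-doc.uuid": treatment_uuid,  # filter on the treatment UUID
        "format": fmt,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_stats_url("03CE87F5FF86725DFF13DDAC88A2FD3A")
print(url)
```

Note that `doc.numTokens` here is the treatment-level count; the per-section count from the "Section Data" group would use a different field name, which is not shown in this thread.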

millerjeremya commented 11 months ago

Ah, thanks! Just needed to get the right Token field.

gsautter commented 11 months ago

Yeah, there's two of them, one for the treatments as a whole, and one for the individual s(ubSubS)ections ... regarding the overall count, I tend to think Word disregards the punctuation marks, which our data model does not.