nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License
782 stars 82 forks source link

Incorrect text span start and end returned #53

Closed dakinggg closed 4 years ago

dakinggg commented 4 years ago

Found another example that screws up the start and end indices for the spans. I wasn't really able to reproduce it without the full text, although probably there is a smaller version of the text that doesn't break it. It also has something to do with the <|CITE|> tokens.

In[41]: text = 'Trust in journalism is not associated with frequency of media use (except in the case of television as mentioned above), indicating that trust is not an important predictor of media use, though it might have an important impact on information processing. This counterintuitive fi nding can be explained by taking into account the fact that audiences do not watch informative content merely to inform themselves; they have other motivations that might override credibility concerns. For example, they might follow media primarily for entertainment purposes and consequently put less emphasis on the quality of the received information.As <|CITE|> have claimed, audiences tend to approach and process information differently depending on the channel; they approach television primarily for entertainment and newspapers primarily for information. This has implications for trust as well since audiences in an entertainment processing mode will be less attentive to credibility cues, such as news errors, than those in an information processing mode (Ibid.). <|CITE|> research confi rms this claim -he found that audiences tend to approach newspaper reading more actively than television viewing and that credibility assessments differ regarding whether audience members approach news actively or passively. These fi ndings can help explain why we found a weak positive correlation between television news exposure and trust in journalism. It could be that audiences turn to television not because they expect the best quality information but rather the opposite -namely, that they approach television news less critically, focus less attention on credibility concerns and, therefore, develop a higher degree of trust in journalism. The fact that those respondents who follow the commercial television channel POP TV and the tabloid Slovenske Novice exhibit a higher trust in journalistic objectivity compared to those respondents who do not follow these media is also in line with this interpretation. The topic of Janez Janša and exposure to media that are favourable to him and his SDS party is negatively connected to trust in journalism. This phenomenon can be partly explained by the elaboration likelihood model <|CITE|> , according to which highly involved individuals tend to process new information in a way that maintains and confi rms their original opinion by 1) taking information consistent with their views (information that falls within a narrow range of acceptance) as simply veridical and embracing it, and 2) judging counter-attitudinal information to be the product of biased, misguided or ill-informed sources and rejecting it <|CITE|> <|CITE|> . Highly partisan audiences will, therefore, tend to react to dissonant information by lowering the trustworthiness assessment of the source of such information.'

In[42]: import pysbd
In[43]: segmenter = pysbd.Segmenter(char_span=True)
In[44]: char_spans = segmenter.segment(text)
In[45]: char_spans[-1]
Out[45]: TextSpan(sent=<correct sentence text>, start=143, end=302)
In[46]: char_spans[-2]
Out[46]: TextSpan(sent=<correct sentence text>, start=0, end=142)
In[47]: char_spans[-3]
Out[47]: TextSpan(sent=<correct sentence text>, start=0, end=152)
In[48]: char_spans[-4]
Out[48]: TextSpan(sent=<correct sentence text>, start=2139, end=2368)
nipunsadvilkar commented 4 years ago

Fixed #63