retextjs / retext

Natural language processor powered by plugins; part of the @unifiedjs collective
https://unifiedjs.com
MIT License

Issue when parsing some sentences ending with numbers #58

Closed alaa-eddine closed 3 years ago

alaa-eddine commented 4 years ago

Subject of the issue

I stumbled upon a strange case where retext fails to detect some sentences ending with a number (but not all).

Your environment

Steps to reproduce

I tried to parse the following string: "Hello 30. Hello world."

const retext = require('retext');
const visit = require('unist-util-visit');
const toString = require('nlcst-to-string');

const str = 'Hello 30. Hello world.';

// Parse the text into an nlcst syntax tree.
const tree = retext.parse(str);

// Print every sentence the parser detected.
visit(tree, 'SentenceNode', (node) => {
    console.log('>', toString(node));
});

Expected behaviour

output :

> Hello 30.
> Hello world.

Actual behaviour

output :

> Hello 30. Hello world.

Please note that if I test the same code with the string "Hello 3030. Hello world.", it works just fine.

It seems to happen only with numbers that have fewer than four digits.

alaa-eddine commented 4 years ago

Hello devs :) Is there any update on this? If you can give me some hints about the part of the code that parses numbers and sentences, I can try to fix it.

wooorm commented 4 years ago

Whoops!

Hi there! 👋 Sorry for the wait!

This comes from: https://github.com/wooorm/parse-latin/blob/6e606a372cdec62e1a71cbf2cfb4d5ca40797622/lib/plugin/merge-prefix-exceptions.js#L12

Even in running text, people often use numbers followed by a dot for “lists”: it could be one of two things: 1. this, or: 2. that. In those cases, the number + dot is not a break between sentences. So the behaviour is intentional, but I can see an argument for either option.
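To make the rule above concrete, here is a minimal sketch of the kind of check being described. The pattern is an assumption inferred from the behaviour reported in this thread (one to three digits, or a single letter); the actual list at the linked line of parse-latin also covers abbreviations and may differ. The names `prefixException` and `isPrefixException` are hypothetical.

```javascript
// Assumed pattern, NOT copied from parse-latin: a word made of one to three
// digits, or a single letter, is treated as a prefix such as a list marker.
const prefixException = /^(?:[0-9]{1,3}|[a-z])$/;

// Returns true when the word before a full stop looks like such a prefix,
// meaning the full stop should not be treated as the end of a sentence.
function isPrefixException(word) {
  return prefixException.test(word.toLowerCase());
}

console.log(isPrefixException('30'));   // true: "Hello 30." is not closed
console.log(isPrefixException('3030')); // false: "Hello 3030." is closed
```

Under this assumption, "30" is merged with what follows while "3030" is not, which matches the two outputs in the report.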

alaa-eddine commented 4 years ago

Hey @wooorm, thank you for your answer. I see the problem here, but the way it's implemented makes it difficult to fix without forking the repo and its dependencies. Would it be possible to add a way in retext to override those exceptions?

wooorm commented 4 years ago

How would you suggest fixing it? Removing the exception would break the other case: numbers followed by periods in running text.
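The trade-off can be illustrated with a toy splitter. This is only a sketch and not retext's or parse-latin's algorithm; `splitSentences` and its `exception` parameter are hypothetical names, and the default pattern is an assumption based on the behaviour reported in this thread.

```javascript
// Toy sentence splitter: ends a sentence after any token that ends in "."
// unless the word before the full stop matches `exception`.
// Pass `null` as `exception` to disable the rule entirely.
function splitSentences(text, exception = /^[0-9]{1,3}$/) {
  const sentences = [];
  let current = '';
  for (const token of text.split(' ')) {
    current = current ? current + ' ' + token : token;
    if (token.endsWith('.')) {
      const word = token.slice(0, -1);
      if (!exception || !exception.test(word)) {
        sentences.push(current);
        current = '';
      }
    }
  }
  if (current) sentences.push(current);
  return sentences;
}

const report = 'Hello 30. Hello world.';
const list = 'It could be: 1. this, or: 2. that.';

console.log(splitSentences(report).length);       // 1 (the reported case)
console.log(splitSentences(report, null).length); // 2 (the expected case)
console.log(splitSentences(list).length);         // 1 (list kept intact)
console.log(splitSentences(list, null).length);   // 3 (list broken apart)
```

With the exception enabled the list stays in one sentence but "Hello 30." is merged; with it disabled the report is split as expected but the list falls apart. No single setting handles both inputs.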

wooorm commented 3 years ago

Closing: natural language is really hard to classify with rules, and I can’t see how this could be fixed.