neosyon / SimpTextAlign

Repo for the simplified text alignment tools.
MIT License

Paragraph headings not recognised as sentences #1

Closed by feralvam 6 years ago

feralvam commented 6 years ago

Hi, when running the tool with default parameters, I get alignments such as these:

11: ## Following In Some Generous Footsteps In 2010 , Bill Gates and Warren Buffett publicly launched the Giving Pledge to encourage billionaires to donate the bulk of their wealth to charity . ---(0.8202991224743326)---> 9: In 2010 , Bill Gates and Warren Buffett publicly launched the Giving Pledge to encourage billionaires to donate the bulk of their wealth to charity .

25: ## Using Technology To Change Learning In the open letter , Zuckerberg and Chan talked about the potential that technology offers to re-engineer the way children learn . ---(0.8480297902555115)---> 20: In the open letter , Zuckerberg and Chan talked about the potential that technology offers to re-engineer the way children learn .

As can be seen, the section headings "## Following In Some Generous Footsteps" and "## Using Technology To Change Learning" were not identified as separate sentences, even though each one is on its own line in the Newsela article. Instead, each heading was merged with the sentence that follows it.

Is there a way to prevent this from happening? Maybe changing some property of the sentence splitter internally used by the tool? I haven't checked if this happens with every section heading in every article, but it does happen in all the ones I've manually checked (around 10 original articles with their 4 versions).

Thank you for your help, Fernando

neosyon commented 6 years ago

Dear Fernando,

Thank you for pointing out this issue. We had a problem with the sentence splitter.

Now it's fixed. Let us know if you find any other issue.

Best

sstajner commented 6 years ago

Dear Fernando,

Another way to deal with the ## issue is to exclude lines that start with ## from the whole corpus before feeding it into our tool. We noticed that those subtitles are usually either left unchanged or completely eliminated in the simplified versions, so they do not really contribute to the TS dataset built in this way.

Best regards, Sanja

On 15 June 2018 at 13:48, Marc F. S. wrote:

> Dear Fernando,
>
> Thank you for pointing out this issue. We had a problem with the sentence splitter.
>
> Now it's fixed. Let us know if you find any other issue.
>
> Best, Marc


feralvam commented 6 years ago

@neosyon Thank you.

@sstajner Thanks for the advice. I think I will follow this approach because, as you've mentioned, those ## lines don't really add much information.
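A minimal sketch of that preprocessing step in Python, assuming each article is a plain-text file with one paragraph or heading per line. The function names `drop_heading_lines` and `filter_file` are hypothetical and not part of SimpTextAlign:

```python
from typing import Iterable, Iterator


def drop_heading_lines(lines: Iterable[str], marker: str = "##") -> Iterator[str]:
    """Yield only the lines that do not start with the heading marker."""
    for line in lines:
        # lstrip() so that headings with leading whitespace are also dropped
        if line.lstrip().startswith(marker):
            continue
        yield line


def filter_file(src_path: str, dst_path: str, marker: str = "##") -> None:
    """Copy src_path to dst_path, skipping lines that start with the marker."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(drop_heading_lines(src, marker))
```

Applying `filter_file` to every version of an article (original and simplified) before running the aligner should keep both sides of each pair consistent, though I have only sketched this and not run it over the full corpus.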