grantjenks / python-wordsegment

English word segmentation, written in pure-Python, and based on a trillion-word corpus.
http://www.grantjenks.com/docs/wordsegment/

feature_request(mode): preserve all punctuation marks #31

Open Kristinita opened 3 years ago

Kristinita commented 3 years ago

1. Summary

It would be nice if WordSegment, at least in CLI mode, had an option to preserve all punctuation marks (periods, commas, and so on).

2. Problem

  1. Scientific article example
  2. Scientific book example

Try copying and pasting text from this article and this book.

  1. The article:

    Sharing-economyfirmsdifferfromold-powerfirmsbecausetheformertypicallyareexponentialnew-powerorganisationscharac-terisedbyPorter’scompetitiveforces.Althoughsomenew-powerfirmsmaychoosenottoembraceastakeholderfocus,stakehold-ersandothernew-powerfirmswillpunishsuchchoices.Inotherwords,counterargumentstothesharingeconomy’sstakeholderpo-tentialbasedonthequestionableactionsofsomenew-powerfirmsareovershadowedbyothernew-powerfirmsandtheirstakehold-ers’actions.

  2. The book:

    Accordingto DavidAllen,authorof the bestsellerGettingThingsDone(2001),informationprofessionalshavea hardtimeaccomplishingtasksbecauseour workis inherentlyambiguous,we takeon too manycommit-ments,andwe cannotprioritizethe bestthingto do fromthe manychoicesbeforeus. J. WesleyCochran(1992),JudithSiess(2002),SamanthaHines(2010),andotherauthorsof timemanagementtreatisesfor librar-iansconcurthatlibrarieshavebeendifficultplacesto workfor years,especiallygivenour complexworkprocessesandoftenintangibleprod-ucts.Nevertheless,we havethe abilityas individualsto adoptbetterstrategiesto managethe everydaychaos.

Ideally, of course, these PDFs would have a proper text layer, but I am not the one producing these articles and books. In my experience, a text layer without spaces like this is a common problem, and separating the words by hand can be time-consuming.

3. Behavior

3.1. Current

CLI usage:

sharing economy firms differ from old power firms because the former typically are exponential new power organisations characterised by porters competitive forces although some new power firms may choose not to embrace a stakeholder focus stakeholders and other new power firms will punish such choices in other words counterarguments to the sharing economy s stakeholder potential based on the questionable actions of some new power firms are overshadowed by other new power firms and their stakeholders actions

according to david allen author of the best seller getting things done 2001 information professionals have a hard time accomplishing tasks because our work is inherently ambiguous we take on too many commitments and we can not prioritize the best thing to do from the many choices before us j wesley cochran 1992judithsiess2002 samantha hines2010 and other authors of time management treatises for librarians concur that libraries have been difficult places to work for years especially given our complex work processes and often intangible products nevertheless we have the ability as individuals to adopt better strategies to manage the everyday chaos

Punctuation marks are stripped. Users have to do a lot of routine work to get them back.

3.2. Expected behavior

Ordinary English texts:

Sharing economy firms differ from old power firms because the former typically are exponential new power organisations characterised by Porter’s competitive forces. Although some new power firms may choose not to embrace a stakeholder focus, stakeholders and other new power firms will punish such choices. In other words, counterarguments to the sharing economy’s stakeholder potential based on the questionable actions of some new power firms are overshadowed by other new power firms and their stakeholders’ actions.

According to David Allen, author of the bestseller Getting Things Done(2001), information professionals have a hard time accomplishing tasks because our work is inherently ambiguous, we take on too many commitments, and we can not prioritize the best thing to do from the many choices before us. J. Wesley Cochran(1992), Judith Siess(2002), Samantha Hines(2010) and other authors of time management treatises for librarians concur that libraries have been difficult places to work for years, especially given our complex work processes and often intangible products. Nevertheless, we have the ability as individuals to adopt better strategies to manage the everyday chaos.

Thanks.

grantjenks commented 3 years ago

Use a regex to break the input into chunks separated by punctuation, then segment each chunk and rejoin the results with the original punctuation. The punctuation adds meaningful segmentation hints, so stripping it out will reduce the quality. Segmentation works best on smaller phrases anyway.
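A minimal sketch of the chunking approach described above, not the library's own implementation. The names PUNCT, segment_keep_punctuation, and toy_segment are hypothetical, and the choice of what counts as punctuation (any run of characters other than ASCII letters and digits) is an assumption. toy_segment is only a stand-in so the example runs without the corpus; in real use you would call wordsegment.load() once and pass wordsegment.segment instead.

```python
import re

# Assumption: any run of characters that is not an ASCII letter or digit
# is a separator (punctuation or whitespace) and is kept verbatim.
PUNCT = re.compile(r"([^A-Za-z0-9]+)")


def segment_keep_punctuation(text, segment_fn):
    """Segment each alphanumeric chunk and rejoin with the original separators.

    segment_fn takes a string and returns a list of words,
    for example wordsegment.segment after wordsegment.load().
    """
    # The capture group makes re.split keep the separators:
    # even indices are chunks, odd indices are separators.
    parts = PUNCT.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            out.append(part)  # separator, copied unchanged
        elif part:
            out.append(" ".join(segment_fn(part)))
    return "".join(out)


def toy_segment(chunk):
    """Stand-in segmenter: greedy longest match on a tiny vocabulary."""
    vocab = {"in", "other", "words", "we", "have", "the", "ability"}
    chunk = chunk.lower()
    words, start = [], 0
    while start < len(chunk):
        for end in range(len(chunk), start, -1):
            # Fall back to a single character if nothing matches.
            if chunk[start:end] in vocab or end == start + 1:
                words.append(chunk[start:end])
                start = end
                break
    return words


print(segment_keep_punctuation("Inotherwords,wehavetheability.", toy_segment))
```

This sketch only keeps the punctuation in place. It does not add a space after a comma or period, and it does not rejoin words hyphenated across line breaks (such as charac-terised), so those would need a separate pass.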

grantjenks commented 3 years ago

The strategy also applies to capitalization.
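One possible reading of this for capitalization, sketched under assumptions: if a chunk contains only ASCII letters and digits, the segmenter should only lowercase it and insert word breaks, so the original casing can be copied back by walking the chunk with the word lengths. The function name restore_case is hypothetical and is not part of WordSegment.

```python
def restore_case(original, words):
    """Map lowercase segmented words back onto the original chunk's casing.

    Assumes `original` is an ASCII alphanumeric chunk and that `words`
    joined together equals the lowercased chunk.
    """
    if "".join(words) != original.lower():
        raise ValueError("words do not line up with the original chunk")
    restored, pos = [], 0
    for word in words:
        # Take the same number of characters from the original text.
        restored.append(original[pos:pos + len(word)])
        pos += len(word)
    return restored


print(" ".join(restore_case("GettingThingsDone", ["getting", "things", "done"])))
```

The length check is there because the mapping silently produces wrong output if the segmenter dropped or changed any characters.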