hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Improvements: virama and nukthas of Indic languages, easy way to specify basic protected patterns #103

Closed thammegowda closed 3 years ago

thammegowda commented 4 years ago

Changes proposed:

  1. Added viramas and nuktas of Indian languages so that the tokenizer doesn't butcher them
  2. -p :basic: to enable basic protected patterns
  3. -p :web: to enable protected patterns for the web: @user #hashtag user@host.com https://host.com?k1=v1&k2=v2
  4. Added __main__.py so we can call python -m sacremoses; this is mainly useful for development, e.g. PYTHONPATH=path/to/repo python -m sacremoses
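
To illustrate why item 1 matters, here is a minimal sketch (not the code in this pull request). It assumes a simplified tokenizer that keeps only runs of alphanumeric characters; the helper names split_runs, naive and mark_aware are hypothetical. Viramas and nuktas are combining marks (Unicode category Mn), so a plain alphanumeric test treats them as separators and splits a word into fragments.

```python
import unicodedata
from itertools import groupby

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA
NUKTA = "\u093c"   # DEVANAGARI SIGN NUKTA
word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari


def split_runs(text, is_word_char):
    """Return the maximal runs of characters accepted by is_word_char."""
    return ["".join(group) for keep, group in groupby(text, key=is_word_char) if keep]


def naive(ch):
    # Hypothetical rule: only letters and digits belong to a word.
    return ch.isalnum()


def mark_aware(ch):
    # Hypothetical fix: also keep combining marks (viramas, nuktas, vowel signs).
    return ch.isalnum() or unicodedata.category(ch) in ("Mn", "Mc")


naive_pieces = split_runs(word, naive)       # the word falls apart at every mark
aware_pieces = split_runs(word, mark_aware)  # the word stays whole

print(unicodedata.category(VIRAMA), unicodedata.category(NUKTA))
print(naive_pieces, aware_pieces)
```

With the naive rule the word is cut at each vowel sign and at the virama; with the mark-aware rule it stays in one piece.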

P.S.

  1. These are changes I made for my own use case; it would be nice to merge them into the main repo. If they aren't general enough, it's okay not to merge them, and I will maintain my own copy of the code :)
  2. I don't know how much the concatenated viramas and nuktas, and the extra protected patterns, hurt performance.

alvations commented 4 years ago

@thammegowda this is an AWESOME contribution!!

Thank you so much for adding the viramas and nuktas.

Other than the inclusion of __main__, I've added some comments on the code. Could you check them and make some light changes? After that we should be good to merge.

Regarding __main__, could I ask what it is used for and how you are currently using it? It might not be necessary, since click has already done the heavy lifting to "CLI-ize" the functions =)

I also have some questions about how :basic: and :web: might overlap. If you have any comments on that, please do suggest. Otherwise, I'll sit down and think it through a little with some tests on the regex sets.
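
One way to test for overlap is to run both pattern sets over the same sample text and compare the matched spans. The sketch below does that under loud assumptions: BASIC and WEB are hypothetical stand-ins written for illustration, not the actual :basic: and :web: sets in sacremoses, and matched_spans is a made-up helper.

```python
import re

# Hypothetical stand-ins; the real :basic: and :web: sets may differ.
EMAIL = r"[\w\-\.]+@(?:[\w\-]+\.)+[a-zA-Z]{2,}"
URL = r"(?:https?|ftp)://\S+"
BASIC = [EMAIL, URL]
WEB = [EMAIL, URL, r"(?<!\S)@\w+", r"(?<!\S)#\w+"]  # plus @user and #hashtag


def matched_spans(patterns, text):
    """Collect (start, end, matched_text) for every match of every pattern."""
    spans = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            spans.add((match.start(), match.end(), match.group()))
    return spans


text = "@user #hashtag user@host.com https://host.com?k1=v1&k2=v2"
basic_spans = matched_spans(BASIC, text)
web_spans = matched_spans(WEB, text)

both = {s[2] for s in basic_spans & web_spans}      # protected by either set
only_web = {s[2] for s in web_spans - basic_spans}  # protected only by the web set

print("both:", sorted(both))
print("only web:", sorted(only_web))
```

Spans reported under "both" are where the two sets are redundant; spans under "only web" are what the second set adds.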


Note to self: this resolves: #42

thammegowda commented 4 years ago

@alvations Thanks for the comments. I was unable to make any corrections to the Perl scripts (as I don't speak that language), but the Python implementation made it easy to modify and enhance. So, well done! And thanks again for your effort in porting Perl to Python!

thammegowda commented 4 years ago

The __main__.py enables the python -m sacremoses way of invoking the tool, and it is orthogonal to using click.

python -m sacremoses provides the flexibility of choosing whichever Python interpreter I want and of directly controlling the PYTHONPATH.
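
The mechanism can be sketched with a throwaway package. The names demo_pkg and run_as_module are hypothetical and exist only for this illustration; the point is that python -m runs the package's __main__.py using the chosen interpreter and whatever PYTHONPATH is set.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_as_module(package_name, search_path):
    """Run `python -m package_name` with PYTHONPATH pointing at search_path."""
    env = dict(os.environ, PYTHONPATH=search_path)
    result = subprocess.run(
        [sys.executable, "-m", package_name],  # any interpreter could be used here
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"  # hypothetical package, not sacremoses
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    # __main__.py is the file that `python -m demo_pkg` executes.
    (pkg / "__main__.py").write_text("print('hello from demo_pkg.__main__')\n")
    output = run_as_module("demo_pkg", tmp)

print(output)
```

In a real package, __main__.py would typically just import the existing command-line entry point and call it, so the -m route and the installed console script share the same code.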

Here is how the call gets routed to the click interface:

$ python -m sacremoses -h
Usage: __main__.py [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

$ python -m sacremoses tokenize -h
Usage: __main__.py tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                 tokenisation. Special values: :basic: :web:

  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.

  -h, --help                     Show this message and exit.

thammegowda commented 4 years ago

@alvations I thought this pull request was good to merge. Please let me know if you are waiting for anything from my side. Thanks

alvations commented 3 years ago

Thank you @thammegowda! Sorry for the very, very late reply; it has been a rough half year.