Closed thammegowda closed 3 years ago
@thammegowda this is an AWESOME contribution!!
Thank you so much for adding the viramas and nuktas.
Other than the inclusion of `__main__`, I've added some comments on the code. Could you check them and make some light changes? Then we should be good to merge.
Regarding `__main__`, could I ask what it is used for and how you are currently using it? It might not be necessary, since `click` has done the heavy lifting to "CLI-ize" the functions =)
I also have some questions about how :basic: and :web: might overlap. If you have any comments on that, please do suggest them. Otherwise, I'll sit down and think it through with some tests on the regex sets.
Note to self: this resolves: #42
@alvations Thanks for the comments. I was unable to make any corrections to the Perl scripts (as I don't speak that language), but the Python implementation made it easy to modify and enhance. So, well done! And thanks again for your effort in porting Perl to Python!
The `__main__.py` enables the `python -m sacremoses` way of invoking, and it is orthogonal to using `click`. `python -m sacremoses` provides the flexibility of choosing whichever `python` I want and directly controlling the `PYTHONPATH`.
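To make the `python -m` mechanism concrete, here is a minimal, self-contained sketch. It is not the sacremoses code: `demo_pkg` is a hypothetical throwaway package created in a temporary directory, and its `__main__.py` only echoes its arguments, whereas the real one presumably hands control to the `click` command group.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_package_as_module(package_name="demo_pkg"):
    """Create a throwaway package with a __main__.py and run it via `python -m`."""
    with tempfile.TemporaryDirectory() as repo_dir:
        pkg_dir = Path(repo_dir) / package_name
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        # __main__.py is the file that `python -m <package>` executes.
        (pkg_dir / "__main__.py").write_text(
            "import sys\n"
            "print('invoked with', sys.argv[1:])\n"
        )
        # PYTHONPATH points at the checkout, so nothing has to be installed.
        env = dict(os.environ, PYTHONPATH=repo_dir)
        result = subprocess.run(
            [sys.executable, "-m", package_name, "tokenize", "-h"],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


output = run_package_as_module()
print(output)
```

Swapping `sys.executable` for another interpreter path is what gives the "whichever python I want" flexibility, and the `PYTHONPATH` entry plays the role of `path/to/repo` during development.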
Here is how the call gets routed to the `click` interface:
$ python -m sacremoses -h
Usage: __main__.py [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

$ python -m sacremoses tokenize -h
Usage: __main__.py tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                 tokenisation. Special values: :basic: :web:
  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.
  -h, --help                     Show this message and exit.
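
As a rough illustration of what "protected patterns" means for the `-p` option in the help text above, here is a small sketch using only the standard `re` module. The regexes and the `naive_tokenize` helper are hypothetical and written for this example; they are not the patterns behind `:basic:` or `:web:` in sacremoses.

```python
import re

# Hypothetical patterns for illustration only; not the ones shipped with sacremoses.
WEB_PATTERNS = [
    r"https?://\S+",                  # URLs
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",  # e-mail addresses
    r"(?<!\w)[@#]\w+",                # @mentions and #hashtags
]
PROTECTED = re.compile("|".join(WEB_PATTERNS))


def naive_tokenize(text):
    """Split punctuation from words, but keep protected spans as single tokens."""
    tokens = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Tokenize the unprotected text before this match.
        tokens.extend(re.findall(r"\w+|[^\w\s]", text[pos:match.start()]))
        # Emit the protected span untouched.
        tokens.append(match.group())
        pos = match.end()
    # Tokenize whatever follows the last protected span.
    tokens.extend(re.findall(r"\w+|[^\w\s]", text[pos:]))
    return tokens


tokens = naive_tokenize(
    "Ping @user about #hashtag, mail user@host.com or see https://host.com?k1=v1&k2=v2"
)
print(tokens)
```

Without the protected spans, the same punctuation-splitting rule would break the URL and the e-mail address into many pieces, which is the behaviour the option is meant to avoid.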
@alvations I think this pull request is good to merge. Please let me know if you are waiting for something from my side. Thanks.
Thank you @thammegowda! Sorry for the very, very late reply; it has been a rough half a year.
Changes proposed:
- viramas and nuktas of Indian languages: don't butcher them
- `-p :basic:` to enable basic protected patterns
- `-p :web:` to enable protected patterns for web: @user #hashtag user@host.com https://host.com?k1=v1&k2=v2
- `__main__.py` so we can call `python -m sacremoses`; actually useful for development, like `PYTHONPATH=path/to/repo python -m sacremoses`
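
For readers unfamiliar with the virama and nukta issue, the sketch below shows why a tokenizer that treats only letters as word characters would butcher Devanagari text. It uses the standard `unicodedata` module; `split_words` is a hypothetical helper for this illustration and does not reflect how sacremoses implements the fix.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA
NUKTA = "\u093c"   # DEVANAGARI SIGN NUKTA


def split_words(text, keep_marks):
    """Group characters into words.

    Letters (category L*) always count as word characters; combining
    marks (category M*) count only when keep_marks is True.
    """
    words, current = [], ""
    for ch in text:
        category = unicodedata.category(ch)
        is_word_char = category.startswith("L") or (
            keep_marks and category.startswith("M")
        )
        if is_word_char:
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# "Hindi" written in Devanagari; it contains a virama between two consonants.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Both signs are nonspacing marks, not letters.
print(unicodedata.category(VIRAMA), unicodedata.category(NUKTA))
# Letters only: the word falls apart at every mark.
print(split_words(hindi, keep_marks=False))
# Letters plus marks: the word stays whole.
print(split_words(hindi, keep_marks=True))
```

With `keep_marks=False` the single word is split into three fragments; with `keep_marks=True` it survives as one token.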
P.S.