Looks like some of the early documentation might not have survived some recent changes. The command line works for this input.
$ scripts/sanskrit_parser sandhi matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH
Interpreting input loosely (strict_io set to false)
INFO Input String: matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH
INFO Input String in SLP1: matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH
Splits:
INFO Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'piSita', 'grAsa', 'eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'piSita', 'grAsa', 'Eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'piSita', 'grAsa', 'eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'pi', 'Sita', 'grAsa', 'eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'pi', 'Sita', 'grAsa', 'Eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'piSita', 'grAsA', 'eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'pi', 'Sita', 'grAsa', 'eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'pi', 'Sita', 'grAsA', 'eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'pi', 'Sita', 'grAsa', 'Eka', 'badDaspfhaH']
INFO Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'pi', 'Sita', 'grAsA', 'eka', 'badDaspfhaH']
sanskrit_parser.cmd_line.sandhi would be a better prototype for your code.
Some of the API documentation needs to be revisited.
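For illustration, a minimal sketch of such a prototype using the Python API (the Parser/ParseResult names follow the docs example quoted later in this thread; treat the exact constructor arguments as assumptions that may differ across versions):

```python
# Minimal sketch of a sandhi split via the Python API, modeled on what
# sanskrit_parser.cmd_line.sandhi does. The top-level Parser import and
# the output_encoding argument are assumptions based on the docs example;
# check them against your installed version.
from sanskrit_parser import Parser

parser = Parser(output_encoding='SLP1')
parse_result = parser.parse('matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH')
for split in parse_result.splits(max_splits=10):
    print(split)
```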
Thanks @kmadathil ! I must say, on the whole I'm pretty impressed/happy that the parser produced a good quality split for such a long pada!
I tried the command-line tool and it worked. But with the entire verse:
sanskrit_parser sandhi "kzutkzAmopijarAkfSopiSiTilaprAyopikazwAMdaSAmApannopivipannadIDitirapiprAyezunaSyatsvapi matteBendravipinnakumBapiSitagrAsEkabadDaspFhaHkiMjIrRaMtfRamattimAnamahatAmagresaraHkesarI"
ParseResult('kzutkzAmopijarAkfSopiSiTilaprAyopikazwAMdaSAmApannopivipannadIDitirapiprAyezunaSyatsvapi matteBendravipinnakumBapiSitagrAsEkabadDaspFhaHkiMjIrRaMtfRamattimAnamahatAmagresaraHkesarI')
Traceback (most recent call last):
File "/usr/local/bin/sanskrit_parser", line 8, in <module>
It failed. Not sure if there is an upper limit for the number of aksharas in the vAkya.
There is no upper limit in theory. We haven't tested for very long phrases. However, it should not crash in any case.
Spelling mistake :-) I corrected the verse, and it works. Nice choice of verse, @poojapi
$ scripts/sanskrit_parser sandhi "kzutkzAmopijarAkfSopiSiTilaprAyopikazwAMdaSAmApannopivipannadIDitirapiprAyezunaSyatsvapi matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaHkiMjIrRaMtfRamattimAnamahatAmagresaraHkesarI"
The first few splits aren't the best possible; there's a line of attack there - why is that happening? Secondly, after the API change, we do not seem to gracefully signal that no split exists. Thirdly, how do we avoid input errors like this? Should we output Devanagari as well so it can be easily checked?
@avinashvarna Thoughts?
@kmadathil Good catch! Part of the reason for the splits with 'U' etc. is that the MW dictionary (one of the pada sources we use) has a lot of these ekAkShara words. The lexical scorer produces a higher score for these compared to the longer words - likely a side effect of using sentencepiece in the scoring. Maybe a combined score that weights the number of words along with the lexical score could help, or improving the lexical scorer. In general, we need a better language model to score the splits.
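To make the idea concrete, a combined score along those lines might look like the following sketch (combined_score and the weight alpha are hypothetical names, not part of the package):

```python
# Hypothetical combined score: mix the lexical (sentencepiece-based)
# score with a penalty on the number of words, so that splits made of
# many ekAkShara words no longer beat splits with fewer, longer words.
def combined_score(split, lexical_score, alpha=0.1):
    # alpha controls how strongly shorter splits are preferred; both
    # the functional form and the default value are illustrative.
    return lexical_score - alpha * len(split)
```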
We need to fix the second issue and gracefully handle the case when no split exists.
I am not sure we could do much about the third one in this package. Outputting devanagari may help in cases where it is supported - but this doesn't work well in certain terminals which can't handle unicode characters. The web interface outputs devanagari already, if I am not wrong. Maybe as part of gracefully handling the case where no splits are found, we print a message asking the user to carefully check the input for typos/errors?
Regarding a better approach for scoring, we already have #93 open to discuss options.
I'll open another issue for gracefully handling the no-splits-available case.
Carefully checking isn't that easy, as we found here. I didn't notice the spelling error initially, and had to resort to good old binary search to finally track it down :-) Perhaps an option to output Devanagari instead of SLP1 would help, which could be turned on if the terminal supports it.
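As a sketch of what that option could do, using the indic_transliteration package (a separate dependency; whether to pull it in, and the CLI flag to enable it, are open questions):

```python
# Echo the SLP1 input back in Devanagari so typos are easier to spot.
# indic_transliteration is not currently wired into sanskrit_parser's
# CLI; this just shows the conversion itself.
from indic_transliteration import sanscript

slp1_text = 'matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH'
print(sanscript.transliterate(slp1_text, sanscript.SLP1, sanscript.DEVANAGARI))
```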
As you say, we may need to promote the web version more, but I hesitate to do that because the backend is usually out of date. I'd started work on creating an AWS Lambda backend, but haven't completed it.
I want to test it. How do I do that?
Closing based on #129
@gasyoun - presuming you want to test the vakya analyzer: it is preliminary right now and can handle simple sentences. The best way to test it, while we get the web interface updated, is to install it with pip install sanskrit_parser, and run it as described in the docs page https://kmadathil.github.io/sanskrit_parser/build/html/
I'll also open an issue for the web interface, and tag it "help wanted". Maybe some new contributor can help us with this.
I've opened #134
I tried the following as a sentence - "matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH" - but I get the following error:
Traceback (most recent call last):
  File "sandhi_split_v2.py", line 14, in <module>
    for split in parse_result.splits(max_splits=10):
  File "/usr/local/lib/python3.7/site-packages/sanskrit_parser/api.py", line 119, in splits
    split_results = self.sandhi_graph.find_all_paths(max_paths=max_splits,
AttributeError: 'NoneType' object has no attribute 'find_all_paths'
Here, my sandhi_split_v2.py is taken from https://kmadathil.github.io/sanskrit_parser/build/html/sanskrit_parser_api.html
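Until the no-splits case is handled gracefully in the library, one workaround is to guard the call; a sketch, assuming the docs example's API (the AttributeError above comes from sandhi_graph being left as None when no split is found):

```python
# Defensive wrapper around the docs example: parse() currently leaves
# sandhi_graph as None when no split is found, so splits() raises
# AttributeError instead of signalling "no split" cleanly.
from sanskrit_parser import Parser

parser = Parser()
parse_result = parser.parse('matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH')
try:
    for split in parse_result.splits(max_splits=10):
        print(split)
except AttributeError:
    print('No valid split found - check the input for typos')
```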