kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License

Vakya Analyser Error #126

Closed: parthsad closed this issue 4 years ago

parthsad commented 4 years ago

I tried the following as a sentence - "matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH" - but I am getting the following error:

Traceback (most recent call last):
  File "sandhi_split_v2.py", line 14, in <module>
    for split in parse_result.splits(max_splits=10):
  File "/usr/local/lib/python3.7/site-packages/sanskrit_parser/api.py", line 119, in splits
    split_results = self.sandhi_graph.find_all_paths(max_paths=max_splits,
AttributeError: 'NoneType' object has no attribute 'find_all_paths'

Here, my sandhi_split_v2.py is taken from https://kmadathil.github.io/sanskrit_parser/build/html/sanskrit_parser_api.html

kmadathil commented 4 years ago

Looks like some of the early documentation might not have survived some recent changes. The command line works for this input.

$ scripts/sanskrit_parser sandhi matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH
Interpreting input loosely (strict_io set to false)
INFO     Input String: matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH
INFO     Input String in SLP1: matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH
Splits:
INFO     Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'piSita', 'grAsa', 'eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'piSita', 'grAsa', 'Eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'piSita', 'grAsa', 'eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'pi', 'Sita', 'grAsa', 'eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'pi', 'Sita', 'grAsa', 'Eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'piSita', 'grAsA', 'eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'pi', 'Sita', 'grAsa', 'eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'viBinna', 'kumBa', 'pi', 'Sita', 'grAsA', 'eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'pi', 'Sita', 'grAsa', 'Eka', 'badDaspfhaH']
INFO     Split: ['matteBa', 'indra', 'vi', 'Binna', 'kumBa', 'pi', 'Sita', 'grAsA', 'eka', 'badDaspfhaH']

sanskrit_parser.cmd_line.sandhi would be a better prototype for your code.
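
In the meantime, a minimal standalone script along those lines might look like this (a sketch only, assuming the Parser/parse/splits API shown on that docs page; the try/except is just a workaround for the crash above until the no-split case is handled gracefully):

from sanskrit_parser import Parser

parser = Parser()  # default options; see the docs page for encoding settings
parse_result = parser.parse('matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH')

try:
    for split in parse_result.splits(max_splits=10):
        print(split)
except AttributeError:
    # The current release raises this when no split graph could be built
    # (as in the traceback above); treat it as "no valid splits found".
    print('No valid splits found - please check the input for typos')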

kmadathil commented 4 years ago

Some of the API documentation needs to be revisited.

avinashvarna commented 4 years ago

Thanks @kmadathil! I must say, on the whole I'm pretty impressed/happy that the parser produced a good-quality split for such a long pada!

parthsad commented 4 years ago

I tried the command-line tool and it worked. But with the entire verse:

sanskrit_parser sandhi "kzutkzAmopijarAkfSopiSiTilaprAyopikazwAMdaSAmApannopivipannadIDitirapiprAyezunaSyatsvapi matteBendravipinnakumBapiSitagrAsEkabadDaspFhaHkiMjIrRaMtfRamattimAnamahatAmagresaraHkesarI"

ParseResult('kzutkzAmopijarAkfSopiSiTilaprAyopikazwAMdaSAmApannopivipannadIDitirapiprAyezunaSyatsvapi matteBendravipinnakumBapiSitagrAsEkabadDaspFhaHkiMjIrRaMtfRamattimAnamahatAmagresaraHkesarI')

Traceback (most recent call last):
  File "/usr/local/bin/sanskrit_parser", line 8, in <module>
    sys.exit(cmd_line())
  File "/usr/local/lib/python3.7/site-packages/sanskrit_parser/cmd_line.py", line 219, in cmd_line
    eval(getattr(args, 'command')+"(rest)")
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/sanskrit_parser/cmd_line.py", line 116, in sandhi
    for si, split in enumerate(parse_result.splits(max_splits=args.max_paths)):
  File "/usr/local/lib/python3.7/site-packages/sanskrit_parser/api.py", line 119, in splits
    split_results = self.sandhi_graph.find_all_paths(max_paths=max_splits,
AttributeError: 'NoneType' object has no attribute 'find_all_paths'

It failed. I'm not sure if there is an upper limit on the number of aksharas in the vAkya.

kmadathil commented 4 years ago

There is no upper limit in theory, though we haven't tested with very long phrases. In any case, it should not crash.

kmadathil commented 4 years ago

Spelling mistake :-) The input has 'vipinna' where the verse reads 'viBinna' (and 'spFhaH' for 'spfhaH'). I corrected the verse, and it works. Nice choice of verse, @poojapi

$ scripts/sanskrit_parser sandhi "kzutkzAmopijarAkfSopiSiTilaprAyopikazwAMdaSAmApannopivipannadIDitirapiprAyezunaSyatsvapi matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaHkiMjIrRaMtfRamattimAnamahatAmagresaraHkesarI"

The first few splits aren't the best possible - there's a line of attack there: why is that happening? Secondly, after the API change, we do not seem to gracefully signal that no split exists. Thirdly, how do we avoid input errors like this? Should we output Devanagari as well, so the input can be easily checked?

@avinashvarna Thoughts?

avinashvarna commented 4 years ago

@kmadathil Good catch! Part of the reason for the splits with 'U' etc. is that the MW dictionary (one of the pada sources we use) has a lot of these ekAkShara words. The lexical scorer produces a higher score for these than for the longer words - likely a side-effect of using sentencepiece in the scoring. Maybe a combined score that weights the number of words along with the lexical score could help (see the sketch below), or improving the lexical scorer itself. In general, we need a better language model to score the splits.
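
To illustrate that idea only (the function, the weight, and its value are hypothetical, not how the current scorer works):

def combined_score(split, lexical_score, length_penalty=0.1):
    # Hypothetical combination: penalize splits with many (often ekAkShara)
    # words so the lexical score alone does not dominate the ranking.
    # length_penalty is a made-up tuning parameter for illustration.
    return lexical_score - length_penalty * len(split)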

We need to fix the second issue and gracefully handle the case when no split exists.

I am not sure we can do much about the third one in this package. Outputting Devanagari may help where it is supported, but it doesn't work well in terminals that can't handle Unicode characters. The web interface already outputs Devanagari, if I am not wrong. Maybe, as part of gracefully handling the case where no splits are found, we print a message asking the user to carefully check the input for typos/errors?
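
As a rough sketch of that graceful handling (placeholder code, not the actual api.py implementation), the splits() method could guard the missing graph instead of crashing:

import logging
logger = logging.getLogger(__name__)

def splits(self, max_splits=10):
    # Placeholder sketch: if sandhi analysis produced no graph, warn and
    # return an empty list instead of raising AttributeError on None.
    if self.sandhi_graph is None:
        logger.warning('No valid splits found - please check the input for typos/errors')
        return []
    ...  # the existing path-finding logic would continue here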

avinashvarna commented 4 years ago

Regarding a better approach for scoring, we already have #93 open to discuss options

kmadathil commented 4 years ago

I'll open another issue for gracefully handling the no-splits-available case.

Carefully checking isn't that easy, as we found here. I didn't notice the spelling error initially, and had to resort to good old binary search over the input to finally figure out where it was :-) Perhaps an option to output Devanagari instead of SLP1 would help; it could be turned on when the terminal supports it.
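
For instance, just echoing the input back in Devanagari makes the typo much easier to spot (a sketch using the indic_transliteration package; this is not something sanskrit_parser does today):

from indic_transliteration import sanscript

slp1_input = 'matteBendravipinnakumBapiSitagrAsEkabadDaspFhaH'
# Seeing the Devanagari form makes typos such as 'vipinna' (for 'viBinna')
# much easier to catch by eye than scanning the raw SLP1 string.
print(sanscript.transliterate(slp1_input, sanscript.SLP1, sanscript.DEVANAGARI))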

As you say, we may need to promote the web version more, but I hesitate to do that because the backend is usually out-of-date. I'd started work on creating an AWS Lambda backend, but haven't completed it.

gasyoun commented 4 years ago

I want to test it. How to?

kmadathil commented 4 years ago

Closing based on #129

kmadathil commented 4 years ago

@gasyoun - presuming you want to test the vakya analyzer: it is preliminary right now and can handle simple sentences. The best way to test it, while we get the web interface updated, is to install it with pip install sanskrit_parser and run it as described on the docs page: https://kmadathil.github.io/sanskrit_parser/build/html/
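
For example (the install command is as above; the sandhi invocation mirrors the one used earlier in this thread, and the docs page describes the vakya-level commands):

$ pip install sanskrit_parser
$ sanskrit_parser sandhi matteBendraviBinnakumBapiSitagrAsEkabadDaspfhaH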

avinashvarna commented 4 years ago

I'll also open an issue for the web interface, and tag it "help wanted". Maybe some new contributor can help us with this.

kmadathil commented 4 years ago

I've opened #134