kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License
69 stars, 21 forks

Bhagavad Gita 1.1 returns no parse results: #182

Open akprasad opened 2 years ago

akprasad commented 2 years ago

Minimal example:

from sanskrit_parser import Parser

def api_example(string, output_encoding):
    buf = []
    parser = Parser(output_encoding=output_encoding)
    buf.append('Splits:')
    for split in parser.split(string, limit=2):
        buf.append(f'Lexical Split: {split}')
        for i, parse in enumerate(split.parse(limit=2)):
            buf.append(f'Parse {i}')
            buf.append(f'{parse}')
    return '\n'.join(buf)

for phrase in [
    'Darmakzetre kurukzetre samavetA yuyutsavaH',
    'mAmakAH pARqavAScEva kimakurvata saMjaya',
    ]:
    resp = api_example(phrase, 'slp1')
    print(resp)

Each phrase yields lexical splits but no parses. The output for the first phrase is:

Splits:
Lexical Split: ['Darmakzetre', 'kurukzetre', 'samavetAH', 'yuyutsavaH']
Lexical Split: ['Darmakzetre', 'kurukzetre', 'samavetA', 'yuyutsavaH']

And for the second phrase:

Splits:
Lexical Split: ['mAmakAH', 'pARqavAH', 'cA', 'Eva', 'kim', 'akurvata', 'saYjaya']
Lexical Split: ['mAmakAH', 'pARqavAH', 'ca', 'eva', 'kim', 'akurvata', 'saYjaya']

I'm not sure what I'm doing wrong here -- would appreciate any help you can provide.

akprasad commented 2 years ago

Also, I get around 1.8 seconds per verse:


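import time

num_trials = 20
# The two phrases together make up one verse, so dividing the total
# elapsed time by num_trials gives the time per verse.
phrases = [
    'Darmakzetre kurukzetre samavetA yuyutsavaH',
    'mAmakAH pARqavAScEva kimakurvata saMjaya',
]
start = time.perf_counter()
for phrase in phrases * num_trials:
    api_example(phrase, 'slp1')
end = time.perf_counter()
print((end - start) / num_trials)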

Is there anything we can do to improve performance here? Ideally I'd like around 100ms per verse.
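Note that api_example constructs a new Parser on every call, so the figure above includes construction as well as splitting and parsing. The thread does not say how much each part costs. A minimal sketch of a helper that times any callable per call, so the pieces can be measured separately (mean_seconds is a hypothetical name, not part of sanskrit_parser):

```python
import time


def mean_seconds(fn, inputs, repeats=1):
    """Return the mean wall-clock seconds per call of fn(item).

    fn is called once per item in inputs, and the whole pass is
    repeated `repeats` times.
    """
    inputs = list(inputs)
    start = time.perf_counter()
    for _ in range(repeats):
        for item in inputs:
            fn(item)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(inputs))


# Hypothetical usage, assuming the Parser from the example above is
# built once, outside the timed loop:
#   parser = Parser(output_encoding='slp1')
#   per_phrase = mean_seconds(lambda s: parser.split(s, limit=2), phrases, repeats=20)
```

This reports time per phrase (half-verse), not per verse, so the result should be doubled before comparing it with the number above.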

kmadathil commented 2 years ago

@akprasad Arun, thanks for reporting this. Here is what's going on:

  1. We need a verb of some sort to anchor the parse. Since samavetAH is a kta form, it should qualify, so we should be able to generate a parse (though not the correct one without the rest of the sentence). However, I see that samaveta isn't tagged as a kta in the dictionary. I validated this by parsing Darmakzetre kurukzetre samAgatA yuyutsavaH instead.
  2. I will dig into why the second part isn't being parsed.
  3. Ideally, we should parse the entire sentence at once, 'Darmakzetre kurukzetre samavetA yuyutsavaH mAmakAH pARqavAScEva kimakurvata saMjaya', to get the correct parse. This is being held up by whatever is blocking item 2.

    I will post an update after digging further.

kmadathil commented 2 years ago

'Darmakzetre kurukzetre samavetAH yuyutsavaH kim akurvata saMjaya' now parses correctly, and has been added to the test suite.

This takes roughly 400 ms.

kmadathil commented 2 years ago

Hi Arun @akprasad

Can we get on a conf call this weekend to discuss use modes? (It would be good if @avinashvarna can join too.) I can demo the UI to you so you can switch to that (or our command line) from the API.

The flow I have in mind (which is what our UI does) is a two-step process:

  1. Split sandhis, and let the user pick the right split from an ordered list
  2. Parse the sentence using the chosen sandhi split.

    I would like to understand your perspective on how you see yourself using this.
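The two-step flow described above could be sketched against the API roughly as follows. This is a sketch only, assuming a parser object with the split(string, limit=...) method and split objects with the parse(limit=...) method used in the minimal example at the top of this issue. split_then_parse and choose are hypothetical names, with choose standing in for the user picking a split from the list:

```python
def split_then_parse(parser, text, choose=lambda splits: splits[0],
                     split_limit=10, parse_limit=2):
    """Two-step flow: pick one sandhi split, then parse only that split.

    parser      -- object exposing split(text, limit=...), as in the example
    choose      -- callback that selects one split from the candidate list
                   (hypothetical stand-in for the user's choice in a UI)
    Returns (chosen_split, parses); (None, []) if there are no splits.
    """
    # Step 1: collect the candidate splits; treat a None result as "no splits".
    splits = list(parser.split(text, limit=split_limit) or [])
    if not splits:
        return None, []
    chosen = choose(splits)

    # Step 2: parse only the split that was picked.
    parses = list(chosen.parse(limit=parse_limit) or [])
    return chosen, parses
```

With the real library this might be called as split_then_parse(Parser(output_encoding='slp1'), phrase), with choose replaced by whatever selection mechanism the caller has. Parsing a single chosen split avoids running the parser on every candidate.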

akprasad commented 2 years ago

Sure, let's sync over email.