kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License

Add Sandhikosh to testing #153

Open avinashvarna opened 3 years ago

avinashvarna commented 3 years ago

Recent publications use the sandhikosh described in this paper as a benchmark. Let's add it to our testing and see where we stand.

(Related to #84)

kmadathil commented 3 years ago

Since this subsumes UoHD, I think we can make this our primary test corpus for sandhi.

We need to find a corpus for parsing.

kmadathil commented 3 years ago

I can see some erroneous spaces (which we can remove programmatically) and some clearly bad splits in SandhiKosh.
It also leaves some samAsas unsplit (which we split) and usually does not split upasargas (which we also split).

On the BhagavadGitA corpus I counted 881 passes and 549 failures with no edits, and 1002 passes and 428 failures after automated edits to remove spaces.
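A minimal sketch of what such an automated cleanup could look like. The helper name `normalize_expected` is hypothetical, and treating an internal space as a word boundary is an assumption; the actual edits used for the numbers above may differ.

```python
def normalize_expected(tokens):
    """Clean a reference split: drop stray whitespace, split on internal spaces."""
    cleaned = []
    for token in tokens:
        # str.split() with no argument discards leading/trailing whitespace
        # and breaks the token at any internal run of whitespace.
        cleaned.extend(token.split())
    return cleaned


# Reference split with a trailing space, as seen in kosh_entry959.
expected = ['guruRA', 'api ']
# First few candidate splits reported for the same entry.
candidates = [['guruRA', 'api'], ['guruRA', 'pi'], ['guruRA', 'Api']]

print(expected in candidates)                      # raw comparison
print(normalize_expected(expected) in candidates)  # comparison after cleanup
print(normalize_expected(['AtmAnam paSyan']))      # internal space handled too
```

The raw comparison fails only because of the trailing space, so this kind of cleanup removes failures that say nothing about the splitter itself.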

Some samples showing issues in the SandhiKosh:

FAILED test_SandhiKosh.py::test_file_splits[kosh_entry949] - AssertionError: assert ['cittam', 'nirudDam', 'yogasevayA '] in [['cittam', 'nirudDam', 'yoga', 'sevayA'], ['cit', 'tat...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry950] - AssertionError: assert ['ca', 'eva', 'AtmanA', 'AtmAnam paSyan', 'Atmani'] in [['ca', 'eva', 'AtmanA', 'AtmAnam', 'paSy...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry952] - AssertionError: assert ['budDigrAhyam', 'ati', 'indriyam '] in [['budDi', 'grAhyam', 'atIndriyam'], ['budDi', 'grAhyam'...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry954] - AssertionError: assert ['sTitaH', 'calatitattvataH'] in [['sTitaH', 'calati', 'tat', 'tu', 'ataH'], ['sTitaH', 'calati'...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry959] - AssertionError: assert ['guruRA', 'api '] in [['guruRA', 'api'], ['guruRA', 'pi'], ['guruRA', 'Api'], ['guru', 'Ra', 'a...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry960] - AssertionError: assert ['tam', 'vidyAt', 'duHKasaMyogaviyogam', 'yogasaYjYitam '] in [['tam', 'vidyAt', 'duHKa', 'saMyo...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry962] - AssertionError: assert ['yoktavyaH', 'yogaH', 'anirviRRacetasA'] in [['yoktavyaH', 'yogaH', 'asni', 'ru', 'iw', 'Ra', ....
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry963] - AssertionError: assert ['sam', 'kalpapraBavAn', 'kAmAn', 'tyaktvA '] in [['sam', 'kalpa', 'praBavAn', 'kAmAn', 'tyaktvA...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry966] - AssertionError: assert ['samam', 'tataH '] in [['samam', 'tataH'], ['samantataH'], ['samam', 'tat', 'aH'], ['samam', 't...
FAI
kmadathil commented 3 years ago

Take a look at branch multigraph, tests/SandhiKosh. manual_test.py runs the tests and writes the output to Results.xls. I've run 1000 tests so far, with 622 passes. I will run the full dataset next.

kmadathil commented 3 years ago

Updated - 11080 Tests: 8413 Passed, 1232 Failed, 1430 No_Split, 5 Bad tests
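For reference, a rough sketch of how entries could be sorted into these four buckets. The bucket definitions here are assumptions (No_Split taken to mean the parser returned nothing, Bad test taken to mean the entry has no usable reference split), and `classify` and `tally` are hypothetical names, not the code in manual_test.py.

```python
from collections import Counter


def classify(expected, candidates):
    """Assign one test entry to a result bucket (bucket meanings are assumed)."""
    if not expected:
        return "Bad test"   # assumed: no usable reference split in the entry
    if not candidates:
        return "No_Split"   # assumed: the parser returned no split at all
    return "Passed" if expected in candidates else "Failed"


def tally(entries):
    """Count buckets over (expected, candidates) pairs."""
    return Counter(classify(exp, cands) for exp, cands in entries)


# Tiny illustrative input, one entry per bucket.
entries = [
    (['guruRA', 'api'], [['guruRA', 'api'], ['guruRA', 'pi']]),
    (['samam', 'tataH'], [['samantataH']]),
    (['samam', 'tataH'], []),
    ([], [['samam', 'tataH']]),
]
counts = tally(entries)
print(dict(counts))

# Sanity check on the reported figures: the buckets add up to the total.
reported = {"Passed": 8413, "Failed": 1232, "No_Split": 1430, "Bad test": 5}
assert sum(reported.values()) == 11080
```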

kmadathil commented 3 years ago

Going by the SandhiKosh paper, we are already better than the best result they report (INRIA) for the subset that I ran (BG, Literature, External, UoH).

avinashvarna commented 3 years ago

That's quite impressive! Thanks for adding this. We can look at the failed ones to understand what's happening. I will try to spend some time on it this weekend.

kmadathil commented 3 years ago

Two big sources of discrepancy: SandhiKosh doesn't split some samAsas (which we do) and usually does not split upasargas, which we also do and, IMO, should. Both of these are proper pada boundaries.
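One way to separate this kind of discrepancy from genuine failures is a lenient comparison that accepts a candidate when it is a finer split of the reference. The sketch below is only an illustration: `is_refinement` is a hypothetical helper, and it joins tokens by plain concatenation, so it ignores any sandhi at the join.

```python
def is_refinement(expected, candidate):
    """True if adjacent candidate tokens concatenate to each expected token, in order."""
    i = 0
    for word in expected:
        built = ""
        # Consume candidate tokens until we have at least as many characters
        # as the expected word.
        while i < len(candidate) and len(built) < len(word):
            built += candidate[i]
            i += 1
        if built != word:
            return False
    # Every candidate token must have been used.
    return i == len(candidate)


# samAsa left unsplit in the reference (kosh_entry949, trailing space removed).
print(is_refinement(['cittam', 'nirudDam', 'yogasevayA'],
                    ['cittam', 'nirudDam', 'yoga', 'sevayA']))

# Sandhi at the join (ati + indriyam vs atIndriyam) is not handled.
print(is_refinement(['budDigrAhyam', 'ati', 'indriyam'],
                    ['budDi', 'grAhyam', 'atIndriyam']))
```

The first case is accepted and the second is rejected, which shows the limit of plain concatenation: upasarga boundaries that involve a vowel change would need the sandhi rules applied before comparing.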

kmadathil commented 3 years ago

This is where we stand on passes:

| Corpus               | Total | JNU |  UoH | INRIA | sanskrit_parser |
|----------------------|-------|-----|------|-------|-----------------|
| Rule based- Internal |   150 |  10 |   27 |     3 |              14 |
| Rule based- External |   132 |  22 |   48 |    38 |              41 |
| Literature           |   150 |  13 |   98 |   101 |              66 |
| Bhagavad-gita        |  1430 |  67 |  650 |   962 |            1002 |
| UoH                  |  9368 | 934 | 6393 |  6490 |            7304 |
| Ashtadhyayi          |  2700 |  18 |  263 |   510 |             616 |
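Since the corpora differ a lot in size, pass rates are easier to compare than raw counts. A small sketch that derives them from the table above; the numbers are copied from the table, and the helper names `pass_rate` and `best_system` are just for illustration.

```python
# Total entries per corpus, from the table.
totals = {
    "Rule based- Internal": 150,
    "Rule based- External": 132,
    "Literature": 150,
    "Bhagavad-gita": 1430,
    "UoH": 9368,
    "Ashtadhyayi": 2700,
}

# Passes per corpus and system, from the table.
passes = {
    "Rule based- Internal": {"JNU": 10, "UoH": 27, "INRIA": 3, "sanskrit_parser": 14},
    "Rule based- External": {"JNU": 22, "UoH": 48, "INRIA": 38, "sanskrit_parser": 41},
    "Literature": {"JNU": 13, "UoH": 98, "INRIA": 101, "sanskrit_parser": 66},
    "Bhagavad-gita": {"JNU": 67, "UoH": 650, "INRIA": 962, "sanskrit_parser": 1002},
    "UoH": {"JNU": 934, "UoH": 6393, "INRIA": 6490, "sanskrit_parser": 7304},
    "Ashtadhyayi": {"JNU": 18, "UoH": 263, "INRIA": 510, "sanskrit_parser": 616},
}


def pass_rate(corpus, system):
    """Fraction of entries in a corpus that a system splits as the reference does."""
    return passes[corpus][system] / totals[corpus]


def best_system(corpus):
    """System with the most passes on a corpus."""
    return max(passes[corpus], key=passes[corpus].get)


for corpus in totals:
    rates = ", ".join(
        f"{system} {pass_rate(corpus, system):.1%}" for system in passes[corpus]
    )
    print(f"{corpus}: {rates} (best: {best_system(corpus)})")
```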

One more issue with the "Internal" set: the reference splits sometimes use a visarga and sometimes स्, for example:

कोऽसिचत् | कस्+असिचत्

वृक्षश्शेते | वृक्षः+शेते
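If we want the test to be insensitive to this, one option is to normalize the reference before comparing. A minimal sketch, assuming it is acceptable to rewrite a word-final स् as a visarga in every part of the split; `normalize_final_s` is a hypothetical helper, not something in the test suite.

```python
S_VIRAMA = "\u0938\u094d"  # स followed by virama, i.e. स्
VISARGA = "\u0903"         # ः


def normalize_final_s(split_text):
    """Rewrite a final स् as a visarga in each '+'-separated part of a split."""
    parts = split_text.split("+")
    fixed = [
        part[: -len(S_VIRAMA)] + VISARGA if part.endswith(S_VIRAMA) else part
        for part in parts
    ]
    return "+".join(fixed)


print(normalize_final_s("कस्+असिचत्"))
print(normalize_final_s("वृक्षः+शेते"))
```

After normalization both reference entries use the visarga form, so the two styles compare the same way.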

avinashvarna commented 3 years ago

I am not sure internal sandhi was a targeted use case. Ditto for AshtadhyayI. After all, the pratyayas and various terms in the sutras wouldn't be in any of the standard dictionaries. This probably explains the somewhat poor performance on those.

Is the lower performance on the literature category attributable to the two differences you mentioned before (splitting samasas and upasargas)?

kmadathil commented 3 years ago

Internal sandhi includes upasargas, which we do fine at (barring special cases).

The literature case seems to be mostly test problems. On a casual look, it seems the input is often incompletely split in the test.

kmadathil commented 3 years ago

The test is now in, but we still need to scrub the failures. Adding this comment to state the remaining task.

The task is