kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License

Metrics for evaluating performance of lexical/morphological analyzer #84

Open avinashvarna opened 6 years ago

avinashvarna commented 6 years ago

Need to develop metrics for evaluating the performance of the analyzers. This would be useful if we were trying to choose between databases for looking up tags, or between different approaches for lexical/morphological analysis.

From https://github.com/kmadathil/sanskrit_parser/issues/82#issuecomment-356168883

Perhaps precision can be defined as the % pass rate in the UoHD test suite. Perhaps recall would mean some sort of check that all the reported splits result in the input sentence after a join?
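The pass-rate and rejoin-check metrics suggested above could be computed along these lines (the `results` structure is hypothetical, purely for illustration):

```python
def precision_recall(results):
    """Compute the precision/recall-style metrics suggested above.

    `results` is a hypothetical list of dicts, one per test sentence:
      - 'passed':  the reference split was among the analyzer's outputs
      - 'rejoins': every reported split rejoins to the input sentence
    """
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    rejoin_ok = sum(1 for r in results if r["rejoins"])
    precision = passed / total if total else 0.0
    recall = rejoin_ok / total if total else 0.0
    return precision, recall

# Toy example: one of two sentences passes, both rejoin cleanly.
p, r = precision_recall([
    {"passed": True, "rejoins": True},
    {"passed": False, "rejoins": True},
])
```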

This would be a good start. Currently we do not pay much attention to the number of passes/failures in the test suite. My concern is that the UoHD dataset entries are not broken down to simple roots, and we are using the splitter to split them until we get words that are in the db (as discussed before - https://github.com/kmadathil/sanskrit_parser/issues/19#issuecomment-315433433). I am not sure that this will give us an accurate representation of the performance.

We should start looking into the DCS database to see if it is more appropriate. E.g. for the Level 1 database/tag lookups, we could perhaps just start with the roots provided in the DCS database and see how many are identifiable using the level 1/tag lookup db. We can then start building the tests up to the lexical/morphological levels.

avinashvarna commented 6 years ago

First results from a "quick and dirty" script I wrote to evaluate word lookup accuracy (recall, if you will): The script goes through the DCS database, and for every word tagged as a single word (i.e. no samAsa/sandhi), it checks if the word is recognized as a valid word by the two level 1 lookup options.

```
Inria lookup recognized 1447362 out of 2333485 words
Sanskrit data recognized 1735547 out of 2333485 words
```
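The counting loop behind these numbers might look like the following sketch (the `lookup` callable and word list are stand-ins; the actual script iterates the DCS database and queries the two level 1 lookup options):

```python
def lookup_recall(words, lookup):
    """Count how many single words a level-1 lookup recognizes.

    `words`:  iterable of surface forms tagged as single words
              (i.e. no samAsa/sandhi) in a tagged corpus such as DCS
    `lookup`: callable returning True if the word is in the database
    Both are hypothetical stand-ins for the real DCS/lookup APIs.
    """
    recognized = total = 0
    for w in words:
        total += 1
        if lookup(w):
            recognized += 1
    return recognized, total

# Toy example with a tiny in-memory "database".
vocab = {"naraH", "nArI"}
rec, tot = lookup_recall(["naraH", "nArI", "avidyamAnapadam"], vocab.__contains__)
```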

At a first pass, it looks like the sanskrit data-based lookup recognized about 300k more words. I think it is definitely worthwhile to move to it. As we incorporate more and more of the Inria db into it, it will always be the better choice from a recall perspective.

It may look like the overall accuracy is quite low, but there are two mitigating factors:

Next steps:

I will clean up my "quick and dirty" script to make it more amenable for the next steps and check it in by the weekend.

avinashvarna commented 6 years ago

I have added some metrics for word level accuracy on the sanskrit_util branch here - https://github.com/kmadathil/sanskrit_parser/tree/sanskrit_util/metrics

I have also started working on evaluating lexical split accuracy using the dataset as part of the project referred to in #85. I am currently planning to use the BLEU score or chrF score (from the machine translation literature) to evaluate the accuracy of these splits. Please let me know if there are any other ideas for evaluating accuracy.
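For reference, chrF compares character n-grams between a hypothesis split and the reference. A simplified from-scratch sketch (not the official implementation; real chrF averages over n = 1..6 and uses beta = 2 to weight recall) could look like:

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: a character n-gram F-score.

    Averages n-gram precision and recall over n = 1..max_n, then
    combines them with an F-beta score (beta=2 weights recall higher).
    Whitespace is stripped so that only the characters of the split
    words matter; this is a sketch, not the canonical implementation.
    """
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    precs, recs = [], []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        if hyp_ngrams:
            precs.append(overlap / sum(hyp_ngrams.values()))
        if ref_ngrams:
            recs.append(overlap / sum(ref_ngrams.values()))
    p = sum(precs) / len(precs) if precs else 0.0
    r = sum(recs) / len(recs) if recs else 0.0
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An exact match scores 1.0, and a split differing by a character or two is penalized smoothly rather than scored as a hard miss, which is the main attraction over exact-match accuracy.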

kmadathil commented 6 years ago

I concur

avinashvarna commented 6 years ago

Scripts for evaluating lexical split accuracy added to scoring branch here - https://github.com/kmadathil/sanskrit_parser/blob/scoring/metrics/lexical_split_scores.py

codito commented 6 years ago

Adding a use case below where scoring may help choose the best split. Can the tool choose [kaH, cit, naraH, vA, nArI] as the best output?

```
> python -m sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer "kaScit naraH vA nArI" --debug --split
Input String: kaScit naraH vA nArI
Input String in SLP1: kaScit naraH vA nArI
Start Split
End DAG generation
End pathfinding 1527393212.680358
Splits:
[kaH, cit, naraH, vAna, arI]
[kaH, cit, naraH, vAH, nArI]
[kaH, cit, naraH, vA, nArI]
[kaH, cit, na, raH, vAna, arI]
[kaH, cit, naraH, vAH, na, arI]
[kaH, cit, naraH, vA, na, arI]
[kaH, cit, na, raH, vAH, nArI]
[kaH, cit, naraH, vA, AnA, arI]
[kaH, cit, na, raH, vA, nArI]
[kaH, cit, naraH, vA, A, nArI]
-----------
Performance
Time for graph generation = 0.024774s
Total time for graph generation + find paths = 0.032885s
```
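One simple way to rank splits like the ones above is a unigram cost model: score each candidate by the summed negative log relative frequency of its words, so both rare words and unnecessarily long splits are penalized. The frequency counts below are invented for illustration, not taken from any real corpus:

```python
import math

# Illustrative frequency counts (hypothetical, not from a real corpus).
FREQ = {"kaH": 900, "cit": 800, "naraH": 700, "vA": 1000, "nArI": 600,
        "na": 950, "raH": 5, "vAna": 3, "arI": 4, "vAH": 2, "AnA": 1, "A": 20}

def split_cost(split):
    """Lower is better: sum of -log(relative frequency) over the words.

    Rare words get a large cost, and every extra word adds cost, so
    short splits made of common words win. Unknown words are smoothed
    with a small pseudo-count of 0.5.
    """
    total = sum(FREQ.values())
    return sum(-math.log(FREQ.get(w, 0.5) / total) for w in split)

splits = [
    ["kaH", "cit", "naraH", "vAna", "arI"],
    ["kaH", "cit", "naraH", "vAH", "nArI"],
    ["kaH", "cit", "naraH", "vA", "nArI"],
    ["kaH", "cit", "naraH", "vA", "na", "arI"],
]
best = min(splits, key=split_cost)
```

With these (made-up) frequencies the expected split [kaH, cit, naraH, vA, nArI] comes out cheapest, since every word in it is common and the alternatives contain rare fragments like vAna or vAH.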
drdhaval2785 commented 6 years ago

I worked a lot on this problem, and can vouch that the approach in https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words/11642687 is the best solution around.

All we need is a frequency count for lexemes. https://github.com/drdhaval2785/samasasplitter/issues/3#issuecomment-312500848 gives some idea of where frequency data can be obtained.
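The linked Stack Overflow answer is a Viterbi-style dynamic program over a unigram cost model. A minimal sketch, with a toy frequency table standing in for real lexeme counts:

```python
import math

def segment(text, freq):
    """Split a string without spaces using a unigram cost model,
    in the spirit of the Stack Overflow answer linked above.

    `freq` maps words to counts; cost(w) = -log P(w), with infinite
    cost for words not in the table. A dynamic program finds the
    minimum-cost segmentation of each prefix.
    """
    total = sum(freq.values())

    def cost(w):
        return -math.log(freq[w] / total) if w in freq else float("inf")

    # best[i] = (cost, split) for the prefix text[:i]
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = [
            (best[j][0] + cost(text[j:i]), best[j][1] + [text[j:i]])
            for j in range(i)
        ]
        best.append(min(candidates))
    return best[-1][1]

# Toy frequency table (invented counts, for illustration only).
words = segment("kaHcit", {"kaH": 10, "cit": 8, "ka": 1, "Hcit": 1})
```

Note this only handles concatenation, not sandhi transformations at word boundaries, so for Sanskrit it would complement rather than replace the DAG-based splitter.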

kmadathil commented 6 years ago

@codito - Not sure how the whitespace problem and this issue are related. This issue is about evaluating accuracy, is it not? Your issue is about picking one split over another.

codito commented 6 years ago

I thought this issue also tracked using a score to ensure the most likely split gets higher priority in the output. Please ignore if I have confused two different things.

gasyoun commented 3 years ago

An Automatic Sanskrit Compound Processing


anil.pdf

How would you classify the approach?