kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License

Batch samasa splitter #164

Closed gasyoun closed 3 years ago

gasyoun commented 3 years ago

The samasa splitter was not invented by Dhaval, but I guess his was the first one with open code.

https://github.com/drdhaval2785/samasasplitter/issues/1

Is there a way to split 4000 words in a row with sanskrit_parser?

Full list to be split here https://github.com/funderburkjim/MWderivations/issues/14

kmadathil commented 3 years ago

Yes.

1) Please put the 4000 words in a file and write a simple bash script like this (presuming you're on Linux):

   echo "" > out.txt
   for i in $(cat 4000_words.txt); do
      sanskrit_parser sandhi $i >> out.txt
   done

2) If you want to do it programmatically, please see tests/SandhiKosh/manual_tests.py for an example of reading from CSV files and running sandhi split.
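The programmatic route can be sketched as below. This is a generic batch-processing pattern, not the actual API from `tests/SandhiKosh/manual_tests.py`: `split_word` is a stand-in for whatever parser call you end up using, and the file names are illustrative.

```python
# Sketch of batch splitting in Python. split_word is a placeholder for
# the real sanskrit_parser call; swap in the library's split function.

def split_batch(words, split_word):
    """Apply split_word to each input word, collecting results by word."""
    results = {}
    for word in words:
        try:
            results[word] = split_word(word)
        except Exception as exc:  # keep going if one word fails
            results[word] = f"ERROR: {exc}"
    return results

def run(in_path="4000_words.txt", out_path="out.txt", split_word=None):
    """Read one word per line, split each, write tab-separated results."""
    with open(in_path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    results = split_batch(words, split_word)
    with open(out_path, "w", encoding="utf-8") as f:
        for word, split in results.items():
            f.write(f"{word}\t{split}\n")
```

Collecting per-word errors instead of aborting matters for a 4000-word run, since a single malformed input would otherwise lose the whole batch.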

gasyoun commented 3 years ago

sanskrit_parser sandhi

Does it mean it will break all sandhis, or will it use some dictionary with sample breakings as a basis? I have never used your code and do not know where to start.

tests/SandhiKosh/manual_tests.py

https://github.com/kmadathil/sanskrit_parser/tests/SandhiKosh/manual_tests.py ?

avinashvarna commented 3 years ago

For this specific purpose, the online interface works well for testing, e.g. using the sandhi split option at https://sanskrit-parser.appspot.com on the first word in that list (screenshot attached in the original comment).

It breaks all sandhis, uses a dictionary to figure out which ones result in valid words, and scores them based on an algorithm trained on DCS. Some details are explained here.

gasyoun commented 3 years ago

It breaks all sandhis

I would say even too many.

uses dictionary to figure out which ones result in valid words

Which dictionary?

scores them based on an algorithm trained on DCS

So it uses DCS frequency, right? The higher, the more likely? Because generating a list of all possible splits at each point can become confusing, is there a mode where only the first result is returned?

sambhūya sam utthāna
sambhūya samutthāna
sambhūya sam utthā na
sambhūya sam utthān a
sam bhūyasam utthāna
sambhūya samutthān a
sambhūya sam utthāḥ na
sambhūya sam utthā ana
sambhū ya sam utthāna
sambhūya sam uttha ana

memoized -> memorized

When you say This uses a word2vec based scoring approach does it mean you go the Prioritize paths with a lower score (default) way?

avinashvarna commented 3 years ago

I would say even too many.

Agreed. When we say all, we mean "all possible". Over-generation is quite likely currently. As long as they are all grammatically valid, though, there is no easy way to figure out the correct one without broader context. Some of the over-splitting is also due to a lot of uncommon words present in the MW dictionary, which we use for checking whether a word is valid. If you have any ideas on how to filter the splits, we would be interested in hearing them and perhaps incorporating them.

Which dictionary?

Currently the INRIA lexicon and MW.

So it uses DCS frequency, right? The higher, the more likely?

Not frequency directly, but something similar. Yes, the results are sorted based on likelihood.

Because generating a list of all possible splits at each point can become confusing, is there a mode where only the first result is returned?

On the command line, or the python library, there is an option to specify the maximum number of results to return. The web UI has this set to 10. Cells 5 and 7 in the example notebook show how to set the limit and retain just the first result when using the python library.

memoized -> memorized

Same concept, but memoization is the term used in programming jargon - https://en.wikipedia.org/wiki/Memoization
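The concept is easy to see in a few lines; `functools.lru_cache` is Python's built-in form of memoization. This is a generic illustration, not the parser's internal cache:

```python
from functools import lru_cache

# Without memoization this naive recursion recomputes the same
# subproblems exponentially many times; with lru_cache each distinct
# argument is computed once and then looked up.

@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(60))  # returns instantly; the unmemoized version would take ages
```

Sandhi splitting benefits from the same trick, since many candidate splits share identical sub-strings whose analyses need only be computed once.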

When you say This uses a word2vec based scoring approach does it mean you go the Prioritize paths with a lower score (default) way?

Yes, that is the default behavior, but it can be turned off via the command line or other options. The score for each path is computed using a word2vec model trained on DCS.
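Conceptually, the ranking step works as below. The scores here are invented for illustration; the real system derives them from the word2vec model, but the "prioritize paths with a lower score" default reduces to an ascending sort:

```python
# Hypothetical candidate splits with made-up path scores
# (lower score = more likely path, per the default described above).
candidates = [
    ("sambhūya sam utthāna", 2.31),
    ("sambhūya samutthāna", 1.87),
    ("sambhūya sam utthā na", 4.02),
]

# Sort ascending by score so the most likely split comes first.
ranked = sorted(candidates, key=lambda pair: pair[1])
best_split, best_score = ranked[0]
print(best_split)  # the single top result, as with a limit of 1
```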

drdhaval2785 commented 3 years ago

Some of the over-splitting is also due to a lot of uncommon words present in the MW dictionary

MD, i.e. Macdonell, seems to be an ideal candidate for replacing MW if the goal is to retain only widely used words. It used to give me more decent results than MW.

avinashvarna commented 3 years ago

Thanks @drdhaval2785, will look into it. We should probably file a separate issue to track such an effort.

gasyoun commented 3 years ago

As long as they are all grammatically valid though, there is no easy way to figure out the correct one without a broader context.

Half of them are ungrammatical.

sambhūya sam utthān a

Even for Vedic it would be too much.

In sambhūya sam utthāna, why do you not split as sam bhūya sam utthāna? And why is sambhūya samutthāna not ranked above the rest?

Currently INRIA lexicon

It had around 25k words last time I checked. Why do you not use Apte's vyutpatti?

MD i.e. Macdonell

Around 33k words, so similar in size to INRIA.

VladimirWl commented 3 years ago

I am trying to run the program under Windows. I ran setup.py bdist_wininst and got a binary installation file. I ran it and the installation completed correctly, but the command sanskrit_parser is not found. I then ran cmd_line.py and got an error: ModuleNotFoundError: No module named 'indic_transliteration'. Can you help me, please? With best regards, Vladimir.

kmadathil commented 3 years ago

If you have pip, can you try py -m pip install sanskrit_parser ? That will install all the dependencies as well and put the script on your path.

https://pip.pypa.io/en/stable/installing/


VladimirWl commented 3 years ago

Thank you very much, but now it asks for the "gensim" module. I'll try to reinstall Python.

kmadathil commented 3 years ago

Strange. gensim is a dependency, and is in setup.py, so it should be automatically installed if you do py -m pip install sanskrit_parser.

These are the dependencies that should get automatically installed

    install_requires=['indic_transliteration!=1.9.5,!=1.9.6', 'lxml', 'networkx', 'tinydb',
                      'six', 'flask', 'flask_restx', 'flask_cors',
                      'jsonpickle', 'sanskrit_util', 'sqlalchemy<1.4',
                      'sentencepiece', 'gensim', 'pydot', 'pandas', 'xlrd'],

VladimirWl commented 3 years ago

When I install it, I get the following error (screenshot link below): https://yadi.sk/i/r38mInyUfZRT8A

kmadathil commented 3 years ago

Looks like you need a C++ compiler to compile the gensim model. This is not essential if you just want to try it out without lexical scoring (the only difference will be the ordering of your sandhi splits). Just add --no-score to the command line when you try sandhi splitting (this is the default when you do vakya analysis, so that should work out of the box). Can you confirm that all your other dependencies were installed without errors?

As the error message says, if you install Microsoft Visual C++ 14.0, gensim should install properly.

@avinashvarna - can you suggest anything else?

VladimirWl commented 3 years ago

Yes, all other dependencies were installed OK, just not gensim. I'll try it without scoring.

gasyoun commented 3 years ago

only difference will be the ordering of your sandhi splits

Scoring is important, we would not want to live without it.

kmadathil commented 3 years ago

It's up to you, we provide both options.

avinashvarna commented 3 years ago

If you want to use scoring, but can't/don't want to install the c++ compiler to get gensim working, here are two options:

VladimirWl commented 3 years ago

Thank you very much!

gasyoun commented 3 years ago

Nothing works. How do we feed 4000 words to your API? Is this the one? How do we install it?

https://github.com/kmadathil/sanskrit_parser/blob/master/sanskrit_parser/rest_api/api_v1.py

kmadathil commented 3 years ago

@gasyoun - Please see Avinash's answer above.

gasyoun commented 3 years ago

Avinash's answer above.

I'm sorry to say it's not understandable to someone outside your box. Let me add a few examples:

If we are running on Binder, we can skip this step

Is https://mybinder.org/ meant, or https://jupyter.org/binder?

As an example, let us try a long phrase from the चम्पूरामायणम् of भोजः । We will ask the parser to find at most 10 splits.

It's just one phrase. How to feed 4000 such at once?

REST API of sanskrit-parser.appspot.com documented here

If @avinashvarna this is self-obvious to you, I must say it's not to me, I do not see an example. Three Russians have tried to replicate what you have said and failed - with different error messages.

In a week we have a paper at a conference where I would like to talk about your approach, but I can't, because nothing works in batch for us. Can I ask for a favor? Could you please split the file for me and attach the results here? Otherwise we will miss our deadlines and will just have to abandon the idea that the code can actually help us with samāsa splitting. Thanks.

13039-words-to-be-split.txt

"This project has been tested and developed using Python 3.7." I would add that 3.8 and 3.9 will not work, only 3.7.

kmadathil commented 3 years ago

Partial output of the following bash command attached (this is with scoring, and the full dictionary set):

for k in $(cat /tmp/mozilla_karthick0/13039-words-to-be-split.txt); do scripts/sanskrit_parser sandhi $k --max 1 >> result.txt 2>&1; done

If this works for you, you can run the Windows equivalent to get the full result: result.txt
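A cross-platform way to run that "Windows equivalent" is to drive the CLI from Python with `subprocess`. The command and file names below mirror the bash invocation above (`sanskrit_parser sandhi $k --max 1`), but treat the exact flags as something to verify against your installed version:

```python
import subprocess

def run_batch(words, cmd_template, out_path):
    """Run cmd_template once per word, appending stdout and stderr of
    each run to out_path. In cmd_template, "{}" marks the word slot."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word in words:
            cmd = [word if part == "{}" else part for part in cmd_template]
            proc = subprocess.run(cmd, capture_output=True, text=True)
            out.write(proc.stdout)
            out.write(proc.stderr)

def run_full_list():
    """Intended usage; assumes sanskrit_parser is on the PATH."""
    with open("13039-words-to-be-split.txt", encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    run_batch(words, ["sanskrit_parser", "sandhi", "{}", "--max", "1"],
              "result.txt")
```

Capturing stderr alongside stdout reproduces the `>> result.txt 2>&1` redirection, so errors for individual words end up in the log instead of killing the run.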

kmadathil commented 3 years ago

Full results of the above run (first split of each input only): result.txt

VladimirWl commented 3 years ago

Thank you so much for helping us!!!

gasyoun commented 3 years ago

If this works for you, you can run the Windows equivalent to get the full result.

Thanks, @kmadathil, you helped us out. We will continue, but it seems there are still lots of things we do wrong.

kmadathil commented 3 years ago

Thanks!