gasyoun closed this issue 3 years ago
Yes.
1) Please put the 4000 words in a file and write a simple bash script like this (assuming you're on Linux):
echo "" > out.txt
for i in $(cat 4000_words.txt); do
    sanskrit_parser sandhi $i >> out.txt
done
2) If you want to do it programmatically, please see tests/SandhiKosh/manual_tests.py for an example of reading from CSV files and running sandhi split.
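For the programmatic route, here is a minimal sketch of the batch-reading side only. The actual parser call is left as a comment, since the exact sanskrit_parser API should be taken from tests/SandhiKosh/manual_tests.py; the CSV layout and column name here are assumptions for illustration.

```python
import csv
import io

# Small in-memory stand-in for a real CSV of words; in practice you
# would open your 4000-word file instead of this StringIO.
sample_csv = "word\nsambhUyasamutthAna\ntapovana\n"

words = []
with io.StringIO(sample_csv) as f:  # replace with open("4000_words.csv")
    reader = csv.DictReader(f)
    for row in reader:
        words.append(row["word"])

for w in words:
    # Hypothetical call; see tests/SandhiKosh/manual_tests.py for the
    # actual sanskrit_parser API, e.g. something like:
    # splits = parser.split(w)
    print(w)
```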
sanskrit_parser sandhi
Does it mean it will break all sandhis or will use some dictionary with sample breakings as a basis? I have never used your code, do not know where to start from.
tests/SandhiKosh/manual_tests.py
https://github.com/kmadathil/sanskrit_parser/tests/SandhiKosh/manual_tests.py ?
For this specific purpose, the online interface works well for testing. E.g. using the sandhi split option at https://sanskrit-parser.appspot.com on the first word in that list:
It breaks all sandhis, uses dictionary to figure out which ones result in valid words, and scores them based on an algorithm trained on DCS. Some details are explained here
It breaks all sandhis
I would say even too many.
uses dictionary to figure out which ones result in valid words
Which dictionary?
scores them based on an algorithm trained on DCS
So it uses DCS frequency, right? The higher, the more possible?
Because generating a list of all possible splits at each point can become confusing, is there a mode where only the first result is returned?
sambhūya sam utthāna
sambhūya samutthāna
sambhūya sam utthā na
sambhūya sam utthān a
sam bhūyasam utthāna
sambhūya samutthān a
sambhūya sam utthāḥ na
sambhūya sam utthā ana
sambhū ya sam utthāna
sambhūya sam uttha ana
memoized -> memorized
When you say "This uses a word2vec based scoring approach", does it mean you go the "Prioritize paths with a lower score (default)" way?
I would say even too many.
Agreed. When we say all, we mean "all possible". Over-generation is quite likely currently. As long as they are all grammatically valid though, there is no easy way to figure out the correct one without a broader context. Some of the over-splitting is also due to a lot of uncommon words present in the MW dictionary which we use for checking if a word is valid. If you have any ideas on how to filter the splits, we would be interested in hearing and perhaps incorporating them.
Which dictionary?
Currently INRIA lexicon and MW
So it uses DCS frequency, right? The higher, the more possible?
Not frequency directly, but something similar. Yes, the results are sorted based on likelihood.
Because generating a list of all possible splits at each point can become confusing, is there a mode where only the first result is returned?
On the command line, or in the python library, there is an option to specify the maximum number of results to return. The web UI has this set to 10. Cells 5 and 7 in the example notebook show how to set the limit and retain just the first result when using the python library.
memoized -> memorized
Same concept, but memoization is the term used in programming jargon - https://en.wikipedia.org/wiki/Memoization
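As an aside, memoization just means caching a function's result for arguments it has already seen, so repeated sub-problems are computed only once. A minimal Python sketch of the concept, unrelated to the parser's internals:

```python
from functools import lru_cache

calls = 0  # count how many times the function body actually runs

@lru_cache(maxsize=None)
def fib(n):
    """Naive recursive Fibonacci, made fast: each n is computed once."""
    global calls
    calls += 1
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, with only 31 underlying calls instead of ~2.7M
```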
When you say This uses a word2vec based scoring approach does it mean you go the Prioritize paths with a lower score (default) way?
Yes, that is the default behavior, but it can be turned off via the command line or other options. The score for each path is computed using a word2vec model trained on DCS.
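To illustrate the general idea of "lower score wins" ranking (this is not the project's actual algorithm, and the frequency table below is made up): assign each candidate split a cost, e.g. the sum of negative log-probabilities of its words under some model, and sort ascending.

```python
import math

# Toy unigram frequencies standing in for a model trained on DCS
# (illustrative values only).
freq = {"sambhūya": 50, "samutthāna": 30, "sam": 400,
        "utthāna": 5, "utthā": 1, "na": 900}
total = sum(freq.values())

def cost(split):
    # Sum of negative log-probabilities; unseen words get a heavy penalty.
    return sum(-math.log(freq.get(w, 0.01) / total) for w in split)

candidates = [
    ["sambhūya", "samutthāna"],
    ["sambhūya", "sam", "utthāna"],
    ["sambhūya", "sam", "utthā", "na"],
]
ranked = sorted(candidates, key=cost)  # lowest cost first
print(ranked[0])
```

Under this toy model, fewer and more frequent words yield a lower cost, so the two-word split ranks first.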
Some of the over-splitting is also due to a lot of uncommon words present in the MW dictionary
MD, i.e. Macdonell, seems to be an ideal candidate for replacing MW to retain only widely used words. It used to give me more decent results than MW.
Thanks @drdhaval2785, will look into it. We should probably file a separate issue to track such an effort.
As long as they are all grammatically valid though, there is no easy way to figure out the correct one without a broader context.
Half of them are ungrammatical.
sambhūya sam utthān a
Even for Vedic it would be too much.
In "sambhūya sam utthāna", why do you not split as "sam bhūya sam utthāna"? And why is "sambhūya samutthāna" not ranked above the rest?
Currently INRIA lexicon
It had around 25k words last time I checked. Why do you not use Apte's vyutpatti?
MD i.e. Macdonell
Around 33k words, so similar in size to INRIA.
I am trying to run the program under Windows. I ran setup.py bdist_wininst and got a binary installation file. I ran it and the installation completed correctly, but the command sanskrit_parser is not found. Then I ran cmd_line.py and got an error: ModuleNotFoundError: No module named 'indic_transliteration'. Can you help me, please? With best regards, Vladimir.
If you have pip, can you try py -m pip install sanskrit_parser? That will install all the dependencies as well and put the script on your path. https://pip.pypa.io/en/stable/installing/
Thank you very much. But now it asks for the "gensim" module. I'll try to reinstall Python.
Strange. gensim is a dependency and is listed in setup.py, so it should be installed automatically when you do py -m pip install sanskrit_parser.
These are the dependencies that should get automatically installed
install_requires=['indic_transliteration!=1.9.5,!=1.9.6', 'lxml', 'networkx', 'tinydb',
'six', 'flask', 'flask_restx', 'flask_cors',
'jsonpickle', 'sanskrit_util', 'sqlalchemy<1.4',
'sentencepiece', 'gensim', 'pydot', 'pandas', 'xlrd'],
When I install it I get the following error (see screenshot link below): https://yadi.sk/i/r38mInyUfZRT8A
Looks like you need a C++ compiler to compile the gensim model. This is not essential if you just want to try it out without lexical scoring (the only difference will be the ordering of your sandhi splits). Just add --no-score to the command line when you try sandhi split (this is the default when you do vakya analysis, so that should work out of the box). Can you confirm that all your other dependencies were installed without errors?
As the error message says, if you install Microsoft Visual C++ 14.0, gensim should install properly.
@avinashvarna - can you suggest anything else?
Yes, all other dependencies were installed OK, just gensim. I'll try without scoring.
only difference will be the ordering of your sandhi splits
Scoring is important, we would not want to live without it.
It's up to you, we provide both options.
If you want to use scoring, but can't or don't want to install the C++ compiler to get gensim working, here are two options:
curl -X GET "https://sanskrit-parser.appspot.com/sanskrit_parser/v1/splits/sambh%C5%AByasamutth%C4%81na" -H "accept: application/json"
or
https://sanskrit-parser.appspot.com/sanskrit_parser/v1/splits/sambh%C5%AByasamutth%C4%81na
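If you need to hit that REST endpoint for many words, the percent-encoding can be produced with the Python standard library. A sketch that only constructs the URLs (no network call; the endpoint path is taken verbatim from the example above):

```python
from urllib.parse import quote

BASE = "https://sanskrit-parser.appspot.com/sanskrit_parser/v1/splits/"

def split_url(word):
    # Percent-encode the UTF-8 form of the word for use in the URL path.
    return BASE + quote(word)

url = split_url("sambhūyasamutthāna")
print(url)  # ends with sambh%C5%AByasamutth%C4%81na, as in the curl example
```

Looping this over a word list and fetching each URL (e.g. with urllib.request or requests) would give you the batch behavior through the API.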
Thank you very much!
Nothing works. How do I feed 4000 words to your API? Is this the one? How do I install it?
https://github.com/kmadathil/sanskrit_parser/blob/master/sanskrit_parser/rest_api/api_v1.py
@gasyoun - Please see Avinash's answer above.
Avinash's answer above.
I'm sorry to say it's not understandable to someone outside your circle. Let me add a few examples:
If we are running on Binder, we can skip this step
Is https://mybinder.org/ meant, or https://jupyter.org/binder ?
As an example, let us try a long phrase from the चम्पूरामायणम् of भोजः । We will ask the parser to find at most 10 splits.
It's just one phrase. How do I feed 4000 such phrases at once?
REST API of sanskrit-parser.appspot.com documented here
@avinashvarna, if this is self-evident to you, I must say it is not to me; I do not see an example. Three Russians have tried to replicate what you said and failed, with different error messages.
A week from now we have a paper at a conference where I would like to talk about your approach, but I can't, because nothing works in batch for us. Can I ask for a favor? Could you please split the file for me and attach the results here? Otherwise we will miss our deadlines and will just have to abandon the idea that the code can actually help us in samāsa splitting. Thanks.
This project has been tested and developed using Python 3.7.
I would add that 3.8 and 3.9 will not work, only 3.7.
Partial output of the following bash command attached (this is with scoring, and the full dictionary set):
for k in $(cat /tmp/mozilla_karthick0/13039-words-to-be-split.txt); do scripts/sanskrit_parser sandhi $k --max 1 >> result.txt 2>&1; done
If this works for you, you can run the Windows equivalent to get the full result. result.txt
Full results of the above run (this is the first split of each input only): result.txt
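A Windows-friendly way to drive the same loop is Python itself instead of bash. This sketch only builds the per-word command lines; the actual invocation is left commented out, and it assumes the scripts/sanskrit_parser CLI and the --max flag shown in the bash command above.

```python
# In practice, read these from the word file, one word per line:
# words = open("13039-words-to-be-split.txt", encoding="utf-8").read().split()
words = ["sambhūyasamutthāna", "tapovana"]

commands = []
for w in words:
    # Mirror of: scripts/sanskrit_parser sandhi $k --max 1
    cmd = ["python", "scripts/sanskrit_parser", "sandhi", w, "--max", "1"]
    commands.append(cmd)
    # To actually run it and append to result.txt:
    # import subprocess
    # with open("result.txt", "a", encoding="utf-8") as out:
    #     subprocess.run(cmd, stdout=out, stderr=out)

print(len(commands))
```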
Thank you so much for helping us!!!
If this works for you, you can run the Windows equivalent to get the full result.
Thanks, @kmadathil you helped us out. We will continue, but seems there are still lots of things we do wrong.
Thanks!
The samāsa splitter was not invented for the first time by Dhaval, but I guess it was the first open-source one.
https://github.com/drdhaval2785/samasasplitter/issues/1
Is there a way to split 4000 words in a row with sanskrit_parser? Full list to be split here: https://github.com/funderburkjim/MWderivations/issues/14