kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License

Too long time for sandhi split #148

Closed drdhaval2785 closed 3 years ago

drdhaval2785 commented 3 years ago

Based on the suggestion of @kmadathil (https://github.com/sanskrit-kosha/kosha/issues/38#issuecomment-751585876), I tried the following:

$ sanskrit_parser sandhi 'स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्' --strict
Interpreting input strictly
INFO     Input String: स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्
INFO     Input String in SLP1: svaravyayaM svarganAkatridivatridaSAlayaH suraloko dyodivO dve striyAM klIbe trivizwapam
Splits:
INFO     Split: ['svar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vi', 'ayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'Alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'tri', 'vizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'Alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'tri', 'vizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']

The output is fine, but by the wall clock this took 22 seconds to process. Is that expected, or am I missing something?

kmadathil commented 3 years ago

7 seconds on my ancient laptop

$ time !!
time python ../../scripts/sanskrit_parser sandhi 'स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्' --strict
unable to import 'smart_open.gcs', disabling that module
Interpreting input strictly
INFO     Input String: स्वरव्ययं स्वर्गनाकत्रिदिवत्रिदशालयः सुरलोको द्योदिवौ द्वे स्त्रियां क्लीबे त्रिविष्टपम्
INFO     Input String in SLP1: svaravyayaM svarganAkatridivatridaSAlayaH suraloko dyodivO dve striyAM klIbe trivizwapam
Splits:
INFO     Split: ['svar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vi', 'ayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'Alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'tri', 'vizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'Alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tridaSAlayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'tri', 'vizwapam']
INFO     Split: ['svara', 'vyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']
INFO     Split: ['sU', 'ar', 'avyayam', 'svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'alayas', 'sura', 'lokas', 'dyo', 'divO', 'dve', 'striyAm', 'klIbe', 'trivizwapam']

real    0m7.053s
user    0m6.640s
sys     0m0.408s

kmadathil commented 3 years ago

@drdhaval2785 - have you been able to reproduce the timing above, or are you still seeing 22s?

drdhaval2785 commented 3 years ago

Not retested. Will let you know.

kmadathil commented 3 years ago

If you still see higher times, please LMK what your system configuration is.

drdhaval2785 commented 3 years ago

It is still higher.

dhaval@dhaval-Aspire-5750:~$ time !!
time time sanskrit_parser sandhi 'स्वर्गनाकत्रिदिवत्रिदशालयः' --strict
Interpreting input strictly
INFO     Input String: स्वर्गनाकत्रिदिवत्रिदशालयः
INFO     Input String in SLP1: svarganAkatridivatridaSAlayaH
Splits:
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'Alayas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tridaSAlayas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tridaSA', 'Alayas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tridaSA', 'layas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'alayas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tridaSa', 'alayas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tridaSA', 'alayas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tri', 'daSA', 'alayas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'A', 'layas']
INFO     Split: ['svarga', 'nAka', 'tridiva', 'tri', 'daSa', 'ala', 'yas']

real    1m24.269s
user    0m12.321s
sys     0m1.106s

dhaval@dhaval-Aspire-5750:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               42
Model name:          Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             966.067
CPU max MHz:         2100.0000
CPU min MHz:         800.0000
BogoMIPS:            4190.42
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            3072K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts

avinashvarna commented 3 years ago

I did some quick profiling, and on my machine it also takes <10 seconds. Most of that time is spent loading the module and its data, though. E.g. expanding all the sandhi rules takes over half of the module's initialization time.

So one opportunity for improvement would be to pre-expand all the sandhi rules into some form of database rather than expanding them on every run.
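A minimal sketch of that idea, not the actual sanskrit_parser code: expand a compact rule table once, store the result on disk, and read the stored copy on later runs. The rule table, `expand_rules`, `load_expanded_rules`, and the pickle cache file are all hypothetical and only illustrate the pattern.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical compact rule table in SLP1: (left chars, right chars) -> result.
# The real sanskrit_parser rule format is not shown in this thread.
COMPACT_RULES = {("aA", "aA"): "A", ("aA", "iI"): "e", ("aA", "uU"): "o"}


def expand_rules(compact):
    """Expand character classes into one concrete rule per character pair."""
    expanded = {}
    for (lefts, rights), result in compact.items():
        for left in lefts:
            for right in rights:
                expanded[(left, right)] = result
    return expanded


def load_expanded_rules(cache_path, compact=COMPACT_RULES):
    """Return expanded rules, reading the cache if present, else building it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    expanded = expand_rules(compact)
    with cache_path.open("wb") as f:
        pickle.dump(expanded, f)
    return expanded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "sandhi_rules.pkl"
        first = load_expanded_rules(cache)   # expands and writes the cache
        second = load_expanded_rules(cache)  # reads the cache, no expansion
        print(len(first), first == second)
```

The trade-off is some extra disk space and the need to rebuild the cache whenever the compact rules change.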

@drdhaval2785 would you be willing to run a profiling test to help us narrow down the source of the slowdown on your machine?

Steps:

  1. Put the following test code into a file, say profile_test.py:

         from sanskrit_parser import Parser

         string = 'स्वर्गनाकत्रिदिवत्रिदशालयः'
         parser = Parser(output_encoding='SLP1', replace_ending_visarga=None)
         parse_result = parser.parse(string)
         for split in parse_result.splits(max_splits=10):
             print(f' Split: {split}')

  2. Install pyinstrument (https://github.com/joerick/pyinstrument):
     `pip install pyinstrument`
  3. Run the profiler:
     `pyinstrument.exe -o profile.txt profile_test.py`
  4. Share the output profile.txt file.

avinashvarna commented 3 years ago

I pushed a "quick and dirty attempt" at reducing the initialization time to branch fix_148. It is a combination of an attempt to reduce the memory usage (#151) + this issue. Sample results on my machine:

(master) $ time sanskrit_parser sandhi --strict --input-encoding SLP1 svarganAkatridivatridaSAlayaH
real    0m11.454s
user    0m0.015s
sys     0m0.015s

On the fix_148 branch:

(fix_148) $ time sanskrit_parser sandhi --strict --input-encoding SLP1 svarganAkatridivatridaSAlayaH
real    0m4.572s
user    0m0.000s
sys     0m0.046s

@drdhaval2785 Could you please try the fix_148 branch when you get a chance and check if you get similar improvements?

avinashvarna commented 3 years ago

To reduce the initialization time discussed in this thread, the fix pre-expands the sandhi rules and stores them. Only the rules that are needed get loaded (i.e. backward rules for splitting and forward rules for joining). This slightly increases the size on disk in exchange for a faster module startup. If this is the solution we want to go with, I can clean it up and submit a PR.
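Roughly, the loading side could look like the sketch below. It is only an illustration under assumed names: `RuleStore`, `write_toy_rules`, the JSON file layout, and the toy rules are hypothetical and are not the code on the fix_148 branch.

```python
import json
import tempfile
from pathlib import Path

# Toy pre-expanded rules in SLP1; the real on-disk format is an assumption.
TOY_RULES = {
    "forward": {"a+i": "e", "a+u": "o"},                 # joining: parts -> surface
    "backward": {"e": [["a", "i"]], "o": [["a", "u"]]},  # splitting: surface -> parts
}


def write_toy_rules(directory):
    """Write one JSON file per direction, standing in for a build-time step."""
    for direction, rules in TOY_RULES.items():
        path = Path(directory) / f"{direction}.json"
        path.write_text(json.dumps(rules), encoding="utf-8")


class RuleStore:
    """Loads each direction's rule file on first use and keeps it in memory."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self._loaded = {}

    def rules(self, direction):
        if direction not in ("forward", "backward"):
            raise ValueError(f"unknown direction: {direction!r}")
        if direction not in self._loaded:
            path = self.directory / f"{direction}.json"
            self._loaded[direction] = json.loads(path.read_text(encoding="utf-8"))
        return self._loaded[direction]

    def loaded_directions(self):
        """Return the directions whose files have been read so far."""
        return sorted(self._loaded)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_toy_rules(tmp)
        store = RuleStore(tmp)
        print(store.rules("backward")["e"])  # a splitter only touches this file
        print(store.loaded_directions())     # forward rules were never read
```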

Note that this mainly matters for command-line usage, where the module has to be reinitialized on every invocation. If a script is used to split a lot of strings, the startup cost is paid only once.
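As a rough sketch of the script case: build the parser once and reuse it for every string. `split_all`, `StubParser`, and `StubResult` are hypothetical stand-ins so the example runs without the library. The stub only mimics the `parse(...)` and `splits(max_splits=...)` calls used in the profiling snippet earlier in this thread.

```python
class StubResult:
    """Stand-in for a parse result; splits on whitespace only."""

    def __init__(self, text):
        self.text = text

    def splits(self, max_splits=10):
        return [self.text.split()][:max_splits]


class StubParser:
    """Stand-in parser that counts how often it is constructed."""

    instances = 0

    def __init__(self):
        StubParser.instances += 1  # real initialization would be the slow part

    def parse(self, text):
        return StubResult(text)


def split_all(strings, make_parser, max_splits=10):
    """Build the parser once, then reuse it for every input string."""
    parser = make_parser()  # startup cost is paid here, once
    return {
        text: list(parser.parse(text).splits(max_splits=max_splits))
        for text in strings
    }


if __name__ == "__main__":
    results = split_all(["svarga nAka", "tridiva"], StubParser)
    print(results)
    print("parsers built:", StubParser.instances)
```

With the library installed, `make_parser` could presumably be something like `lambda: Parser(output_encoding='SLP1', replace_ending_visarga=None)`, matching the constructor call in the profiling snippet.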

kmadathil commented 3 years ago

Go for it.