indic-dict / stardict-sanskrit-vyAkaraNa


Add verb forms from ambuda vidyut prakriyA #18

Open vvasuki opened 7 months ago

vvasuki commented 7 months ago

Offline stardict dictionaries for verb forms are quite popular. For example, https://github.com/indic-dict/stardict-sanskrit-vyAkaraNa/issues/17 from @suhasm and https://github.com/indic-dict/stardict-sanskrit-vyAkaraNa/issues/1 .

https://ambuda-org.github.io/vidyullekha/?tab=dhatu&dhatu=01.0002&sanadi=5 has a lot of attractive data which could be turned into stardict dictionaries. It seems quite thorough; see, for example, https://ambuda-org.github.io/vidyullekha/?tab=dhatu&dhatu=01.0862&sanadi=5

Would it be possible to dump the data as babylon files (one for sanAdi, one for yaN etc..) @akprasad ? (Format is quite simple - https://raw.githubusercontent.com/indic-dict/stardict-sanskrit-vyAkaraNa/master/ashtadhyayi_com_san/ashtadhyayi_com_san.babylon . Thence our scripts will autogenerate and distribute the dictionaries.)

akprasad commented 7 months ago

I can certainly dump the data and plan to have some scripts committed for doing so to CSV.

Since babylon files are useful for stardict, I am happy to explore supporting them as well. But I have some questions:

1. What is the data format?

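It looks like:

  • 2 lines per entry
  • entries are separated by blank lines
  • the first line is all the possible search terms separated by | (including the dhatu)
  • the second line is the same data rendered as HTML, including the dhatu meaning.
  • --- separates purushas
  • each vacana is on its own line
  • / separates forms for the same purusha / vacana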

2. Do you need only लँट् forms?

I'm happy to generate all lakAras. It's the same amount of work either way.

3. Does the output need to be in Devanagari?

Vidyut's current transliterator is not very good, but I believe it should handle simple SLP1 --> Devanagari. Just confirming if you need Devanagari.

4. Can I use my existing Dhatupatha?

Our dhatupatha.tsv file comes from ashtadhyayi.com. Using this same dhatupatha file will be substantially easier for me.

~

Also, I am curious what these forms would be for, since you are collecting similar forms in #17 already.

vvasuki commented 7 months ago

I can certainly dump the data and plan to have some scripts committed for doing so to CSV.

Since babylon files are useful for stardict, I am happy to explore supporting them as well. But I have some questions:

1. What is the data format?

It looks like:

  • 2 lines per entry
  • entries are separated by blank lines
  • the first line is all the possible search terms separated by | (including the dhatu)
  • the second line is the same data rendered as HTML, including the dhatu meaning.

Exactly. (Not mentioned above are the first few "header" lines.)

  • --- separates purushas
  • each vacana is on its own line
  • / separates forms for the same purusha / vacana

Yes - that's good enough. If you can make a better html table, that's welcome too (as long as the file size doesn't blow up beyond 90MB). The most common use case is to look up the root of a given tiNanta.
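The layout agreed on above can be sketched in Python as follows. This is only an illustration: `format_entry` and its arguments are hypothetical names (not part of vidyut or of the indic-dict scripts), and the sample title string is made up for the example.

```python
def format_entry(headwords, title, paradigm):
    """Render one entry as two lines: search terms, then an HTML body.

    `paradigm` holds one row per purusha; each row holds one cell per
    vacana; each cell is a list of alternate forms.
    """
    forms = [form for row in paradigm for cell in row for form in cell]
    search_terms = []
    for term in list(headwords) + forms:
        if term not in search_terms:  # keep first occurrence, preserve order
            search_terms.append(term)

    # "/" separates alternates, <br> separates vacanas, "---" separates purushas.
    rows = ["<br>".join(" / ".join(cell) for cell in row) for row in paradigm]
    body = title + "<br><br>" + "<br>---<br>".join(rows)
    return "|".join(search_terms) + "\n" + body + "\n"


# Hypothetical sample data: the लट् forms of भू.
paradigm = [
    [["भवति"], ["भवतः"], ["भवन्ति"]],
    [["भवसि"], ["भवथः"], ["भवथ"]],
    [["भवामि"], ["भवावः"], ["भवामः"]],
]
entry = format_entry(["भू"], "भू (भू सत्तायाम् परस्मैपदि-लट्)", paradigm)
print(entry)
```

Entries produced this way would still need to be joined with blank lines and preceded by the header lines before they form a complete babylon file.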

2. Do you need only लँट् forms?

I'm happy to generate all lakAras. It's the same amount of work either way.

All lakAras. Separate file for each (san/ ya~N/ etc..) pratyaya lakAra combination (to keep file sizes manageable).

3. Does the output need to be in Devanagari?

Vidyut's current transliterator is not very good, but I believe it should handle simple SLP1 --> Devanagari. Just confirming if you need Devanagari.

Yes please. For these standard dhAtus, transliteration shouldn't be a problem.

4. Can I use my existing Dhatupatha?

Our dhatupatha.tsv file comes from ashtadhyayi.com. Using this same dhatupatha file will be substantially easier for me.

Sure.

Also, I am curious what these forms would be for, since you are collecting similar forms in #17 already.

Just as one can look up a root in multiple dictionaries, one should be able to look up forms generated by multiple sources. I anticipate that vidyut data will be the most thorough and reliable.

akprasad commented 7 months ago

Sure, then I'll take this on.

I expect I'll have something to share tomorrow. How should I send you the files -- attachment on this GitHub issue? email?

akprasad commented 7 months ago

A preview:

TODO |boBUyate|boBUyete|boBUyante|boBUyase|boBUyeTe|boBUyaDve|boBUye|boBUyAvahe|boBUyAmahe
TODO BU sattAyAm Atmanepadi-law<br><br>boBUyate<br>boBUyete<br>boBUyante<br>---<br>boBUyase<br>boBUyeTe<br>boBUyaDve<br>---<br>boBUye<br>boBUyAvahe<br>boBUyAmahe

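Work remaining:

  • TODO should be replaced with the actual dhatu after इत्संज्ञा-लोप, सत्व, नत्व, नुम्, etc.
  • The text will of course be in Devanagari instead of SLP1.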

Otherwise, please double check the output format above and let me know if it looks as expected.

vvasuki commented 7 months ago

  • TODO should be replaced with the actual dhatu after इत्संज्ञा-लोप, सत्व, नत्व, नुम्, etc.
  • The text will of course be in Devanagari instead of SLP1.

Otherwise, please double check the output format above and let me know if it looks as expected.

Looks good. Besides the actual dhatu after इत्संज्ञा-लोप, सत्व, नत्व, नुम्, etc., it would be welcome to add additional headwords like बोभू (which attains dhAtu-saMjNA as well), and also the aupadeshika dhAtu in case it is different (users might look up forms based on that).

How should I send you the files -- attachment on this GitHub issue? email?

PS: I edited the previous comment to "Separate file for each (san/ ya~N/ etc..) pratyaya", since having all lakAra-s in a single file remains manageable.

Could you send a pull request with the following paths (convention is that dir-name and file-name minus extension are same):

Also you can add a description to the header, like:


#bookname=vidyut-yaN (sa-sa)
#description=विद्युद्-यन्त्रेण जनितानि यङन्तेभ्यस् तिङन्तानि । https://ambuda-org.github.io/vidyullekha/ इति दृश्यताम्। 
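As a rough sketch of how such header lines and the two-line entries might be assembled into one file, consider the following Python. The function name `render_babylon` is hypothetical. The leading empty line follows a remark made later in this thread, and the blank line between the header and the first entry is an assumption about the format.

```python
def render_babylon(bookname, description, entries):
    """Return babylon text: header lines, then two-line entries.

    `entries` is a list of (search_terms_line, html_body_line) pairs.
    """
    header = [
        "",  # leading empty line, as requested later in this thread
        f"#bookname={bookname}",
        f"#description={description}",
        "",  # blank line before the first entry (assumed)
    ]
    blocks = [f"{terms}\n{body}" for terms, body in entries]
    # Entries are separated from each other by one blank line.
    return "\n".join(header) + "\n" + "\n\n".join(blocks) + "\n"


text = render_babylon(
    "vidyut-yaN (sa-sa)",
    "sample description",
    [("a|b", "body one"), ("c", "body two")],
)
print(text)
```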

akprasad commented 7 months ago

it would be welcome to add additional headwords like बोभू

Done. Sample for यङ्, still in SLP1 --

vand|vAvandya|vAvandyate|vAvandyete|vAvandyante|vAvandyase|vAvandyeTe|vAvandyaDve|vAvandye|vAvandyAvahe|vAvandyAmahe
vand vAvandya vadi~ aBivAdanastutyoH Atmanepadi-law<br><br>vAvandyate<br>vAvandyete<br>vAvandyante<br>---<br>vAvandyase<br>vAvandyeTe<br>vAvandyaDve<br>---<br>vAvandye<br>vAvandyAvahe<br>vAvandyAmahe
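To illustrate the main use case mentioned earlier (finding the root of a given tiNanta), here is a minimal Python sketch that reads entries in this layout and indexes every search term. The names `parse_babylon` and `build_index` are hypothetical, the sample is an abbreviated copy of the entry above, and the parser assumes that blank lines contain no stray whitespace.

```python
def parse_babylon(text):
    """Split babylon text into (search_terms, html_body) pairs."""
    entries = []
    for block in text.strip().split("\n\n"):
        # Header lines start with "#"; skip them.
        lines = [ln for ln in block.splitlines() if ln and not ln.startswith("#")]
        if len(lines) >= 2:
            entries.append((lines[0].split("|"), lines[1]))
    return entries


def build_index(entries):
    """Map each search term to the bodies of the entries that list it."""
    index = {}
    for terms, body in entries:
        for term in terms:
            index.setdefault(term, []).append(body)
    return index


sample = (
    "\n#bookname=demo\n\n"
    "vand|vAvandya|vAvandyate|vAvandyete|vAvandyante\n"
    "vand vAvandya vadi~ aBivAdanastutyoH Atmanepadi-law<br><br>"
    "vAvandyate<br>vAvandyete<br>vAvandyante\n"
)
index = build_index(parse_babylon(sample))
# The first word of the body is the dhatu headword.
print(index["vAvandyete"][0].split()[0])  # prints "vand"
```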

Once I add the transliteration to Devanagari, I'll create a pull request.

akprasad commented 7 months ago

Opened as #19.

I can't promise any ongoing support (bug fix PRs, other sanAdi-pratyayas, etc.), but I'll maintain the script I used as part of the examples directory in vidyut-prakriya so that you can regenerate these files whenever you wish. See the PR description for details.

[EDIT] Relevant instructions from the PR:

Generated from a dev build of vidyut-prakriya, a Paninian word generator.

Our generation setup is not checked in yet but will be within a week of this commit. The setup will be available in the examples directory of the vidyut-prakriya project.

Sample usage, for future reference. Note that this API is not stable and might change in the future.

cargo run --release --example create_tinantas_babylon -- \
--sanadi yaN \
--desc "vidyud-yantreRa janitAni yaNanteByas tiNantAni" > vidyut_yaN.babylon
cargo run --release --example create_tinantas_babylon -- \
--sanadi yaNluk \
--desc "vidyud-yantreRa janitAni yaNluganteByas tiNantAni" > vidyut-yaN-luk.babylon

vvasuki commented 7 months ago

Thanks. Got the dicts (merged in order to inspect quickly since I couldn't do it online).

Bugs

After installing them on my computer, I notice the following:

From ashtadhyayi_com_yangluk (sa-sa)
बोभवीति

भू भू सत्तायाम् परस्मैपदि-लट्
बोभवीति / बोभोति
बोभूतः
बोभूवति
---
बोभवीषि / बोभोषि
बोभूथः
बोभूथ
---
बोभवीमि / बोभोमि
बोभूवः
बोभूमः
From vidyut-yaN-luk (sa-sa)
बोभवीति

भू बोभू (भू सत्तायाम् परस्मैपदि-लट्)
बोभवीति / बोभोति
बोभूतं
बोभुवति
---
बोभवीषि / बोभोषि
बोभूथं
बोभूथ
---
बोभवीमि / बोभोमि
बोभूवं
बोभूमं

I observe the difference between the aShTAdhyAyI.com forms and vidyut forms - is the latter incorrect? Seems to be a problem with devanAgarI conversion of visarga.
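A comparison like this can be automated. The following Python sketch diffs the two paradigms quoted above cell by cell and flags the cells that differ only by visarga versus anusvara. The function names are hypothetical and the check is a heuristic, not a statement about which source is correct.

```python
VISARGA = "\u0903"   # ः
ANUSVARA = "\u0902"  # ं


def parse_paradigm(lines):
    """Turn displayed lines into cells; each cell lists alternate forms."""
    cells = []
    for line in lines:
        if line.strip() == "---":  # purusha separator
            continue
        cells.append([form.strip() for form in line.split("/")])
    return cells


def diff_paradigms(reference, candidate):
    """Return (cell_index, reference_cell, candidate_cell) for unequal cells."""
    return [
        (i, ref, cand)
        for i, (ref, cand) in enumerate(zip(reference, candidate))
        if ref != cand
    ]


def is_visarga_swap(ref, cand):
    """True if replacing visarga with anusvara in ref yields cand."""
    return len(ref) == len(cand) and all(
        r.replace(VISARGA, ANUSVARA) == c for r, c in zip(ref, cand)
    )


ashtadhyayi = parse_paradigm([
    "बोभवीति / बोभोति", "बोभूतः", "बोभूवति", "---",
    "बोभवीषि / बोभोषि", "बोभूथः", "बोभूथ", "---",
    "बोभवीमि / बोभोमि", "बोभूवः", "बोभूमः",
])
vidyut = parse_paradigm([
    "बोभवीति / बोभोति", "बोभूतं", "बोभुवति", "---",
    "बोभवीषि / बोभोषि", "बोभूथं", "बोभूथ", "---",
    "बोभवीमि / बोभोमि", "बोभूवं", "बोभूमं",
])

diffs = diff_paradigms(ashtadhyayi, vidyut)
swaps = [i for i, ref, cand in diffs if is_visarga_swap(ref, cand)]
others = [i for i, ref, cand in diffs if not is_visarga_swap(ref, cand)]
print("visarga/anusvara only:", swaps)
print("other differences:", others)
```

With the forms quoted above, four of the five differing cells differ only by visarga versus anusvara. The remaining one (बोभूवति vs बोभुवति) differs in vowel length and would need a separate check.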

Do you mind correcting and sending another pull request? (PS: the file should start with an empty line, which I've added manually.)

API

Also, regarding the API usage (which I've copied to your comment above), is the process as follows:

akprasad commented 7 months ago

is the latter incorrect?

Yes -- this is a transliteration bug, not a vidyut bug. I'll prepare a PR to fix.

PS: the file should start with an empty line, which I've added manually.

:+1:

is the process as follows

Yes. You'll need to install Rust, but otherwise that's it.

vvasuki commented 7 months ago

is the latter incorrect?

Yes -- this is a transliteration bug, not a vidyut bug. I'll prepare a PR to fix.

Got your PR - looks fine, thanks!

is the process as follows

Yes. You'll need to install Rust, but otherwise that's it.

Could you merge this examples directory into the main branch, and also let me know the commands to generate similar babylon files for कर्तरि, कर्मणि/भावे, णिच्, सन्, णिच्+कर्मणि/भावे, यङ् +कर्मणि/भावे and यङ्-लुक् +कर्मणि/भावे forms?

akprasad commented 7 months ago

Could you merge this examples directory into the main branch

Sure, will likely happen within the next few days. I am currently improving सन्नन्तs and would like to have those changes as part of the commit.

commands to generate similar babylon files for कर्तरि, कर्मणि/भावे, णिच्, सन्, णिच्+कर्मणि/भावे, यङ् +कर्मणि/भावे and यङ्-लुक् +कर्मणि/भावे forms

Sure. Do you also want support for upasargas? This would inflate the output but it would let you capture words like सङ्गच्छते, etc.

akprasad commented 7 months ago

Oh, I should also say for the record how these forms have been verified. For yaN and yaN-luk, I've confirmed that the program correctly generates the exact output specified in the यङ्-प्रकरणम् and यङ्-लुक्-प्रकरणम् of the वैयाकरणसिद्धान्तकौमुदी. I am currently doing the same for सन्नन्तs. णिजन्तs currently have various small issues in लुङ् as compared to the examples in the कौमुदी.

vvasuki commented 7 months ago

Sure. Do you also want support for upasargas? This would inflate the output but it would let you capture words like सङ्गच्छते, etc.

Ideally yes - but I am reluctant (for now) to inflate file sizes (or numbers) by 20x (though the final distributed dictionary files, being compressed, might be quite small).

Is it possible to just generate special sopasarga-dicts for unusual cases like सङ्गच्छते?

exact output specified in the यङ्-प्रकरणम् and यङ्-लुक्-प्रकरणम् of the वैयाकरणसिद्धान्तकौमुदी.

FWIW, I recall (puShpA dIxit saying) that these forms, per classical tradition, are supposed to be used only if they have established presence in laxya; and should not be used just because they can be generated. Alas - we don't have such a database.

drdhaval2785 commented 7 months ago

Regarding sopasarga dhAtus, I have supplied the headwords at https://github.com/vishvAsa/sanskrit/pull/45 .

Every line starting with a triple hash is a headword. The list would cover the majority of cases where a specific upasarga changes the verb's meaning. It may also cover sopasarga dhAtus with special grammatical properties (a different pada, iT, etc.) mentioned in the grammatical tradition.

Adding only these verb forms will not inflate the file size too much.

vvasuki commented 7 months ago

Adding only these verb forms will not inflate the file size too much.

This is excellent. So chArudeva-shAstri's list has about 4.5k sopasarga-dhAtus (rather than 80k). Could you come up with a list which @akprasad can use to generate the tiNantas? The file could have lines like:

अभि,आङ्,गमॢ
सं,गमॢ
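A file in this proposed layout would be simple to read. The Python sketch below assumes, based only on the two sample lines above, that the last comma-separated field is the dhAtu and everything before it is an upasarga, in order. The function names are hypothetical.

```python
def parse_sopasarga_line(line):
    """Split 'upasarga,...,dhatu' into (list_of_upasargas, dhatu)."""
    parts = [part.strip() for part in line.split(",")]
    parts = [part for part in parts if part]
    if not parts:
        raise ValueError("empty line")
    *upasargas, dhatu = parts  # the last field is taken to be the dhatu
    return upasargas, dhatu


def parse_sopasarga_file(text):
    """Parse every non-blank line of the list."""
    return [parse_sopasarga_line(ln) for ln in text.splitlines() if ln.strip()]


pairs = parse_sopasarga_file("अभि,आङ्,गमॢ\nसं,गमॢ\n")
for upasargas, dhatu in pairs:
    print(upasargas, dhatu)
```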

akprasad commented 7 months ago

Could you merge this examples directory into the main branch

Pushed.

commands to generate similar babylon files for कर्तरि, कर्मणि/भावे, णिच्, सन्, णिच्+कर्मणि/भावे, यङ् +कर्मणि/भावे and यङ्-लुक् +कर्मणि/भावे forms

It's broadly the same command. For कर्मणि, you can pass --prayoga karmani. For other सनादिप्रत्ययाः, you can pass their names in SLP1: Ric, yaN, yaNluk, san. If you don't pass --sanadi, none will be used.
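For batch generation, the command line could be assembled programmatically. The Python sketch below uses only the flags mentioned in this thread (`--sanadi`, `--prayoga`, `--desc`). Whether `--sanadi` and `--prayoga` can be combined in a single call is an assumption, the helper names are hypothetical, and the API is described above as unstable. The process is started without a shell, but it still runs `cargo` as an external program.

```python
import subprocess


def build_command(desc, sanadi=None, prayoga=None):
    """Assemble the cargo invocation as an argument list (no shell quoting)."""
    cmd = ["cargo", "run", "--release", "--example", "create_tinantas_babylon", "--"]
    if sanadi:
        cmd += ["--sanadi", sanadi]    # e.g. Ric, yaN, yaNluk, san
    if prayoga:
        cmd += ["--prayoga", prayoga]  # e.g. karmani
    cmd += ["--desc", desc]
    return cmd


def generate(out_path, repo_dir, desc, sanadi=None, prayoga=None):
    """Run the example inside a vidyut-prakriya checkout and save its stdout."""
    with open(out_path, "w", encoding="utf-8") as out:
        subprocess.run(
            build_command(desc, sanadi, prayoga),
            cwd=repo_dir,  # assumed to be the directory holding Cargo.toml
            stdout=out,
            check=True,
        )


print(build_command("demo description", sanadi="Ric", prayoga="karmani"))
```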

For details on the errors in these forms, see the test files, in particular kaumudi_43.rs to kaumudi_56.rs. Tests that have the #[ignore] attribute contain one or more words that vidyut-prakriya cannot generate.

I've greatly improved the quality of our san support, so that might be a good candidate to try next. The one error we need to fix is support for अधिजिगमिषति/अधीषिषति.

vvasuki commented 7 months ago

कृतार्थोऽस्मि ("my purpose is fulfilled") @akprasad । Is there a nice way to call the following from a python script (without simply running it as a shell command from python)?

cargo run --release --example create_tinantas_babylon -- --sanadi san --desc "vidyud-yantreRa janitAni sannteByas tiNantAni" > /home/vvasuki/gitland/indic-dict_stardict/stardict-sanskrit-vyAkaraNa/tiNanta/vidyut/vidyut-san/vidyut-san.babylon

Also can you give me the command to generate babylon for kRdanta-s (example format 1 2 ) ?

akprasad commented 7 months ago

Is there a nice way to call [...] from a python script? (without simply running it as a shell command from python?)

vidyut-prakriya has Python bindings but I haven't updated them in several months. I plan to update them this week. I don't know how fast the resulting code would be, though, given the conversion of data across the Rust/Python boundary. My guess is that Python bindings would be around 10x slower.

can you give me the command to generate babylon for kRdanta-s

I'll create a script for you and update here when it's ready.

vvasuki commented 7 months ago

@akprasad I see the following entry:

एध् एध् (एधँ वृद्धौ आत्मनेपदि-विधिलिङ्)<br><br>एध्येत<br>एध्येयाताम्<br>एध्येरन्<br>---<br>एध्येथाः<br>एध्येयाथाम्<br>एध्येध्वम्<br>---<br>एध्येय<br>एध्येवहि<br>एध्येमहि

This can be quite confusing. So, better to produce "एध् एध् (एधँ वृद्धौ अकर्तरि विधिलिङ्)" and so on for karmaNi/bhAve.

vidyut-prakriya has Python bindings but I haven't updated them in several months. I plan to update them this week. I don't know how fast the resulting code would be, though, given the conversion of data across the Rust/Python boundary. My guess is that Python bindings would be around 10x slower.

Perhaps, if you added some function which accepted the destination file location as an argument, that function could be called efficiently from python?

vvasuki commented 7 months ago

@akprasad - also, since svara rules have been implemented, it would be splendid if the entries included svaras, for example भ꣡वति (marking just the udAtta, instead of the more common भव॑ति).

akprasad commented 7 months ago

So, better to produce "एध् एध् (एधँ वृद्धौ अकर्तरि विधिलिङ्)" and so on for karmaNi/bhAve.

Sure, will fix.

Perhaps, if you added some function which accepted the destination file location as an argument, that function could be called efficiently from python?

I don't want to special-case our Rust API just to support this use case. When the Python bindings are ready, we can measure how fast this logic would be through the standard API.

Also can you give me the command to generate babylon for kRdanta-s

No updates here yet. Do you also want सनादि-कृदन्तs like जिज्ञासा, जिज्ञासु, ... ?

it would be splendid if the entries included svaras

Svara support is partial. We support basic cases like भ꣡वति but can't derive गोपाय॑ति. It needs much more extensive testing before I can recommend it here.

vvasuki commented 7 months ago

Also can you give me the command to generate babylon for kRdanta-s

No updates here yet. Do you also want सनादि-कृदन्तs like जिज्ञासा, जिज्ञासु, ... ?

Yes - but let them come together with regular kRdanta-s generated for sannanta-s.

Also, could you fix the script to not use Ric in the bookname field of the generated babylons (Nic is more comprehensible)?