sebastian-nehrdich / byt5-sanskrit-analyzers


Max length for segmentation-lemma-tagging? #1

Open aso2101 opened 1 month ago

aso2101 commented 1 month ago

I've tried segmentation-lemma-tagging/run_inf.py with various modes on the following sentence:

āsīdaśeṣanarapatiśiraḥsamarcitaśāsanaḥ pākaśāsana ivāparacaturudadhimālāmekhalāyā bhuvo bhartā pratāpānurāgāvanatasamastasāmantacakraścakravartilakṣaṇopetaścakradhara iva karakamalopalakṣyamāṇaśaṅkhacakralāñchano hara iva jitamanmatho guha ivāpratihataśaktiḥ kamalayoniriva vimānīkṛtarājahaṃsamaṇḍalo jaladhiriva lakṣmīprasūtirgaṅgāpravāha iva bhagīrathapathapravṛtto raviriva pratidivasopajāyamānodayo meruriva sakalabhuvanopajīvyamānapādacchāyo diggaja ivānavaratapravṛttadānādrīkṛtakaraḥ kartā mahāścaryāṇāmāhartā kratūnāmādarśaḥ sarvaśāstrāṇāmutpattiḥ kalānāṃ kulabhavanaṃ guṇānāmāgamaḥ kāvyāmṛtarasānāmudayaśailo mitramaṇḍalasyotpātaketurahitajanasya pravartayitā goṣṭhībandhānāmāśrayo rasikānāṃ pratyādeśo dhanuṣmatāṃ dhaureyaḥ sāhasikānāmagraṇīrvidagdhānāṃ vainateya iva vinatānandajanano vainya iva cāpakoṭisamutsāritasakalārātikulācalo rājā śūdrako nāma.

(the first sentence of Bāṇa's Kādambarī). The results are cut off around the word upeta (the 24th word in segmentation) in segmentation-lemma-morphosyntax, and rājahaṃsa in lemma-morphosyntax (the 44th word) and pravṛtta in segmentation alone (55th). These limits are the same for txt and tsv output. Is there a reason why the analyzer seems to stop before the end of the input?

sebastian-nehrdich commented 1 month ago

Yes, by design the model can only handle phrases up to 512 characters in generation length (which corresponds to about 250 characters of input). One easy solution is to define break points in the text and "stitch" the results back together afterwards; that's what we do on the dharmamitra website. If it's important to you, I can add this to the codebase here as well. It might be better, though, to preprocess the input text, for example to make sure that no line has more than 250 characters.

On Tue, Oct 8, 2024 at 8:38 PM Andrew Ollett wrote: (original message, quoted in full above)

aso2101 commented 1 month ago

Okay, that makes good sense, and I'll make sure the input data is broken up. This is a naive question, but will different breaks within a sentence affect the segmentation/lemmatization/analysis?

sebastian-nehrdich commented 1 month ago

There is likely a small effect on quality, but it shouldn't be too pronounced. Ideally, of course, you should break on whitespace, since breaking in the middle of a long compound will almost certainly introduce problems...
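
For anyone who wants to do the preprocessing themselves, here is a minimal sketch of the approach discussed in this thread: break the input on whitespace into pieces of at most 250 characters, analyze each piece, and stitch the results back together. This is not the code used on the dharmamitra website or in this repository. The names `split_on_whitespace`, `analyze_in_chunks`, and the `analyze` callable are hypothetical placeholders for illustration; `analyze` stands in for whatever function wraps the model call. Note that `len()` counts Unicode code points; if the limit turns out to be measured differently (for instance in bytes), a smaller `max_chars` may be needed.

```python
def split_on_whitespace(text, max_chars=250):
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars (e.g. a very long compound) is kept
    whole in its own chunk, because it cannot be split on whitespace.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            # The word fits, or it is an over-long word starting a new chunk.
            current = candidate
        else:
            # Adding the word would exceed the limit: close the current chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def analyze_in_chunks(text, analyze, max_chars=250, sep=" "):
    """Run `analyze` (hypothetical callable: str -> str) on each chunk and
    join the per-chunk results with `sep`."""
    return sep.join(analyze(chunk) for chunk in split_on_whitespace(text, max_chars))


if __name__ == "__main__":
    sample = "rājā śūdrako nāma " * 40
    pieces = split_on_whitespace(sample)
    # Every piece respects the limit because no single word exceeds it.
    print(len(pieces), max(len(p) for p in pieces))
```

Joining with a single space suits plain-text output; for tsv output a newline separator (`sep="\n"`) is probably the better choice, though that depends on how the per-chunk output is formatted.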