j-hagedorn / trilogy

Reference datasets for folktale motifs, tale types, and annotated texts

Find a way to merge the ATU with the AT #44

Closed sdaranyi closed 8 months ago

sdaranyi commented 9 months ago

The AT, which Uther updated, contains useful extra information that the ATU ignores or condenses, most prominently on the motif strings of type variants. Integrating both sources could help us disambiguate overly long motif chains, which often run past the usual terminal motif. See, e.g., the suspicious cases where L161 is non-terminal, although quite often it is terminal.

j-hagedorn commented 9 months ago

Okay, this sounds good, @sdaranyi. I think we'll need to clean and import the AT as a first step, and then think about how best to integrate it. If you can find/send the best machine-readable copy you have, we can start there.

sdaranyi commented 9 months ago

It was uploaded for you on Google Drive. The one I showed you, in pdf, is the best anybody's got.

salmonix commented 9 months ago

There are various parsing issues.

  1. Variants given in textual form instead of the expected [ xxx, yyy ] tagging. That may lead to 'terminals' ending up as non-terminals.
  2. Ambiguity in the meaning of [ xxx, yyy ] tagged variants. It is sometimes not clear to me what the semantic relation is between the elements in the [ ]. Sometimes they feel like synonyms, but then they are not really variants.
  3. Repeated elements: sometimes elements are unnecessarily repeated. I gave some examples in a mail I sent around.

I would try this algorithm for cleaning: remove the minor levels from the motif numbering ( J1234.5.5.6 -> J1234.5 ) and after that remove the repeated elements. That may break real repetitions, but it eliminates noise. Because of the ambiguity, cleaning up the data needs manual help: checking the cases where the above algorithm would apply. It would also reduce the motif chains to fewer but still meaningful elements.
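The cleaning steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the motif pattern (one capital letter, digits, dotted minor levels), the names `truncate_motif` and `clean_chain`, and the reading of "repeated elements" as identical neighbours in a chain are all hypothetical and not taken from the repository.

```python
import re
from itertools import groupby

# Assumed motif shape: one capital letter, digits, then dotted minor levels,
# e.g. "J1234.5.5.6". Illustrative only; not the repository's parser.
MOTIF_RE = re.compile(r"^([A-Z]\d+)((?:\.\d+)*)$")


def truncate_motif(motif, keep_minors=1):
    """Keep the base number plus at most `keep_minors` minor levels."""
    text = motif.strip().rstrip(".")
    match = MOTIF_RE.match(text)
    if match is None:
        return text  # unrecognised form: leave it for manual review
    base, minors = match.group(1), match.group(2)
    levels = [part for part in minors.split(".") if part]
    return ".".join([base] + levels[:keep_minors])


def clean_chain(chain, keep_minors=1):
    """Truncate every motif, then collapse runs of identical neighbours.

    Assumption: only consecutive repeats count as noise; a motif that
    recurs later in the chain is kept.
    """
    truncated = [truncate_motif(m, keep_minors) for m in chain]
    return [motif for motif, _ in groupby(truncated)]


print(truncate_motif("J1234.5.5.6"))                       # J1234.5
print(clean_chain(["J1234.5.5.6", "J1234.5.1", "K100"]))   # ['J1234.5', 'K100']
```

Collapsing only neighbouring repeats is the more conservative choice: it leaves a motif that legitimately recurs later in a tale untouched, while still flagging which chains changed so they can be checked by hand.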

sdaranyi commented 9 months ago

We have a lot to reduce from, 68 K. Worth trying even if we call it an experiment.

sdaranyi commented 9 months ago

To understand and rank the results, given a set of test examples, listing the type content could be the first step:

E.g. 510A Cinderella.

(...)

[N711.6, N711.4] --> N711

N711.6. /Prince sees heroine at ball and is enamored./
N711.4. /Prince sees maiden at church and is enamored./

        --> N711. /King (prince) accidentally finds maiden and marries her./

(...)

Uther seems to have looked for the conceptual common denominator when bracketing different motifs, and thereby created forkings/type variants through different strings. The reduced form expresses this denominator, a hypernym expressed by a sentence (sort of).

If we list the motifs in a chain and there is practically no difference in content between the original and the reduced variant, i.e. the story is what it used to be with only one (perhaps very minor) detail lost, we have the proof of the pudding for anyone with practical usability in mind.
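As a rough sketch of that listing step, the snippet below reduces a bracketed variant group to the deepest motif number its members share and prints the variants next to the reduced form. The function names (`motif_levels`, `common_parent`, `show_reduction`) and the small description table are hypothetical; the descriptions are copied from the 510A example above, and treating the shared numbering prefix as the hypernym is an assumption, not a claim about how Uther bracketed the motifs.

```python
def motif_levels(motif):
    """Split 'N711.6.' into its numbering levels: ['N711', '6']."""
    return motif.strip().rstrip(".").split(".")


def common_parent(variants):
    """Return the deepest motif number shared by all variants, or None."""
    split = [motif_levels(v) for v in variants]
    shared = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break  # the variants diverge at this level
        shared.append(parts[0])
    return ".".join(shared) if shared else None


# Descriptions taken from the Cinderella (510A) example in this thread.
DESCRIPTIONS = {
    "N711.6": "Prince sees heroine at ball and is enamored.",
    "N711.4": "Prince sees maiden at church and is enamored.",
    "N711": "King (prince) accidentally finds maiden and marries her.",
}


def show_reduction(variants):
    """List each variant with its text, followed by the reduced form."""
    parent = common_parent(variants)
    lines = [f"{v}: {DESCRIPTIONS.get(v, '?')}" for v in variants]
    lines.append(f"--> {parent}: {DESCRIPTIONS.get(parent, '?')}")
    return "\n".join(lines)


print(show_reduction(["N711.6", "N711.4"]))
```

Reading the variant texts against the reduced text in one place is what lets a human judge whether the lost detail matters for a given tale type.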

j-hagedorn commented 8 months ago

As we approach the conference, I'm trying to prioritize efforts within this milestone. This one seems like High effort, Uncertain value, so I'm de-prioritizing it unless you say otherwise, @sdaranyi and @salmonix .

sdaranyi commented 8 months ago

I agree with all your effort/value estimates; please proceed at your convenience.

salmonix commented 8 months ago

Hi Josh, after clearing up my schedule, and in between searching for a new job, I am happy to announce that I can focus more here. :D Regarding https://github.com/j-hagedorn/trilogy/issues/43: what do you mean by completing the data dictionary? I lost track of it. Let me know and I can jump on it. Still in Python, though; my R has not improved an inch. Let's use Haskell... :D Yours, Laszlo
