j-hagedorn / trilogy

Reference datasets for folktale motifs, tale types, and annotated texts

Remove tale variants from atu_seq when a motif is repeated in sequence #45

Open j-hagedorn opened 7 months ago

j-hagedorn commented 7 months ago

This is based on an issue identified by @salmonix. In the example identified, the sequences of the tale variants are as follows for tale 1341A:

  1. "J2356","J2136","J2136"
  2. "J2356","J2136","J581"
  3. "J2356","J581","J2136"
  4. "J2356","J581","J581"

The text runs as follows: "...The thieves kill him, too [J581, J2136]. (3) Two foolish slaves are recaptured because of their talkativeness [J581, J2136]..." The motifs identified are: J581 (Wisdom and Folly, Foolishness Of Noise-Making When Enemies Overhear) and J2136.1 (Wisdom and Folly).

@salmonix suggests that in our cleared data we should perhaps retain only variants 2 and 3 above and remove 1 and 4, where the same motif is repeated in a row.

Adding this as an issue for discussion: @sdaranyi and @salmonix, are we certain that we want to remove all tale variants where a motif is repeated 2 or more times in a row?
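
To make the proposal concrete, here is a minimal Python sketch of the filter under discussion. It assumes that the four variants arise as the Cartesian product of the bracketed motif alternatives, which is inferred from the example rather than confirmed; `slots` and `has_adjacent_repeat` are hypothetical names, not part of the existing codebase.

```python
from itertools import product

def has_adjacent_repeat(seq):
    """True if any motif is immediately followed by itself."""
    return any(a == b for a, b in zip(seq, seq[1:]))

# Assumed slots for tale 1341A: one fixed motif, then two ambiguous brackets.
slots = [["J2356"], ["J2136", "J581"], ["J2136", "J581"]]

# Every combination of one motif per slot, in order.
variants = [list(v) for v in product(*slots)]

kept = [v for v in variants if not has_adjacent_repeat(v)]
removed = [v for v in variants if has_adjacent_repeat(v)]
```

With these inputs, `kept` holds variants 2 and 3 from the list above and `removed` holds variants 1 and 4.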

sdaranyi commented 7 months ago

Which tale was this? Can you pls add the ATU number? (3) suggests to me that maybe we are dealing with story variants that seriously influence plot structure, apart from minor variants. In that case we should retain whatever we can.

This is btw the typical open-ended question where we can ask for expert advice, involving those in the know. Why not speak out and spare ourselves future criticism from them? They should decide how they want to contribute to our design, and then fun could be doubled while frustration could be halved. We could identify such neuralgic points for them to join the crew.

sdaranyi commented 7 months ago

Yes, exactly the problem I had in mind. Until we devise a plan on how to deal with variants -- based on the AaTh btw --, we cannot resolve this. Now off to lunch before my next telco.

j-hagedorn commented 7 months ago

This is tale 1341A. I would be comfortable tagging this and potentially other questions with a 'question' tag and suggesting to experts that this is one way they could contribute to the dataset. I'd want to remove it from the milestone of things we want to resolve prior to initial publishing of the dataset, before the conference.

sdaranyi commented 7 months ago

Exactly. Decision point no. 1. But then we should exclude these types from the string set as well. With 68 K at hand we can afford to delegate such problems to the wise, thereby making them co-own the effort.

j-hagedorn commented 7 months ago

> Yes, exactly the problem I had in mind. Until we devise a plan on how to deal with variants -- based on the AaTh btw --, we cannot resolve this. Now off to lunch before my next telco.

@sdaranyi, if you are certain that this is a problem, it would not be difficult to remove such occurrences from the dataset. I just want to ensure that it truly is a problem, i.e. that we can reasonably expect that motifs do not occur twice in a row with enough frequency to justify retaining such instances in the permutations of sequences we generate.

sdaranyi commented 7 months ago

As long as we know the ones we are excluding for this reason (which means that at some point they will be welcome back), I don't see a problem.
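
One way to keep track of what is excluded, so that it can be welcomed back later, is to flag rows instead of deleting them. The following is only a sketch: the field names `atu_id`, `variant`, `motifs` and `exclude_reason` are assumptions for illustration and may not match the actual columns of atu_seq.

```python
def flag_repeats(rows):
    """Return copies of rows with an 'exclude_reason' field; nothing is deleted."""
    flagged = []
    for row in rows:
        seq = row["motifs"]
        repeated = any(a == b for a, b in zip(seq, seq[1:]))
        reason = "adjacent_repeat" if repeated else None
        flagged.append({**row, "exclude_reason": reason})
    return flagged

# Two of the 1341A variants from this issue, in a hypothetical row layout.
rows = [
    {"atu_id": "1341A", "variant": 1, "motifs": ["J2356", "J2136", "J2136"]},
    {"atu_id": "1341A", "variant": 2, "motifs": ["J2356", "J2136", "J581"]},
]

flagged = flag_repeats(rows)
active = [r for r in flagged if r["exclude_reason"] is None]
parked = [r for r in flagged if r["exclude_reason"] is not None]
```

The published set would then be `active`, while `parked` stays available for later review.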

I am not certain that variants are the source of this problem, but that is how I understand Uther. Aarne and Thompson were more generous with variants, but Uther merged them and generalized his shorthand to the next level of interpretation, suggesting a more abstract common content denominator; only he knows why. It may have worked for the profession until now, but clearly variant strings must be separated, not merged.

salmonix commented 7 months ago

I put my daughter on revising these cases. She will do it this month and I will guide her. For now I would leave the repetitions out for 2 reasons:

  1. If they are recursive elements, they would not, imo, add much to the key points of the tale structures. Like: performing 3 tasks instead of one.
  2. If they are due to a parsing error (as the data is recorded with ambiguity), they should be out.

We lose important information only if we eliminate sequences such as A,B and B,A where A and B can both be terminal (resolution: punish the evil stepmother and marry the girl OR marry the girl and punish the stepmom). But I have not seen that much. I would leave it out for now until we have some manually revised data.

Also, regarding contracting the numbers to 2 digits (as K1076.2.3 -> K1076.2, for instance): it seems that most of the time this is like referring to motifs by a superclass tag instead of the particular motif. Like saying 'bringing out an object' instead of 'bringing out a mirror' and 'bringing out a mortar'. However, I see that in some cases this categorization is wrong and it may lead to errors. What I can imagine for this case is:

  1. Take the full token with all the digits (K1076.2.3) and make graph 1.
  2. Reduce the digits (K1076.2) and make graph 2.
  3. Compare the two graphs to see whether they have the same main base characteristics.

If yes, we know that K1076.2.3 can be substituted with K1076.2. That would be the theory.
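
The digit-reduction check could be sketched roughly as follows. This is an illustration under assumptions: "graph" is taken to mean the set of directed motif-to-motif transitions, and `summary` is a crude stand-in for the "main base characteristics", which are not defined in this thread. The sequences are toy data assembled from codes mentioned here, not real atu_seq rows.

```python
def truncate(motif, levels=2):
    """Keep the first `levels` dot-separated parts, e.g. K1076.2.3 -> K1076.2."""
    return ".".join(motif.split(".")[:levels])

def edges(sequences):
    """Directed motif-to-motif transitions found in any sequence."""
    return {(a, b) for seq in sequences for a, b in zip(seq, seq[1:])}

def summary(edge_set):
    """Crude stand-in for the 'main base characteristics' of a graph."""
    nodes = {n for edge in edge_set for n in edge}
    out_degree = {}
    for source, _ in edge_set:
        out_degree[source] = out_degree.get(source, 0) + 1
    return {
        "nodes": len(nodes),
        "edges": len(edge_set),
        "max_out_degree": max(out_degree.values(), default=0),
    }

# Toy sequences built from codes mentioned in this thread; not real data.
full = [
    ["J2136", "J2461.1.7", "J2136.5.6"],
    ["J2136", "J2461.1.7.1", "J2136.5.7"],
]
reduced = [[truncate(m) for m in seq] for seq in full]

graph_1 = summary(edges(full))
graph_2 = summary(edges(reduced))
```

In this toy case the two summaries differ, because truncation collapses both sequences into one. Which statistics should count, and how large a difference is acceptable, remains an open choice.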

j-hagedorn commented 7 months ago

Thanks @salmonix and thanks to your daughter. Will she be using R or Python? Let me know if you have thoughts about how best to integrate the changed code into the existing codebase. Regarding your comment on reducing the digits, I've made an issue over in our other repo (https://github.com/j-hagedorn/folktale_dna/issues/5).

salmonix commented 7 months ago

Emma will annotate the text manually, so we can re-parse it searching for a given pattern. There will be an additional line right below the tale with a tag and the motifs extracted. So far the line will look like this (for tale 1692 as an example):

ANN: 1692, J2136, J2461.1.7, J2461.1.7.1, [T: J2136.5.6,J2136.5.7,J2136.5.5]

T: means tail variants. We also thought of another tag, like [R: motif 1, motif 2], marking that the motifs are reversible. This cleanup will strictly focus on understanding the human text of ATU.
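
A re-parse of such a line might look roughly like the sketch below. The function names are hypothetical, and the reading of `[T: ...]` as a set of alternative final motifs, each producing its own sequence, is an assumption that should be checked against the annotation rules once they are settled.

```python
import re

ANN_RE = re.compile(r"^ANN:\s*(?P<body>.*)$")
TAG_RE = re.compile(r"\[(?P<tag>[A-Z]):\s*(?P<items>[^\]]*)\]")

def split_items(text):
    """Split a comma-separated string into stripped, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]

def parse_ann(line):
    """Parse one ANN line into tale number, plain motifs and tagged groups."""
    match = ANN_RE.match(line.strip())
    if match is None:
        raise ValueError("not an ANN line")
    body = match.group("body")
    tags = {}
    for found in TAG_RE.finditer(body):
        tags.setdefault(found.group("tag"), []).append(split_items(found.group("items")))
    fields = split_items(TAG_RE.sub("", body))
    return {"tale": fields[0], "motifs": fields[1:], "tags": tags}

def expand_tails(parsed):
    """ASSUMPTION: each motif in a T group is an alternative ending."""
    tails = [t for group in parsed["tags"].get("T", []) for t in group]
    if not tails:
        return [parsed["motifs"]]
    return [parsed["motifs"] + [t] for t in tails]

line = "ANN: 1692, J2136, J2461.1.7, J2461.1.7.1, [T: J2136.5.6,J2136.5.7,J2136.5.5]"
parsed = parse_ann(line)
sequences = expand_tails(parsed)
```

Because `TAG_RE` accepts any single capital letter, an `[R: ...]` group would be collected in `parsed["tags"]["R"]` in the same way. How it should affect the generated sequences is left open here.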

I also thought of adding a subjective tag, maybe on a separate line. E.g. in many tales the motifs are interchangeable. In the example above I can imagine that J2461.1.7, J2461.1.7.1 are two motifs (the mortar and the mirror) and their order does not really matter. It is put into the catalogue as is, but human readers who know intuitively how stories run know that here it does not matter. So maybe we can add one more line, like ## ANN with this version. The marker would stand for a 'reconstructed' version, like in linguistics.

Any further ideas welcome. We would have a manual check and let's use it for the best result. She jumps into it from next week on. Yeah, she is on my cost partly anyway, so a good reason to pay her from my company. :D

j-hagedorn commented 7 months ago

That's great news, @salmonix. Just to be clear, she will be going through the original .txt file and manually producing a .csv file? From the example you give, it sounds as though that file's structure will look a lot like this:

image.png (https://github.com/j-hagedorn/trilogy/assets/7065685/a1a73b2c-473d-4db9-a5e0-39db339a59dc)

...which is the structure of the current script at this point (https://github.com/j-hagedorn/trilogy/blob/master/fetch/fetch_taletypes.R#L101). That's nice because we can pretty easily apply the remainder of the logic to produce a variant of atu_seq based on the more accurate manual annotations. Actually, I'm thinking that her version will become the primary, and we'll just archive the other.

Notes on method

  • If in sequence, no need for terminal tag. If she is removing or re-ordering the motifs in such a way that the final motif in the sequence is always the one which occurs last in the story narrative, then we don't need her to note that it is the terminal motif, since that will be clear from the structure.
  • Switched orders are variants? For the example that you gave, if they both occur and are distinct motifs, then I'm not sure what you mean by saying that "their order actually does not really matter". If both orderings occur, then wouldn't these be discrete variants of the tale?
  • Resolving unforeseen challenges. I imagine that she will encounter a number of questions that we can't foresee. Will she be noting these and e-mailing us about them?

Impact on current issues

As I see it, the creation and subsequent flattening of the manually-annotated file will allow us to close #45 (this one), #44 (since the main need for the AT was its more logical structure), and #46 (since Emma will be manually applying consistent structure to denote variants). With the tale sequence laid out clearly and accurately, it should be easy to close #40 as well. That's great!

salmonix commented 7 months ago

She will add the annotation to the source text file.

Note that the tagging will only be about tales where the text is ambiguous. Where it is straightforward, we just parse as now.

j-hagedorn commented 6 months ago

@salmonix, how is the manually-annotated file going? Does Emma have any questions that @sdaranyi or I can help with?

j-hagedorn commented 6 months ago

As we approach the conference, I'm trying to prioritize efforts within this milestone. This one seems high value but also high effort, so I'm not working on it and am assuming that its completion will be contingent upon the manual annotation, @salmonix.