j-hagedorn / trilogy

Reference datasets for folktale motifs, tale types, and annotated texts

Tag root and terminal (leaf) node in ATU datasets #40

Closed: j-hagedorn closed this 6 months ago

j-hagedorn commented 8 months ago

@sdaranyi asked that we: "List all terminals and roots. Group both lists into topics based on TMI labels. As L161 and L162 are frequently terminals, list all types for individual inspection where the motif string continues after them (or after any other terminal)". Planning to resolve this by doing two things:

  1. In the atu_seq dataset, creating a new column identifying the first and last motif within each tale_variant. This will let people using the .csv file filter on these easily.
  2. In the atu_graph.graphml file, adding two TRUE/FALSE fields to the node data: is_root and is_leaf.

@sdaranyi and @salmonix, please note that this approach will identify the first and last items present in the data. It will be an entirely different effort to identify what "should be" a terminal node, if the ATU does not identify these in the correct sequence.
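
A minimal sketch of step 1 and of the "motif string continues after a terminal" check, in plain Python rather than the repo's R. It is an illustration only: the rows are toy data, `motif_order` and `motif` are assumed column names (only `tale_variant` is named above), and the two boolean fields are one possible encoding of the proposed column.

```python
from collections import defaultdict

# Toy rows, not real ATU data. Only `tale_variant` is named in this thread;
# `motif_order` and `motif` are assumed column names.
rows = [
    {"tale_variant": "t1", "motif_order": 1, "motif": "A"},
    {"tale_variant": "t1", "motif_order": 2, "motif": "L161"},
    {"tale_variant": "t1", "motif_order": 3, "motif": "B"},
    {"tale_variant": "t2", "motif_order": 1, "motif": "C"},
    {"tale_variant": "t2", "motif_order": 2, "motif": "L161"},
]


def tag_positions(rows):
    """Return copies of the rows with is_first / is_last flags per tale_variant."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["tale_variant"]].append(row)

    tagged = []
    for items in groups.values():
        # Order motifs within the variant before deciding first and last.
        items = sorted(items, key=lambda r: r["motif_order"])
        last = len(items) - 1
        for i, row in enumerate(items):
            tagged.append({**row, "is_first": i == 0, "is_last": i == last})
    return tagged


def continues_after(tagged, motifs):
    """Tale variants where one of `motifs` occurs but is not the last motif."""
    return sorted(
        {
            r["tale_variant"]
            for r in tagged
            if r["motif"] in motifs and not r["is_last"]
        }
    )


tagged = tag_positions(rows)
flagged = continues_after(tagged, {"L161", "L162"})
```

Here `flagged` holds the variants to inspect by hand. As noted above, this only reflects the order present in the data, not what "should be" a terminal.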

sdaranyi commented 8 months ago

Shall I pull in due course?

j-hagedorn commented 8 months ago

@sdaranyi, once I make the changes, this issue will be closed out and the master branch will have what you need. Please feel free to ask questions or make notes on issues you see.

sdaranyi commented 8 months ago

Related: while I find it hilarious that the structuring principle of the TMI is the English alphabet, section Q "Rewards and punishments" should be screened for any compliance with Proppian F30-31.

j-hagedorn commented 6 months ago

As we approach the conference, I'm trying to prioritize efforts within this milestone. This one seems relatively simple, @sdaranyi and @salmonix, though it also seems a bit redundant, since this information will already be inherent in the structure created by #42.

j-hagedorn commented 6 months ago

I'm opting not to do this, since the info on root and leaf is technically already in the graph dataset, and this issue just requires making a view of the node set which applies tidygraph::node_is_root() and tidygraph::node_is_leaf().
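
A hedged illustration of the idea that root and leaf status can be derived from graph structure alone, sketched in plain Python rather than with the tidygraph calls named above. It assumes a directed graph built from consecutive motif pairs, and takes "root" to mean no incoming edges and "leaf" to mean no outgoing edges. That definition is an assumption for this sketch and is not claimed to match tidygraph's exact behavior.

```python
def root_and_leaf_flags(sequences):
    """Derive is_root / is_leaf for every motif from its edges in the merged graph."""
    nodes, has_incoming, has_outgoing = set(), set(), set()
    for seq in sequences:
        nodes.update(seq)
        # Each consecutive pair of motifs becomes a directed edge src -> dst.
        for src, dst in zip(seq, seq[1:]):
            has_outgoing.add(src)
            has_incoming.add(dst)
    return {
        n: {"is_root": n not in has_incoming, "is_leaf": n not in has_outgoing}
        for n in sorted(nodes)
    }


# Toy motif chains, not real ATU data.
flags = root_and_leaf_flags([["A", "B", "C"], ["A", "D"]])
roots = [n for n, f in flags.items() if f["is_root"]]
leaves = [n for n, f in flags.items() if f["is_leaf"]]
```

Because the flags are computed on the merged graph, they describe a motif's position across all tales at once rather than within any single tale.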

salmonix commented 6 months ago

Are you saying that the graph generated from the motif chains will carry this information even if a motif can be initial/final in one tale and in-chain in another? E.g.: A B C D and F G D N -> D is both terminal and in-chain. In the dataset this is clear, but it may be ambiguous in the generated graph. I agree that there is no need to mark it in the dataset, where it is obvious, but when generating a graph we perhaps should.
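
The ambiguity in this example can be sketched in plain Python. This is an illustration under stated assumptions, not the project's graph-building code: the function name, the tale ids, and the `initial_in` / `final_in` node attributes are all hypothetical. It records per-tale start and end positions as node attributes while the graph is generated, which is one way to keep what the merged edges alone lose.

```python
def build_sequence_graph(tales):
    """Merge motif chains into one directed graph, keeping per-tale endpoints."""
    nodes, edges = {}, set()
    for tale_id, motifs in tales.items():
        if not motifs:
            continue  # nothing to add for an empty chain
        for m in motifs:
            nodes.setdefault(m, {"initial_in": set(), "final_in": set()})
        # Remember in which tales a motif opens or closes the chain.
        nodes[motifs[0]]["initial_in"].add(tale_id)
        nodes[motifs[-1]]["final_in"].add(tale_id)
        edges.update(zip(motifs, motifs[1:]))
    return nodes, edges


# The two chains from the comment above; the tale ids are made up.
tales = {"tale_1": ["A", "B", "C", "D"], "tale_2": ["F", "G", "D", "N"]}
nodes, edges = build_sequence_graph(tales)

# Leaves by graph structure: motifs with no outgoing edge in the merged graph.
sources = {src for src, _ in edges}
graph_leaves = sorted(m for m in nodes if m not in sources)
```

In the merged graph D has the outgoing edge D -> N, so a purely structural leaf test would not report it, while `nodes["D"]["final_in"]` still records that D ends `tale_1`.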

j-hagedorn commented 6 months ago

This is a great point, @salmonix. You're right that the per-tale initial and end motifs would only be obtainable by subsetting the graph. For now, though, I'm thinking it's not worth tagging this in the main dataset until we are more certain of the validity of the sequencing. For that, we're waiting for the manually updated dataset (via resolution of #46 and #45), though I've removed those as dependencies for the version which we'll publish for the Riga conference, to give you time.

salmonix commented 6 months ago

Yes, I would not tag it in the dataset either if the position is preserved, only when generating the graph (as part of the code). Regardless of the doubts about the validity of the sequencing, one use case of the data is generating sequence graphs.