j-hagedorn / trilogy

Reference datasets for folktale motifs, tale types, and annotated texts

Resolve identified data cleaning issues with TMI dataset #41

Closed j-hagedorn closed 8 months ago

j-hagedorn commented 9 months ago

@sdaranyi noted that "Looking at today's pulled latest TMI, there's something wrong with the structure. Column 1 has mostly motif numbers only with labels in Column 3, but in a number of rows both the motif number and the label are present. Worse, based on this table you said there are 46222 individual motifs, whereas it has 54906 lines. Which one is the correct number? Chances are that this could change the proportion of motifs used in the ATU."

@sdaranyi, can you please provide examples where both the motif number and the label are present in the id field? That would help me find the issue.

@sdaranyi, my tmi dataset still shows 46,222 unique values in the id field and 46,230 rows in the dataset, with the following items duplicated and needing to be resolved:

[image.png: https://github.com/j-hagedorn/trilogy/assets/7065685/54639ee8-f301-48de-85e4-c6f3c12da2e5]
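For anyone who wants to reproduce the comparison of unique id values against row count, here is a minimal sketch in Python. It assumes the dataset can be read as CSV text with a column named `id`. The function name `summarize_ids`, the `motif_name` column, and the sample rows are all made up for illustration and are not taken from the real tmi file.

```python
import csv
import io
from collections import Counter


def summarize_ids(csv_text, id_column="id"):
    """Return (row_count, unique_id_count, duplicated_ids) for CSV text.

    Hypothetical helper: only the column name "id" comes from the thread.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
    counts = Counter(row[id_column] for row in rows)
    # Keep only the id values that occur more than once.
    duplicated = {value: n for value, n in counts.items() if n > 1}
    return len(rows), len(counts), duplicated


# Made-up sample data, not real tmi content.
sample = "id,motif_name\nA1,first label\nA1.1,second label\nA1,first label\n"
n_rows, n_unique, dupes = summarize_ids(sample)
print(n_rows, n_unique, dupes)  # prints: 3 2 {'A1': 2}
```

If the row count exceeds the unique count, the `duplicated` dictionary lists which id values account for the difference.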

@sdaranyi and @salmonix, please note in the comments below any additional data cleaning issues you find in the tmi dataset. If they are minor, I will try to resolve them quickly here rather than opening new issues to track them.

sdaranyi commented 9 months ago

> @sdaranyi, can you please provide examples where both the motif number and the label are present in the id field? That would help me to find the issue.

In TMI, starting at rows 1613-1614, the content changes and empty rows are inserted as spacing. This continues through rows 10278-10279.

The same occurs in rows 26415, 26425, 27036, 33157, and 45630-45632.
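Rows like these could also be located programmatically. The following is a rough sketch, not the maintainer's method: it flags records that are entirely empty and records whose id field does not look like a bare motif number. The regular expression is an assumption about what a clean id looks like (a capital letter followed by dotted numbers, such as `A1.2`) and would likely need widening for the real file. The helper name `flag_rows`, the `motif_name` column, and the sample data are hypothetical.

```python
import csv
import io
import re

# Assumption: a clean id is one capital letter plus dotted numbers, e.g. "A1.2.3".
ID_PATTERN = re.compile(r"^[A-Z]\d+(\.\d+)*\.?$")


def flag_rows(csv_text, id_column="id"):
    """Return (blank_records, malformed_id_records) as lists of record numbers.

    Record numbers count the header as 1, so the first data record is 2.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    id_index = header.index(id_column)
    blank, malformed = [], []
    for record_number, row in enumerate(reader, start=2):
        if not any(cell.strip() for cell in row):
            # Every cell is empty (or the line itself was empty).
            blank.append(record_number)
            continue
        id_value = row[id_index].strip() if len(row) > id_index else ""
        if not ID_PATTERN.match(id_value):
            # The id holds something other than a bare motif number.
            malformed.append(record_number)
    return blank, malformed


# Made-up sample: record 3 is empty, record 4 has a label inside the id field.
sample = (
    "id,motif_name\n"
    "A1,first label\n"
    ",\n"
    "A2 second label,\n"
    "A2.1,third label\n"
)
blank_rows, bad_id_rows = flag_rows(sample)
print(blank_rows, bad_id_rows)  # prints: [3] [4]
```

Record numbers here come from the parser, so they may not match spreadsheet row numbers if any field contains a line break.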

j-hagedorn commented 9 months ago

Thanks, @sdaranyi

j-hagedorn commented 8 months ago

@sdaranyi, this issue should be fixed now too. I tested the file in Excel and it works fine.
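As a complement to the Excel test, a parser-based check can compare the number of physical lines in the file with the number of records a CSV parser sees. In general, a quoted field that contains a line break makes these two numbers differ. Whether that explains the 54906 lines versus 46,230 rows mentioned above is not confirmed in this thread. The sketch below uses made-up data and a hypothetical helper name.

```python
import csv
import io


def count_lines_and_records(csv_text):
    """Return (physical_line_count, parsed_record_count), header included."""
    physical_lines = len(csv_text.splitlines())
    records = sum(1 for _ in csv.reader(io.StringIO(csv_text, newline="")))
    return physical_lines, records


# Made-up sample: the label for A1 contains a line break inside quotes.
sample = 'id,motif_name\nA1,"label split\nover two lines"\nA2,plain label\n'
lines, records = count_lines_and_records(sample)
print(lines, records)  # prints: 4 3
```

When the two counts agree, every record sits on exactly one line. When they differ, a line-based tool and a CSV-aware tool will report different sizes for the same file.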