Resolve non-valid ATU IDs in AFT dataset

j-hagedorn commented 7 months ago

Pasting your e-mail to track its resolution, @sdaranyi
"Hi Josh, I started to work with the AFT to exemplify on an existing corpus how things would look if people started using an existing Python workflow in Orange...:

I am using the aft.csv (21-03-25 in my folder -- pretty old, if the current version has mended the problem to be described below, please ignore this);
It has atu-Id values well beyond 2400, the upper limit of tale types in the ATU. With not even the list of Discontinued types containing identifiers between 3000-7080, I simply have no idea where Ashliman may have gotten these types from, but chances are that he used something else, not the ATU;
The column type_identifier contains strings not found in the ATU. No matches whatsoever, so again the question is if Ashliman used an older typology or his own."

sdaranyi commented 7 months ago

I have now checked it against the AT, where the last type is No. 2411, so those numbers come from elsewhere. We had already noticed something problematic there, cf. my file in the Trilogy local folder entitled 'fishy aft ids_ill matches with atu.xlsx', albeit with a much shorter list -- have you got a copy of that? I can now see that it was discovered while you prepared Fig. 1 for JOHD, and we guessed that these must be Christiansen’s tale types (Christiansen, 1992; no downloadable copy anywhere). Bingo: see aft.xlsx (22-02-27), where row 1192 says: "Link to additional Fairy Cup Legends (migratory legends of Christiansen type 6045 and other stories of drinking vessels stolen from or abandoned by fairies)." This will be a good-to-know limitation.
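The range check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`numeric_part`, `out_of_range`) and the sample IDs are hypothetical, and the default upper bound of 2400 is simply the figure quoted in the e-mail (use 2411 to follow the AT count instead).

```python
import re

# Assumption: tale type numbers end around 2400, the limit quoted in the thread.
ATU_MAX = 2400


def numeric_part(type_id):
    """Return the leading integer of an ID such as '510A', or None if absent."""
    match = re.match(r"\s*(\d+)", str(type_id))
    return int(match.group(1)) if match else None


def out_of_range(ids, upper=ATU_MAX):
    """Return the IDs that have no leading number or a number above `upper`."""
    flagged = []
    for type_id in ids:
        number = numeric_part(type_id)
        if number is None or number > upper:
            flagged.append(type_id)
    return flagged


# Hypothetical sample values, not read from aft.csv.
sample_ids = ["510A", "2400", "6045", "3000", "abc"]
print(out_of_range(sample_ids))
```

Anything this flags would still need a manual look, since an ID outside the range may be a legitimate reference to another typology (such as Christiansen's) rather than an error.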

j-hagedorn commented 7 months ago

Hey @sdaranyi, it looks like this issue is resolved in the master version of the repository. From your comments above, it seems as though you may be referencing older file versions, which I'd recommend you get rid of. When I check the current datasets:

library(readr)  # read_csv()
library(dplyr)  # anti_join(), select(), %>%
atu_df <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/master/data/atu_df.csv")
aft <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/master/data/aft.csv")
# Rows of aft whose atu_id has no match in atu_df
unmatched <- aft %>% anti_join(atu_df %>% select(atu_id, tale_name), by = "atu_id")

Then the number of unmatched IDs between the aft and atu_df datasets, i.e. nrow(unmatched), is 0. So the aft in its published version does not contain those Christiansen references. I'm marking this issue as closed, but please let me know if you see anything else that looks odd! I'm grateful to have other eyes on these datasets for quality control.
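For anyone following along in a Python workflow, the same anti-join check can be approximated with the standard library alone. This is a minimal sketch, not the repository's code: the `anti_join` helper and the toy CSV text are hypothetical stand-ins for atu_df.csv and aft.csv, and only the `atu_id` and `tale_name` column names come from the snippet above.

```python
import csv
import io


def anti_join(left_rows, right_rows, key):
    """Return the rows of `left_rows` whose `key` value is absent from `right_rows`."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


# Hypothetical toy data; the real files live in the trilogy repository.
atu_csv = "atu_id,tale_name\n300,The Dragon-Slayer\n510A,Cinderella\n"
aft_csv = "atu_id,text\n300,first tale\n510A,second tale\n6045,third tale\n"

atu_rows = list(csv.DictReader(io.StringIO(atu_csv)))
aft_rows = list(csv.DictReader(io.StringIO(aft_csv)))

unmatched = anti_join(aft_rows, atu_rows, "atu_id")
print(len(unmatched))                        # number of aft rows without a match
print([row["atu_id"] for row in unmatched])  # the offending IDs
```

A count of zero on the real files would correspond to the result reported above; the toy data deliberately includes one unmatched ID so the check has something to find.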

sdaranyi commented 7 months ago

Hi Josh, thanks. I've downloaded both now. The aft.csv, upon import in Excel, has extra empty rows from row 1476 onwards, which could be unpleasant for processing. The atu_df.csv, on the other hand, is littered with question marks in black rhombuses, see below. How can I get rid of them?

[image: image.png]

As soon as time shall permit I'll redo the respective experiments in Orange.

sdaranyi commented 6 months ago

Actually something stranger is happening (unless my laptop is making fun of me). From row/line 1476 in the aft.csv, the atu_ids disappear and some story text appears in every second line, stripped of any recognizable cues.

j-hagedorn commented 6 months ago

> From row/line 1476 in the aft.csv, the atu_ids disappear and some story text appears in every second line, stripped of any recognizable cues.

My guess is that this is caused by Excel's settings for importing the .csv. For instance, the commas inside the quoted text might be causing line returns. I can look into it and see if there's something I can do with how the file is encoded.
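One way such row shifts can arise is sketched below, as an assumption rather than a diagnosis of the actual file. When a quoted field contains a comma or a line break, a CSV-aware parser still sees one record, while a tool that splits on physical lines sees extra rows. The toy data and the `flatten_line_breaks` helper are hypothetical.

```python
import csv
import io

# Hypothetical two-record file: the second text field holds a comma and a
# line break inside its quotes.
raw = (
    "atu_id,text\n"
    '300,"A short tale."\n'
    '510A,"First line, with a comma.\nSecond line of the same tale."\n'
)

physical_lines = raw.splitlines()                          # naive split
records = list(csv.reader(io.StringIO(raw, newline="")))   # CSV-aware parse


def flatten_line_breaks(rows):
    """Replace line breaks inside every field with single spaces."""
    return [[" ".join(field.splitlines()) for field in row] for row in rows]


# Write the flattened records back out so each record is one physical line.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(flatten_line_breaks(records))
flattened_lines = buffer.getvalue().splitlines()

print(len(physical_lines), len(records), len(flattened_lines))
```

Flattening line breaks loses the original paragraph structure of the tale text, so it is a trade-off made for spreadsheet tools, not necessarily the fix applied in the repository.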

> The atu_df.csv, on the other hand, is littered with question marks in black rhombuses, see below.

I'll flag this as another cleanup step in #42
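The "question mark in a black rhombus" glyph is usually the Unicode replacement character U+FFFD, which appears when bytes are decoded with an encoding that does not match how they were written. The sketch below reproduces the symptom on a hypothetical string and counts the affected characters. Which encoding the actual atu_df.csv uses is not stated in this thread, so the Latin-1 example is only an assumption.

```python
REPLACEMENT = "\ufffd"  # the glyph usually drawn as a question mark in a black diamond


def count_replacement_chars(text):
    """Count how many U+FFFD replacement characters a string contains."""
    return text.count(REPLACEMENT)


# Hypothetical example: text written as Latin-1 but read back as UTF-8.
original = "Märchen"
latin1_bytes = original.encode("latin-1")

misread = latin1_bytes.decode("utf-8", errors="replace")  # wrong encoding
reread = latin1_bytes.decode("latin-1")                   # matching encoding

print(count_replacement_chars(misread))  # damaged characters after the mismatch
print(reread == original)                # the matching encoding round-trips
```

If the replacement characters are already stored in the file itself, re-reading with another encoding will not bring the original letters back; they would have to be restored from the source the file was built from.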

j-hagedorn commented 6 months ago

Reopening this issue due to a few e-mail requests from @sdaranyi , to allow for working with data in Orange.

sdaranyi commented 6 months ago

Thanks Josh!

j-hagedorn commented 6 months ago

@sdaranyi this issue should be fixed. I tested it in Excel and it works fine.

sdaranyi commented 6 months ago

Confirmed, it works, with good sandbox-level results. One way to showcase the approach -- build and equip your own castle, bake your own delicious mudcakes. :-) Attached below is the joint semantic-sentiment space of the AFT as a first try.

[image: AFT joint space 24-03-06_ACM kriging]

j-hagedorn commented 6 months ago

Cool. Can you let me know what algorithms/methods you're using in Orange, so that I can test them out using reproducible (code-based) approaches?

j-hagedorn / trilogy

Resolve non-valid ATU IDs in AFT dataset #48