upenndigitalscholarship / deep

MIT License

Title level in admin not being handled properly #72

Closed ZacharyLesser closed 1 year ago

ZacharyLesser commented 1 year ago

The Title level in admin is meant to encompass all the data for a given play title that is identical across all of its Editions and all of the Items in those Editions.

Title --> Edition --> Item

This is why in Item, you can select which Edition it belongs to, and it will inherit all the data of that edition. And if you change the data in that Edition, that change will be inherited by all items nested within that edition.

Likewise, at the Edition level, you should be able to select a Title and the same relationship should hold: it should inherit all the data of that Title, and if you change anything in the Title, it should change for all Editions (and hence all Items) tied to that Title.

The problem right now is that there are multiple records in Title for each play. Search for Hamlet at the Title level and you get 8 results, one for each Item of this play. This is incorrect. You should get just ONE result, Hamlet, which should include the data fields exactly as they are now, but then all of the Editions of this play should be linked to, subsumed under, this ONE Title, and should inherit all that data.

Otherwise, there's no real purpose to having the Title level in admin.
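A minimal sketch of the inheritance described above, using plain Python dataclasses rather than the project's actual Django models. The class layout, the `genre` field, and all sample values are assumptions made for illustration only; the point is that shared data lives once on the Title and is looked up through the parent, so a change at the Title level is seen by every Edition and Item beneath it.

```python
from dataclasses import dataclass


@dataclass
class Title:
    # Data shared by every Edition and Item of the work (fields are hypothetical).
    name: str
    genre: str


@dataclass
class Edition:
    title: Title        # many Editions may point at the same Title
    book_edition: int   # hypothetical edition-level field

    @property
    def genre(self) -> str:
        # Looked up through the parent Title, never copied.
        return self.title.genre


@dataclass
class Item:
    edition: Edition    # many Items may point at the same Edition
    deep_id: int        # placeholder value, not a real DEEP number

    @property
    def genre(self) -> str:
        return self.edition.genre


hamlet = Title(name="Hamlet", genre="genre A")
first = Edition(title=hamlet, book_edition=1)
second = Edition(title=hamlet, book_edition=2)
items = [Item(first, 1), Item(first, 2), Item(second, 3)]

# One change at the Title level is visible from every Item.
hamlet.genre = "genre B"
```

With one Title object per work, searching the Title level for Hamlet would return a single record, and every Edition and Item would reach its shared data through that record.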

ZacharyLesser commented 1 year ago

Getting this right will also help a lot when the time comes to implement the new Title View, since it will be pulling data from the Title table in admin. A lot of the work of getting that view right will already be done if we get this right, I think.

apjanco commented 1 year ago

In the data model, there is a one-to-one relationship between Item and Edition. The same is true of Edition and Title. Somehow, multiple Titles and Editions are being created, likely in convert_web_jsonl.py when Edition.objects.get_or_create() and Title.objects.get_or_create() are called. This script needs to be updated. (Also check that deeps.jsonl is using deep_id_revised and not other variations in the old db.)
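One way a get-or-create step can produce a Title per Item, sketched in plain Python. This is not the code in convert_web_jsonl.py; the helper below only mimics the "look up by the given fields, create if nothing matches" behaviour, and the assumption is that the lookup includes a field that differs for every Item (a hypothetical `deep_id` with made-up values).

```python
def get_or_create(table, **lookup):
    """Return (row, created): find a row matching every lookup field, else add one."""
    for row in table:
        if all(row.get(key) == value for key, value in lookup.items()):
            return row, False
    row = dict(lookup)
    table.append(row)
    return row, True


# Hypothetical import records: three Items of the same play.
records = [
    {"deep_id": 1, "title": "Hamlet"},
    {"deep_id": 2, "title": "Hamlet"},
    {"deep_id": 3, "title": "Hamlet"},
]

per_item_titles = []  # lookup includes a per-Item field
shared_titles = []    # lookup uses only data shared across Items

for rec in records:
    get_or_create(per_item_titles, title=rec["title"], deep_id=rec["deep_id"])
    get_or_create(shared_titles, title=rec["title"])
```

Here the first table ends up with one row per Item, while the second holds a single Hamlet row that all three records resolve to.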

ZacharyLesser commented 1 year ago

(isn't there a many-to-one relationship between these levels in that multiple Items can be part of the same Edition, and multiple Editions can be part of the same Title?)

ZacharyLesser commented 1 year ago

Just adding: the problem I noted with Date of First Edition search, where a 2nd edition that should appear in results (and does on the old site) does not appear, seems potentially related to this. As do the problems with Collections and with Latin plays. Perhaps if the Title level is handled right, these will also be resolved?

ZacharyLesser commented 1 year ago

Just to be clear:

Title (only one for each work)

includes Edition 1 Edition 2 .... Edition n

Edition 1 includes Item 1 Item 2 .... Item n

Edition 2 includes Item 1 through Item n

And so on....

apjanco commented 1 year ago

Removed deep_id from Title object. Re-imported data. Still getting some >1 Title objects given differences in alternative keywords and filter genre. Probably best to manually identify which Title object to keep, then update all (un)related Editions and Items.

Titles with multiple objects:

2 Hycke Scorner
2 Aristippus, or The Jovial Philosopher
4 The Countess of Pembroke's Arcadia
3 Comedies, Tragicomedies, and Tragedies
2 Five New Plays
2 The Masque of Blackness (The Twelfth Night's Revels)
2 The Goblins
2 1 The Passionate Lovers
3 The Works
2 Appius and Virginia
2 Antonio and Mellida
2 The Masque of Augurs
4 Poems
2 The Malcontent
2 Christ's Passion
2 Two Plays
2 Fancy's Festivals
2 The Masque at Lord Haddington's Marriage (The Hue and Cry after Cupid)
2 The Heir
2 Two New Plays
2 1 & 2 The Passionate Lovers
2 Love's Labor's Lost
2 The Monarchic Tragedies
2 Julius Caesar
2 2 The Passionate Lovers
2 The Courageous Turk, or Amurath the First
2 The Blind Beggar of Alexandria (Irus)
2 Love's Sacrifice
2 Medea
2 Hippolytus
2 The Masque of Beauty
2 Masquerade du Ciel
2 Aminta
2 The Whole Works
2 Andria
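A tally like the one above can be produced with a short count over Title names. This is a sketch under the assumption that the names are available as a plain list of strings; the function name and sample list are made up for illustration.

```python
from collections import Counter


def titles_with_multiple_objects(title_names):
    """Map each title name that occurs more than once to its count."""
    counts = Counter(title_names)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample standing in for the names of all Title objects.
sample = ["Medea", "Andria", "Medea", "Hamlet", "Andria"]
duplicates = titles_with_multiple_objects(sample)
```

A name appearing here is only a candidate: as the rest of the thread shows, two different works can legitimately share a title.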

apjanco commented 1 year ago

Or I can make the import stricter to prevent the creation of duplicates. Will give that a try.

apjanco commented 1 year ago

With title and author (thank you for the suggestion Zach), we get far fewer >1 Title objects.

4 The Countess of Pembroke's Arcadia
3 Comedies, Tragicomedies, and Tragedies
3 The Works
2 Appius and Virginia
4 Poems
2 The Malcontent
2 Christ's Passion
2 Two Plays
2 Two New Plays
2 Julius Caesar
2 Medea
2 Hippolytus
2 The Whole Works
2 Andria
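A sketch of what matching on title and author does, assuming import records are dicts with hypothetical `title` and `author` keys. The author names are taken from the list of same-named works later in this thread; the grouping helper is illustrative and not the import script itself.

```python
from collections import defaultdict


def group_records(records, key_fields):
    """Group records by the tuple of values found in key_fields."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[field] for field in key_fields)].append(rec)
    return dict(groups)


records = [
    {"title": "Two Plays", "author": "Shirley"},
    {"title": "Two Plays", "author": "Mayne"},
    {"title": "Two New Plays", "author": "Carlell"},
    {"title": "Two New Plays", "author": "Middleton"},
]

by_title = group_records(records, ["title"])
by_title_author = group_records(records, ["title", "author"])
```

Grouping by title alone yields two groups and merges different works that share a name; grouping by title and author keeps all four apart. The same key would split a single work whose author attribution differs between records, so it is not a complete answer on its own.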

apjanco commented 1 year ago

Re: "Andy, where are Author Display/Filter being handled currently? Only in Title? Only in Edition? Or in both?" We currently have both. I can retire title.authors_display and just use edition.authors. Is it correct that you'd want authors associated with Edition rather than Title?

ZacharyLesser commented 1 year ago

OK, here's what should happen:

Title level: should have NO Author info

Edition Level: should have BOTH the Modern Author and the Modern Author Display fields

Item Level: should have Author (Title Page Attribution) and the new Author (Paratext) fields

ZacharyLesser commented 1 year ago

We will go through the list of duplicate Titles above and let you know which ones are true duplicates and which ones need to remain. We could do this ourselves in the db but would we inadvertently break any links with the Editions? If that's not likely, then we'll just do it ourselves.

ZacharyLesser commented 1 year ago

Should Actually Be One Title:

4 The Countess of Pembroke's Arcadia
2 The Malcontent [Greg 203]
2 Christ's Passion [Greg 579]

Should Remain Different/Multiple titles:

3 Comedies, Tragicomedies, and Tragedies [by Chapman, by Ford, by Marston]
3 The Works [by Daniel, by Jonson, by Marston]
4 Poems [by Gomersall, by Carew, by Milton, by Beaumont and Fletcher]
2 Two Plays [by Shirley, by Mayne]
2 Two New Plays [by Carlell, by Middleton]
2 The Whole Works [by Gascoigne, by Daniel]
2 Appius and Virginia [Greg 65, Greg 733]
2 Julius Caesar [Greg 261, Greg 403]
2 Medea [Greg 44, Greg 675]
2 Hippolytus [Greg 80, Greg 696]
2 Andria [Greg 12, Greg 91]

ZacharyLesser commented 1 year ago

Something has gone wonky here, I think in reducing the Titles to one per work. Especially (or maybe only) for the ones in the list above, the distinction between two works (that is, two Titles) that are different but happen to share the same title has been lost. So now there is only one Two Plays, but it is only the one by Mayne; and at the Edition level, only the plays by Shirley in that collection have been retained, which leads to big confusion. Also, there should be quite a few Editions of The Countess of Pembroke's Arcadia, but in reducing the Titles to 1 for that title (which is correct: it should have only 1), the Editions have also been reduced to just one (see attached), which means results are all wrong.

Screen Shot 2023-01-26 at 1.39.12 PM.pdf

ZacharyLesser commented 1 year ago

I think we will need to restore the old data and then Alan and I probably need to go through all the ones in the list above https://github.com/upenndigitalscholarship/deep/issues/72#issuecomment-1404046835 and fix them by hand.

The key thing is that there can be multiple Items associated with any single Edition, and there can be multiple Editions associated with any single Title. (Perhaps we should have called this top level "Work" instead of "Title", since it is a little confusing that 2 different Works can happen to have the same title -- e.g., Two New Plays or Andria.)

apjanco commented 1 year ago

For The Countess of Pembroke's Arcadia, all records show PLAY EDITION: n/a. The book edition is 7 for all of them. Its GREG is n/a. Where in the data is there an indication that these are different editions?

ZacharyLesser commented 1 year ago

Something has gone wrong, but it used to be correct (as I recall).

Book Edition #s for Editions of Countess of Pembroke's Arcadia should run from 3 to 11 (Edition 5 has 2 Items; Edition 6 has 2 Items; Edition 7 has 5 Items; Edition 8 has 3 Items; the others each have only 1 Item).

DEEP #s for items are 5030 - 5046 -- it looks like these are all still in the db at the Item level, but some Editions have been deleted in the process of reducing the Titles, and this has resulted in the Items being associated all with one Edition, and hence only 1 result if you search for the title.
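The numbers above are consistent with each other, which can be checked mechanically. The sketch below uses only the figures given in this comment; the function name and the shape of the `actual` mapping (book edition number to Item count) are assumptions for illustration.

```python
# Items per Book Edition for The Countess of Pembroke's Arcadia, as listed above.
EXPECTED = {3: 1, 4: 1, 5: 2, 6: 2, 7: 5, 8: 3, 9: 1, 10: 1, 11: 1}

# DEEP #s 5030 through 5046 inclusive.
DEEP_IDS = range(5030, 5047)


def mismatched_editions(actual):
    """Return {edition: (expected, found)} for every edition whose count differs."""
    editions = sorted(set(EXPECTED) | set(actual))
    return {
        e: (EXPECTED.get(e, 0), actual.get(e, 0))
        for e in editions
        if EXPECTED.get(e, 0) != actual.get(e, 0)
    }


total_expected = sum(EXPECTED.values())
```

The expected counts sum to the same number of Items as the DEEP # range contains, so every Item in that range should land in exactly one of the nine Editions. Passing the counts found in the db to `mismatched_editions` would list which Editions are missing or overfull.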

ZacharyLesser commented 1 year ago

If you check an older version of the db, aren't there a whole bunch of Editions of this?

ZacharyLesser commented 1 year ago

It occurred to us when we got off that we could simply create a new unique identifier, the Work Identifier, that would do the job perfectly, with no outliers, of grouping all Items into the same correct Title.

This would work better than the system of using Title + Author for the reasons we discussed.

To do this, we'd need from you an Excel sheet listing, for each Item:

DEEP #
Greg Brief
Collection (this is a string beginning with c, then a number, then a letter: we saw it there for Countess of Pembroke Arcadia and it should exist for every Item that does not have a Greg Brief)

We would then add a new Work identifier and send the sheet back to you to import.

Work Identifier would go at the Title level and would serve as the link between Items and their Title.

The system of play edition number, then book edition number, would work to generate Editions within these Titles.

Do you want to do this? It will produce no outliers like Arcadia, where the authors change. There are enough of these that it seems to us worth doing, since it produces a genuinely unique, algorithmically useful way of grouping items under the same Title.

(BTW we should change Title in admin to Work, I think, to avoid confusion with the ordinary title of a play.)
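A sketch of how the proposed grouping could work once each Item carries a Work Identifier, assuming each spreadsheet row becomes a dict. The key names (`work_id`, `play_edition`, `book_edition`, `deep_id`) and all sample values are hypothetical; the real column names and identifier format would come from the spreadsheet.

```python
from collections import defaultdict


def build_hierarchy(items):
    """Group Items into Work -> Edition -> [DEEP #s].

    A Work is identified by work_id alone; an Edition within a Work is
    identified by the pair (play_edition, book_edition).
    """
    works = defaultdict(lambda: defaultdict(list))
    for item in items:
        edition_key = (item["play_edition"], item["book_edition"])
        works[item["work_id"]][edition_key].append(item["deep_id"])
    return {work_id: dict(editions) for work_id, editions in works.items()}


# Made-up rows: two works, the first with two editions.
items = [
    {"deep_id": 1, "work_id": "W1", "play_edition": 1, "book_edition": 1},
    {"deep_id": 2, "work_id": "W1", "play_edition": 1, "book_edition": 1},
    {"deep_id": 3, "work_id": "W1", "play_edition": 2, "book_edition": 2},
    {"deep_id": 4, "work_id": "W2", "play_edition": 1, "book_edition": 1},
]
hierarchy = build_hierarchy(items)
```

Because the Work key does not depend on title or author, two works that share a title stay separate, and a work whose author attribution changes between editions stays together. A play edition recorded as n/a would still work as part of the Edition key, since the key is only compared for equality.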

ZacharyLesser commented 1 year ago

Here is the spreadsheet matching every DEEP # to a Work ID

Work ID.xlsx

ZacharyLesser commented 1 year ago

In admin, at the Works level, we don't need the year following the title in the name of the Work -- this is because Works don't really have "a" year; only Editions and Items have a year attached to them. The work is printed over time in many different years, and so the only years that appear in the data associated at the Work level are Date of First [publication, performance].

So these can just say "The Phoenix" instead of "The Phoenix 1607".

Screen Shot 2023-01-28 at 10.46.47 AM.pdf

ZacharyLesser commented 1 year ago

(I believe that #110 and #109 are both related to this problem and will be fixed when this is all resolved.)