gutenbergtools / autocat3

CherryPy App that serves dynamic content for Project Gutenberg
GNU General Public License v3.0

Landing pages and search results should display Title, not Uniform Title #77

Closed gbnewby closed 3 years ago

gbnewby commented 3 years ago

There is a little logic in a few places in autocat3 that displays the uniform title rather than the title, when a uniform title exists. I see this in templates/bibrec.html and AdvSearchPage.py.

This is frequently reported as a problem/anomaly by our users (and cataloger). The situation is that a title search will yield results in a different language, and therefore no visibly matching title words.

Another situation is when an author landing page for an English book displays the title in a language other than English. For example, at https://www.gutenberg.org/ebooks/author/85 you can see a listing for "Quatrevingt-treize. English." This is using the uniform title (field 245, I think). But then the landing page correctly uses the title field (240, I think): https://www.gutenberg.org/ebooks/49372

In short, I concur with our cataloger that we should always display titles, not uniform titles. Titles are more "correct" for the actual book contents. Uniform titles might be good to display as a field in the bibrec section of a landing page, but are not appropriate for search results.

I note that the specific author landing page for Victor Hugo does not come directly from autocat3 (it's generated by a nightly cron job). But here is a quick search yielding the exact same behavior: https://www.gutenberg.org/ebooks/search/?query=a.victor+hugo&submit_search=Go%21 The nightly cron job leverages the same logic (I can help track that down, if needed, but fixing it in autocat3 might also fix the cron job).

eshellman commented 3 years ago

I've looked at fixing this before. The issue is the titles are stored in the attributes table and can't really be returned with the author query as the uniform title can be (maybe they can, but I'd sooner re-write the pages using ORM code and take some performance hit). So instead of a single query, an additional query must be made for every title returned. The usual ways for highly trafficked websites to deal with this are: to add a field to the books table; to populate the results asynchronously; or to shard or beef up the database. Now it may well be that the relevant result pages are small and infrequent, so performance doesn't matter - if we extract some numbers from the logs this would help inform the development work, but in any case this would need re-writing a page and load testing - a couple days of work.
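To make the trade-off concrete, here is a minimal sketch of the two access patterns: one extra query per returned book versus a single joined query. It runs against an in-memory SQLite database rather than the real PostgreSQL catalog. Only the column names `fk_books`, `fk_attriblist` and `text` appear later in this thread; the `books.pk` column, the sample rows and both function names are assumptions made for illustration. Whether a join like this is practical inside autocat3's actual author query is not established here.

```python
import sqlite3

# Hypothetical, simplified schema. Only fk_books, fk_attriblist and text
# are column names taken from the thread; the rest is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (pk INTEGER PRIMARY KEY);
    CREATE TABLE attributes (
        fk_books INTEGER, fk_attriblist INTEGER, text TEXT);
    INSERT INTO books VALUES (1), (2);
    INSERT INTO attributes VALUES
        (1, 240, 'Uniform title of book 1'),
        (1, 245, 'Title of book 1'),
        (2, 245, 'Title of book 2');
""")


def titles_per_book(conn, book_ids):
    """Fetch the 245 title with one extra query per result row."""
    titles = {}
    for pk in book_ids:
        row = conn.execute(
            "SELECT text FROM attributes "
            "WHERE fk_books = ? AND fk_attriblist = 245", (pk,)).fetchone()
        titles[pk] = row[0] if row else None
    return titles


def titles_single_query(conn):
    """Fetch the same titles with one LEFT JOIN on the attributes table."""
    rows = conn.execute(
        "SELECT b.pk, a.text FROM books AS b "
        "LEFT JOIN attributes AS a "
        "ON a.fk_books = b.pk AND a.fk_attriblist = 245 "
        "ORDER BY b.pk").fetchall()
    return dict(rows)


per_book = titles_per_book(conn, [1, 2])
joined = titles_single_query(conn)
```

Both calls return the same mapping of book id to title. The first issues one query per book, which is the cost described above; the second issues a single query, at the price of a join against the attributes table.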

gbnewby commented 3 years ago

I think it's worthwhile to pursue. This has been frequently reported as a problem by our readers (i.e., often multiple times per month). When people see non-English results it looks like something is wrong with our system.

I do like the idea of doing it via ORM and page rewrite, since that is hopefully our way forward with this application. The performance hit may matter for search results, but not for static landing pages.

eshellman commented 3 years ago

So you realize that switching the display will result in a ridiculous-looking alphabetization? And switching the field used for alphabetization will cause complaints about sorting "The Aristocrats" under "T" instead of "A". I don't think you can win here.

eshellman commented 3 years ago

Yesterday: 7430 | 0.20% | 0.02 |   | /ebooks/authors/search/

so about 5 author searches a minute
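The per-minute figure follows from simple division, assuming the 7430 in the log line is the number of requests to that path over one full day (the column is not labelled in the excerpt):

```python
requests_per_day = 7430        # figure quoted from the log line above
minutes_per_day = 24 * 60      # 1440 minutes in a day
per_minute = requests_per_day / minutes_per_day
print(f"{per_minute:.2f} requests per minute")  # prints 5.16 requests per minute
```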

gbnewby commented 3 years ago

This still seems worthwhile. Agreed that sorting might need additional consideration (or an asterisk to explain).

Via our cataloger:

I went and looked at the page where this is being discussed, and noticed a couple of issues:

Title is field 245
Uniform title is field 240

There's discussion of how this is going to screw up sorting. Sorting should ALSO be done on the title field (245) and NOT the uniform title field (240).

I think all bib records that have a 240 should ALSO have a 245, so I don't see why there would be difficulties about just doing everything title-display-related and title-sort-related with the 245. If someone can point me to records that have 240 fields but lack 245 fields, I will fix them so that we don't have that situation any more.

Both 240 and 245 (and also 505) fields should be treated as matches for title searches.

From the MARC Attribute usage page available from the Catalog Admin page, here are the number of times each of these fields is used in the catalog.

240 - Uniform title: 1778
245 - Title Statement: 65229

So I would be very surprised if the catalog database is set up so that 240 Uniform Title is somehow more efficient to use.

I hope this helps,

eshellman commented 3 years ago

OK, I looked at this again. The title sort was not what I thought.

Here's the function that creates the title for sorting:

```sql
SELECT ltrim (substring (attributes.text FROM attributes.nonfiling))
   FROM attributes
   WHERE attributes.fk_books = $1
   AND attributes.fk_attriblist = ANY (ARRAY[240, 245, 246])
   ORDER BY attributes.fk_attriblist
   LIMIT 1;
```

So the uniform title (240) is used whenever it exists, simply because 240 sorts before 245!

It's in the SQL schema, so this has nothing to do with autocat3! It should be trivial to change for someone familiar with PostgreSQL; the database would then need to be rebuilt somehow.
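For anyone picking this up, here is a rough sketch of the behaviour and of one possible change, run against an in-memory SQLite table rather than the real PostgreSQL schema. It is not necessarily the fix that was eventually applied. The sample rows are made up for illustration (including the English title and the nonfiling value of 4 for "The "), the names `ORIGINAL_ORDER`, `TITLE_FIRST_ORDER` and `sort_title` are hypothetical, and the `CASE` ranking is just one assumed way to prefer 245 over 240.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (
        fk_books INTEGER, fk_attriblist INTEGER,
        text TEXT, nonfiling INTEGER);
    -- Illustrative rows only, not real catalog data.
    INSERT INTO attributes VALUES
        (1, 240, 'Quatrevingt-treize. English', 0),
        (1, 245, 'Ninety-Three', 0),
        (2, 245, 'The Aristocrats', 4);
""")

# Ordering used by the function quoted above: lowest field number wins.
ORIGINAL_ORDER = "a.fk_attriblist"
# Assumed alternative: rank 245 first, then 240, then 246.
TITLE_FIRST_ORDER = (
    "CASE a.fk_attriblist WHEN 245 THEN 0 WHEN 240 THEN 1 ELSE 2 END")


def sort_title(conn, book_id, order_by):
    """Return the sort title for one book under the given ordering."""
    row = conn.execute(
        "SELECT a.text, a.nonfiling FROM attributes AS a "
        "WHERE a.fk_books = ? AND a.fk_attriblist IN (240, 245, 246) "
        f"ORDER BY {order_by} LIMIT 1", (book_id,)).fetchone()
    if row is None:
        return None
    text, nonfiling = row
    # Emulates ltrim(substring(text FROM nonfiling)): PostgreSQL positions
    # are 1-based, and a start below 1 yields the whole string.
    start = max(nonfiling - 1, 0)
    return text[start:].lstrip()


before = sort_title(conn, 1, ORIGINAL_ORDER)
after = sort_title(conn, 1, TITLE_FIRST_ORDER)
```

With these sample rows, `before` is the uniform title and `after` is the 245 title. Book 2 still sorts under "Aristocrats" because the nonfiling offset is applied to whichever field is chosen.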

Closing the issue because it can't be addressed with anything in the repo. Probably would be a good idea to put the database schema in a new repo.

gbnewby commented 3 years ago

The cited SQL code is part of autocat3, so I'd rather leave this issue open. We can tag it "won't fix" for now.

I'll try to get support in adjusting the SQL, or refactoring the database. I think we just need a slightly more sophisticated query.

eshellman commented 3 years ago

No, it's not part of autocat3. It might possibly be part of the cataloguing tool. The right way to fix this is to change the schema, which most likely has not been committed anywhere. Our query is fine. I know of someone who might be willing to help.

eshellman commented 3 years ago

I will create a repo for the SQL.

eshellman commented 3 years ago

fix applied to production