uchicago-library / mlc_ucla


Check OLA browses against DMA browses #35

Closed dbietila closed 5 months ago

dbietila commented 6 months ago

Please compare OLA browses to DMA browses to ensure that we are indexing needed data appropriately.

Language: https://dma-test.lib.uchicago.edu/browse/?type=language https://dma.uchicago.edu/browse/languages

Location: https://dma-test.lib.uchicago.edu/browse/?type=location https://dma.uchicago.edu/locations

In particular, there seem to be far fewer locations browseable in the OLA interface, but there are also some disparities in languages.

johnjung commented 5 months ago

Right now the system is set up to only display place names when we are able to get a label for a TGN identifier. For example, the dcterms:spatial "7005346" appears in our triples. This is http://vocab.getty.edu/tgn/7005346, which has a label of "Belize", which appears on our location browse.

By way of comparison, the dcterms:spatial "1000046" refers to http://vocab.getty.edu/tgn/1000046. This is a reference to Bolivia; this identifier appears on the DMA website, in FileMaker Pro, and in our triples. However, it does not appear in the tgn.ttl data file that we use to look up location labels.

Here is a list of TGN identifiers for which we have no labels:

spatials_without_labels.txt

And here is the TGN data we use to look up those labels:

tgn.ttl.txt

@c-blair, could you check to see if there is a more complete version of the TGN data we could use for these label lookups? I'm guessing that the same thing is happening with our Glottolog language lookups, so if the report I gave for locations is helpful please let me know and I'll repeat it for languages.
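As a minimal sketch of how a report like spatials_without_labels.txt can be produced: collect the dcterms:spatial identifiers found in the triples and subtract the identifiers that have a label in the TGN data. The function name and the in-memory inputs below are assumptions for illustration, not the code the site actually runs.

```python
def ids_without_labels(spatial_ids, labels_by_id):
    """Return, sorted, the identifiers that have no entry in the label lookup."""
    return sorted(set(spatial_ids) - set(labels_by_id))


# Illustrative inputs taken from the two examples in this comment.
spatial_ids = ["7005346", "1000046", "7005346"]  # identifiers seen in the triples
labels_by_id = {"7005346": "Belize"}             # labels available from tgn.ttl

missing = ids_without_labels(spatial_ids, labels_by_id)
print(missing)  # ['1000046']
```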

c-blair commented 5 months ago

There is no connection between Glottolog and TGN here; totally different beasts.

If I have a list of which TGN codes don't have data, I can provide them. I believe that may have already been supplied and may already be on my to-do list but if so it's fallen off the back of the stove.


-- Charles Blair | Director, Digital Library Development Center, University of Chicago Library | https://www.lib.uchicago.edu/~chas/

johnjung commented 5 months ago

Thanks very much, Charles. spatials_without_labels.txt should contain the list of TGN codes we don't have data for. If I can get you anything else or run any more reports, please let me know.

c-blair commented 5 months ago

Where is that file?


johnjung commented 5 months ago

There is a link to spatials_without_labels.txt in my first comment above.

c-blair commented 5 months ago

moreTGNdata.txt https://drive.google.com/file/d/1JvWRkU9N4tDcbH70A4-aguhysana_uxh/view?usp=drive_web

johnjung commented 5 months ago

I just requested access to the file above.

johnjung commented 5 months ago

Thanks very much for providing access to that file, Charles. I just rebuilt the database with the new data, and you can see the new location browse on ola.lib. There are more locations with labels now, but the following identifiers still appear in our FileMaker Pro data and not in moreTGNdata.txt, so they won't appear in our location browse:

7005490 7005493 70005562 7005494 7005590 1000495 7005580 7007227 7005592 7005592 7005592 7005592 7005592 7005592 7005592 1000765 7005600 1016710 7005599 1016643 7005591 7013596 7005585 7005560 7005598 7005575 1001893 1016717 7005587 7005346 1016701 7005554 1000639 1000785 1000736

(Please note that at least one of the identifiers above seems to be a typo: https://vocab.getty.edu/tgn/70005562 doesn't return any data via a web browser.)

How do you feel about this? Are the location browses we have on ola.lib acceptable now, or should we find a way to keep debugging this? Is even more TGN data available from somewhere, or was that all we can get?
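A quick way to tidy a list like the one above is to drop repeats and flag identifiers whose shape looks off. This is only a sketch: the seven-digit rule is an assumption based on the identifiers quoted in this thread, not a documented TGN constraint, and the function name is hypothetical.

```python
import re

# Assumption: identifiers in this data set are exactly seven digits long.
SEVEN_DIGITS = re.compile(r"\d{7}")


def split_suspect_ids(raw):
    """Deduplicate whitespace-separated ids, then split them into (ok, suspect)."""
    seen = []
    for token in raw.split():
        if token not in seen:  # keep first occurrence, preserve order
            seen.append(token)
    ok = [t for t in seen if SEVEN_DIGITS.fullmatch(t)]
    suspect = [t for t in seen if not SEVEN_DIGITS.fullmatch(t)]
    return ok, suspect


ok, suspect = split_suspect_ids("7005490 7005493 70005562 7005592 7005592")
print(ok)       # ['7005490', '7005493', '7005592']
print(suspect)  # ['70005562']
```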

johnjung commented 5 months ago

Thanks very much for the most recent update, Charles. I loaded these onto ola.lib. There are still some differences between our site and dma.uchicago, but given that we don't know how the list on dma.uchicago was produced, can we call this resolved?

Because @dbietila mentioned languages in his initial report, I'm going to open a separate issue for that so we can track it separately.

johnjung commented 5 months ago

After looking into this further, I think that the best way to proceed with improvements to the language browse is covered in https://github.com/uchicago-library/ucla/issues/95. @dbietila, please let me know if the language aspect of this issue is a blocker that needs to be addressed before launch, or if we can get to it when time allows, as per https://github.com/uchicago-library/ucla/issues/95.

johnjung commented 5 months ago

At CB's request, to make it easier to evaluate the differences between the location browse on dma.uchicago and ola.lib, here is a log for each one that you can diff in whatever diff tool you like (e.g. vimdiff):

dma_uchicago_locations.log

ola_lib_locations.log

The two lists are very different, but some of the values on the dma.uchicago site lead me to believe that their location browse was either constructed manually or built from old data. For example, the first location on their browse is "[Santa Cruz] Barillas". That entire string does not appear anywhere in the FileMaker Pro database. Additionally, neither "Santa Cruz" nor "Barillas" appears under "Geographic Coverage" for any item.

Because of this, I feel that it's probably not a good use of time to try to get the browses to "match". Instead my preference is to see if the project stakeholders find the browses on ola.lib useful, and if not, to describe to us why so we can evaluate whether or not we're pulling from the correct place in the FileMaker Pro data.
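For anyone who would rather summarize the two logs than read a side-by-side diff, here is a minimal sketch that treats each log as a set of labels. It assumes one label per line; the function name and the sample values are illustrative only and are not the actual contents of the logs.

```python
def compare_browses(ola_lines, dma_lines):
    """Return (only_in_ola, only_in_dma, in_both), each as a sorted list."""
    ola = {line.strip() for line in ola_lines if line.strip()}
    dma = {line.strip() for line in dma_lines if line.strip()}
    return sorted(ola - dma), sorted(dma - ola), sorted(ola & dma)


# Made-up sample lines; with real files, pass open("ola_lib_locations.log") etc.
ola_lines = ["Belize\n", "Scotland\n", "Yucatan\n"]
dma_lines = ["Belize\n", "Seattle\n", "Yucatán\n"]

only_ola, only_dma, both = compare_browses(ola_lines, dma_lines)
print(only_ola)  # ['Scotland', 'Yucatan']
print(only_dma)  # ['Seattle', 'Yucatán']
print(both)      # ['Belize']
```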

c-blair commented 5 months ago

These are the diffs (ola dma):

Rajasthan                                                     | Rio Arriba
Rossiya republic                                              | Rossiya
SH03                                                          <
San Casme del Tucson                                          | Saint Petersburg (autonomous city)
San Cristóbal Totonicapán                                     | San Juan Pueblo
San Mateo Ixtatán                                             | San Marcos
San Vito de Normanni                                          | Santa Eulalia
Santa Cruz del Quiché                                         <
Santiago                                                      <
Scotland                                                      | Seattle
Siberia                                                       | Sierra
Sololá                                                        | Sonsonate
South Australia                                               | South America
                                                              > South Asia
South Dakota                                                  <
South San Francisco                                           <
State of Assam                                                | Sudan
State of Kerala                                               <
State of Mahārāshtra                                          <
Stavropol’ Kray                                               <
Suchitepeque                                                  <
Surig                                                         <
Swaziland                                                     <
Tahlequah                                                     | Tamluk
Tamilnadu                                                     | Tantoyuca
                                                              > Tonawanda Indian Reservation
                                                              > Tucson
                                                              > Urban
                                                              > Valladolid
                                                              > Valladolid (province)
                                                              > Venustiano Carranza
Villa de San Francisco de Quito                               <
Xingu, Parque Nacional do                                     | Yucatán
Yucatan                                                       | Zinacantán
Yunnan                                                        | [Santa Cruz] Barillas

I'm a bit perturbed by Yucatan in OLA vs. Yucatán in DMA. The accent mark is omitted if the places are in the U.S.A., but a spot-check shows that they aren't.

johnjung commented 5 months ago

Regarding the accent mark in Yucatán: the database itself stores a TGN identifier, which in this case is http://vocab.getty.edu/tgn/7005600. On the website side, we look up the label for each location in the TGN data that we've been updating as we work on this issue. For that specific TGN identifier, the TGN data has the following assertions:

http://www.w3.org/2000/01/rdf-schema#label "Yucatán"@en "Yucatan"@en

http://www.w3.org/2004/02/skos/core#prefLabel "Yucatan"@en

This seems like a TGN data issue to me, because even if I go to http://vocab.getty.edu/tgn/7005600, the header on the webpage omits the accent, and the data specifically omits the accent in the preferred label (prefLabel). I don't think there is a good way to solve this problem using the data as-is.
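This does not fix the displayed label, but when comparing the two browses it may help to fold accents first so that pairs like "Yucatan" and "Yucatán" are not reported as differences. A minimal sketch using Python's standard unicodedata module (the function name is hypothetical):

```python
import unicodedata


def fold_accents(label):
    """Strip combining marks so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Yucatán"))                             # Yucatan
print(fold_accents("Yucatán") == fold_accents("Yucatan"))  # True
```

Folding is only suitable for comparison; the original label should still be what gets displayed.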

c-blair commented 5 months ago

That's interesting. I made the statement I did by going to the human-readable Getty site and doing the search, which always shows the accent as indicated in my prior statement. This would be a problem to be pursued with Getty.

I think at this point we don't go back to stakeholders. That approach doesn't work at this scale. The stakeholders expect us to launch. If they find problems after launch, then they can get back. What we need to do is make sure that we have done the best we can with what we have, and then launch.

johnjung commented 5 months ago

That's my feeling too. Do the browses look good enough to you to close this issue out, or are there more things I can look into?

c-blair commented 5 months ago

I think that there is at least one glaring QC issue: apparently the Introduction to the Dacca dialect of Bengali is from the 1060s.

johnjung commented 5 months ago

I see that in the FileMaker database, item 8133 has a "Production Dates / Start Year" of 1065, which looks like the cause of the problem. This afternoon I'll open a new issue and assign it to Tom. If you'd prefer to solve this a different way, just let me know.
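To catch other records like this one, a simple range check on the start year could be run over the exported data. The sketch below is an assumption-heavy illustration: the 1850 lower bound, the function name, and the second sample record are all hypothetical, and only item 8133 with its year of 1065 comes from the database.

```python
from datetime import date


def implausible_start_years(items, earliest=1850, latest=None):
    """Return ids of (item_id, start_year) pairs whose year falls outside the range."""
    latest = date.today().year if latest is None else latest
    return [item_id for item_id, year in items if not earliest <= year <= latest]


# (8133, 1065) is the record discussed above; (1, 1965) is a made-up control.
items = [(8133, 1065), (1, 1965)]
flagged = implausible_start_years(items)
print(flagged)  # [8133]
```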

c-blair commented 5 months ago

If Tom can't solve it (not enough data) please get back to me. Thanks.

johnjung commented 5 months ago

No problem. I just submitted https://github.com/uchicago-library/ucla/issues/102 and assigned it to TD.

johnjung commented 5 months ago

https://github.com/uchicago-library/ucla/issues/102 is now closed. Are there any outstanding browse issues, or can we close this ticket too?

johnjung commented 5 months ago

I'm going to close this issue out. Please feel free to re-open if necessary.