nevenjovanovic / croala-pelagios

CITE semantic annotations for place references in Croatian Latin texts
Creative Commons Attribution 4.0 International

Clarification #19

Closed arsimr16 closed 7 years ago

arsimr16 commented 7 years ago

I know I need to go back and fix a lot of problematic lemmata, but before I do that, I want to make sure I understand our system.

I came across many examples of the same lemma having two (or more) URNs with different (or missing) definitions. The most common case was that one URN had the first definition listed in the Lewis and Short entry, and the other was missing a definition. I also had many examples of URNs with the first definition from Lewis and Short, but the usage in context did not match this first short definition.

I'm questioning our current solution (creating new URNs or adding to existing URNs). I don't think we should have multiple URNs for each definition of the same lemma. I thought that these existing definitions (and the ones we are adding) were supposed to be short definitions. It was my understanding that the URN was supposed to point to a lemma and not to a specific entry contained under that lemma. Even when a word can have different meanings in different contexts, I would consider that word to be the same word in each context, not different words. Therefore, I think that even if the short definition of a URN doesn't match the usage of a word in a specific context, it still points to that same word, which should be cited with one URN.

I don't know exactly what we will be doing in the next phases of this project, but at this point we can at least say that all of the words we are looking at are places (of some sort). Do we really need the short definition to tell us that these are places (or to which category of places the words belong)? Won't we be recording this information in a different phase?

Am I making any sense at all? Maybe I'm just confused. I hope we can talk about this during our meeting tomorrow.

nevenjovanovic commented 7 years ago

@arsimr16 -- Today we discussed a solution: use a single URN for a single lemma, and disregard homographs, differences in quantity (malus = mālus), and the like. It turns out that Perseus' lexical inventory file (inventory_import.csv) has 54,413 rows total and 51,891 distinct lemmata; ergo, there are 2,522 "duplicate" lemmata. It seems to me that it would be best if we "mint" our own complete set of unique lemmata with unique urn:cite:croala:latlexent prefixes, and keep a concordance table recording which croala latlexent lemma aligns with which Perseus latlexent. But what would you say about the uppercase / lowercase difference in our lemma list? Do we distinguish between Alba and alba, longus and Longus, or not? (This is connected with #6, by the way.)
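
A minimal sketch of how such re-minting and the concordance table could work, not the code actually used in the repository. Everything concrete in it is an assumption made for illustration: the column layout of inventory_import.csv is not described here, so the input is simply a list of (Perseus URN, lemma) pairs; the sample Perseus URNs are made up; the `.lexN` suffix on the new URNs is a hypothetical numbering scheme; and folding case to uppercase is only one possible answer to the Alba / alba question.

```python
import unicodedata
from collections import defaultdict


def normalize(lemma: str) -> str:
    """Drop combining marks (macrons, breves) and fold to uppercase."""
    decomposed = unicodedata.normalize("NFD", lemma)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.upper()


def mint(rows, prefix="urn:cite:croala:latlexent"):
    """Group (perseus_urn, lemma) pairs by normalized lemma.

    Returns a dict {normalized lemma: new URN} and a concordance
    list of (new URN, Perseus URN) pairs.
    The ".lexN" suffix is a hypothetical numbering scheme.
    """
    groups = defaultdict(list)
    for perseus_urn, lemma in rows:
        groups[normalize(lemma)].append(perseus_urn)

    lemmata = {}
    concordance = []
    for n, form in enumerate(sorted(groups), start=1):
        urn = f"{prefix}.lex{n}"
        lemmata[form] = urn
        for perseus_urn in groups[form]:
            concordance.append((urn, perseus_urn))
    return lemmata, concordance


# Made-up sample rows; the real data would come from inventory_import.csv.
sample = [
    ("urn:cite:perseus:latlexent.lex1", "malus"),
    ("urn:cite:perseus:latlexent.lex2", "mālus"),
    ("urn:cite:perseus:latlexent.lex3", "Alba"),
    ("urn:cite:perseus:latlexent.lex4", "alba"),
    ("urn:cite:perseus:latlexent.lex5", "longus"),
]
lemmata, concordance = mint(sample)
duplicates = len(sample) - len(lemmata)  # rows that collapsed into another lemma
```

On the sample, five rows collapse into three lemmata (ALBA, LONGUS, MALUS), while the concordance keeps all five Perseus URNs reachable from the new ones. Removing `.upper()` from `normalize` would keep Alba and alba apart while still merging malus and mālus.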

nevenjovanovic commented 7 years ago

The set of reformatted, re-minted lemmata is available as an XML file in the csv directory of our repository: csv/croala-latlexents-2.xml. @arsimr16, please check it out and say what you think. According to our discussions, the lemmata are in UPPERCASE (for example NILUM), and they are unique -- each sequence of letters occurs only once. There are 51,339 records. The file can be validated (e.g. in oXygen) with the schemas/cplatlexents.rng schema file. Alex, let's do some exercises with XML validation next week! All CITE URNs for lemmata are new, and over the weekend I will make sure that the other parts of our system play nice with them.
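
Apart from schema validation, the two properties stated above (uppercase, each letter sequence only once) can be spot-checked with a few lines of Python. This is only a sketch: the element names `record`, `urn` and `lemma` and the sample records are hypothetical, since the actual structure of croala-latlexents-2.xml is defined by the RNG schema and is not shown here. It does not replace validation against schemas/cplatlexents.rng.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; element names are assumptions, not the real schema.
SAMPLE = """<records>
  <record><urn>urn:cite:croala:latlexent.lex1</urn><lemma>ALBA</lemma></record>
  <record><urn>urn:cite:croala:latlexent.lex2</urn><lemma>NILUM</lemma></record>
  <record><urn>urn:cite:croala:latlexent.lex3</urn><lemma>Longus</lemma></record>
  <record><urn>urn:cite:croala:latlexent.lex4</urn><lemma>ALBA</lemma></record>
</records>"""


def check_lemmata(xml_text: str, tag: str = "lemma"):
    """Return (record count, lemmata not in uppercase, lemmata occurring twice or more)."""
    root = ET.fromstring(xml_text)
    lemmas = [(el.text or "").strip() for el in root.iter(tag)]
    not_upper = [lemma for lemma in lemmas if lemma != lemma.upper()]
    repeated = [lemma for lemma, n in Counter(lemmas).items() if n > 1]
    return len(lemmas), not_upper, repeated


count, not_upper, repeated = check_lemmata(SAMPLE)
print(count, not_upper, repeated)
```

For the real file, both lists should come back empty and the count should match the 51,339 records mentioned above.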

arsimr16 commented 7 years ago

Everything looks good to me. I'd love to do some exercises with XML validation.

~Alex
