Open droher opened 5 years ago
That's great! This is really cool. How do you imagine encoding it? A quick guess of mine might be:
- a <reg> tag to show the regularized form (i.e. two words reconstructed from their compound)
- <choice> to show multiple possibilities
Something like this:
<distinct type="nonstandard-compound">
buttocksmothered
<choice>
<reg>buttocks mothered</reg>
<reg>buttock smothered</reg>
</choice>
</distinct>
This is the first TEI doc I've ever used, so I would defer to you on the right encoding. If we go the reg route above, does that mean that the text inside of those tags would be picked up as text properties by XML parsers? Would there be a way to include them as attributes of an element instead?
Also I think the example you chose is one of the very few that would have two different interpretations (and even there, the second one is the clear primary meaning), but it's not any extra work to add multiple sensible choices where they exist.
A few more interpretive questions: musemathics? remarkablest or incoordinately? Are these best thought of as compounds?
So far, we've just maintained a list of these tags that shouldn't be rendered as text during a transformation, but we could make that more explicit in the markup by adding a property like rend="none", which I might put in <choice>.
There's a way to indicate which of two choices is the primary one, maybe using certainty, but unless you're feeling extra ambitious I wouldn't worry about this for now.
For lack of a more specific term ("nonstandard adverbial construction"?), type="Joycean" sounds about right to me for cases like incoordinately. Also for musemathics, for different reasons. Wiktionary would work here. @sk3853, do you have any ideas about how to handle these, based on your experience with categorizing distinct words?
Looking forward to seeing the PR.
Great, aiming to get the PR up this coming weekend.
The concern I have around both the tag list and the rend="none" options is that they're not as obvious to users (like me) who are expecting the text properties of the XML to just contain the text of Ulysses. If there are already examples like this in the XML, then adding one more wouldn't be a problem - maybe the solution is just calling those non-rendering tags out more explicitly in the doc?
Hi guys, with regard to Joyceans: I'm curious what variables your algorithm uses to distinguish Joycean compounds from nonstandard-compounds, which was my biggest difficulty when going through this manually.
I think that it would be most efficient and consistent to tag these words as
Hi @sk3853, could you give an example of "Joycean compounds from nonstandard-compounds"? I thought the distinction was between standard compounds (word exists with a hyphen in the OED) and nonstandard.
The algorithm isn't doing any distinguishing like that yet. It first finds the set of words in Ulysses that are not in a list of English words, and within those words, finds instances where the word splits into a pair of substrings that are both in the list. Then I sort the results by the geometric mean of the lengths of the original word and each word of the substring pair.
Before I put the PR up, I'm going to cross-reference the list against Wiktionary to distinguish between standard and non-standard compounds, and also go through each word manually to weed out false positives.
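For anyone following along, here is a rough sketch of how the steps described above could look in Python. It is not the actual script: the function name, the min_part_len cutoff, the descending sort order, and the reading of "geometric mean" as the cube root of the product of the three lengths are all assumptions made for illustration.

```python
def compound_guesses(text_words, english_words, min_part_len=3):
    """Rank out-of-vocabulary words that split into two known words.

    Assumptions (not confirmed by the thread): parts shorter than
    min_part_len are ignored, and the score is the geometric mean of
    len(word), len(left) and len(right), sorted highest first.
    """
    vocab = {w.lower() for w in english_words}
    candidates = []
    # Only consider words that are NOT already in the English word list.
    for word in sorted({w.lower() for w in text_words}):
        if word in vocab:
            continue
        # Try every split point and keep splits where both halves are known.
        for i in range(min_part_len, len(word) - min_part_len + 1):
            left, right = word[:i], word[i:]
            if left in vocab and right in vocab:
                score = (len(word) * len(left) * len(right)) ** (1 / 3)
                candidates.append((score, word, left, right))
    # Longer, more evenly split words float to the top of the list.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates


if __name__ == "__main__":
    english = ["buttock", "buttocks", "mothered", "smothered", "the", "news", "boards"]
    text = ["buttocksmothered", "the", "newsboards", "musemathics"]
    for score, word, left, right in compound_guesses(text, english):
        print(f"{score:.2f}\t{word}\t{left} + {right}")
```

A word with more than one valid split (like buttocksmothered) shows up once per split, which lines up with the idea of listing multiple readings where they exist.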
If there are already examples like this in the XML, then adding one more wouldn't be a problem - maybe the solution is just calling those non-rendering tags out more explicitly in the doc?
Good idea. There are already quite a few of these styles of tags, and they render all kinds of artifacts, like latitudes and longitudes for <place> tags. In some cases, I have XSLT that hides them, but there really should be a list of these somewhere, or some other kind of logic to hide them.
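To make the concern concrete, here is a small sketch using Python's standard xml.etree.ElementTree. It shows that a plain text extraction does pick up the contents of <reg>, and one possible way to skip a list of non-rendering tags. The NON_RENDERING set and the rendered_text helper are hypothetical names for illustration, not something that exists in the repo, and the real list of tags to hide would be longer.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of tags whose contents should not appear in the reading text.
NON_RENDERING = {"reg"}

SAMPLE = (
    '<distinct type="nonstandard-compound">buttocksmothered'
    "<choice><reg>buttocks mothered</reg><reg>buttock smothered</reg></choice>"
    "</distinct>"
)


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}reg'."""
    return tag.rsplit("}", 1)[-1]


def rendered_text(elem, skip=NON_RENDERING):
    """Collect text from elem, dropping the contents of any tag in skip."""
    if local_name(elem.tag) in skip:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(rendered_text(child, skip))
        # Tail text belongs to the parent's flow, so keep it even if the child is skipped.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print("".join(root.itertext()))  # naive extraction includes the <reg> text
    print(rendered_text(root))       # filtered extraction keeps only the original word
```

Moving the regularized forms into attributes would avoid the problem for naive extraction, at the cost of needing a convention for more than one reading.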
Hi, sorry, when I wrote that I meant that I had difficulties distinguishing Joycean compound words (buttsmothered; distinct type="Joycean") from words that are maybe better categorized as nonstandard compounds (newsboards; distinct type="nonstandard-compound"). Newsboards wouldn't be a Joycean, since it's a basic combination of two words, but it's also not in the OED as news-boards. The compound category was an easy one to figure out, since the OED came up with the hyphenated word as a suggestion whenever I entered one that was nonhyphenated in Joyce. Your additions are welcome! Just know that the 4 distinct types we have going are compound, nonstandard-compound, Joycean, and archaism. Sorry for the slow replies, my schedule has been crazy recently.
On Tue, Oct 1, 2019 at 7:09 AM, David Roher <notifications@github.com> wrote:
Hi @sk3853, could you give an example of "Joycean compounds from nonstandard-compounds"? I thought the distinction was between standard compounds (word exists with a hyphen in the OED) and nonstandard.
The algorithm isn't doing any distinguishing like that yet. It first finds the set of words in Ulysses that are not in a list of English words, and within words, finds instances where two substring pairs are in the list. Then I sort the list by the geometric mean of the lengths of the original word and each word of the substring pair.
Before I put the PR up, I'm going to cross-reference the list against Wiktionary to distinguish between standard and non-standard compounds, but I'm going to go through each word manually to weed out false positives.
A user started working on this issue via WorksHub.
@imrobintomar started working on this issue via WorksHub.
Hi, @imrobintomar! Glad to see that you've started work on this issue. Let me know if you have any questions along the way!
@imrobintomar, could you say what you had in mind for this issue? And do you have any questions?
@Avrnikh-iziki started working on this issue via WorksHub.
@imrobintomar, have you started work on this issue? I don't see anything in your GitHub account about this yet. Please let me know ASAP.
Hi @Avrnikh-iziki! Could you tell me what you had in mind for this issue?
@Brucedevnairobi started working on this issue via WorksHub.
@draconid719 started working on this issue via WorksHub.
@Natalia-Mikhieieva started working on this issue via WorksHub.
@Natalia-Mikhieieva, what did you have in mind for this issue?
I wrote a hacky algorithm to find likely Joycean compounds. It excludes any words already tagged as compounds in the XML, as well as any words inside of a foreign language tag. There are plenty of false positives, but it does a pretty good job at sending likely ones to the top of the list: compound_guesses.txt
I'd be happy to put up a PR to add a bunch of these to the XML, but I wanted to check before I did to see if you'd be interested/if that was the best way to go about it. Thanks!