open-editions / corpus-joyce-ulysses-tei

James Joyce's novel Ulysses in TEI XML. Work-in-progress.

List of additional Joycean compounds #51

Open droher opened 5 years ago

droher commented 5 years ago

I wrote a hacky algorithm to find likely Joycean compounds. It excludes any words already tagged as compounds in the XML, as well as any words inside of a foreign language tag. There are plenty of false positives, but it does a pretty good job at sending likely ones to the top of the list: compound_guesses.txt

I'd be happy to put up a PR to add a bunch of these to the XML, but I wanted to check before I did to see if you'd be interested/if that was the best way to go about it. Thanks!

JonathanReeve commented 5 years ago

That's great! This is really cool. How do you imagine encoding it? A quick guess of mine might be something like this:

<distinct type="nonstandard-compound">
buttocksmothered
<choice>
  <reg>buttocks mothered</reg>
  <reg>buttock smothered</reg>
</choice>
</distinct>

droher commented 5 years ago

This is the first TEI doc I've ever used, so I would defer to you on the right encoding. If we go the reg route above, does that mean that the text inside of those tags would be picked up as text properties by XML parsers? Would there be a way to include them as attributes of an element instead?

Also I think the example you chose is one of the very few that would have two different interpretations (and even there, the second one is the clear primary meaning), but it's not any extra work to add multiple sensible choices where they exist.

A few more interpretive questions:

JonathanReeve commented 5 years ago

So far, we've just maintained a list of these tags that shouldn't be rendered as text during a transformation, but we could make that more explicit in the markup by adding a property like rend="none", which I might put in <choice>.

There's a way to indicate which of two choices is the primary one, maybe using certainty, but unless you're feeling extra ambitious I wouldn't worry about this for now.

For lack of a more specific term ("nonstandard adverbial construction"?), type="Joycean" sounds about right to me for cases like incoordinately. Also for musemathics, for different reasons. Wiktionary would work here. @sk3853, do you have any ideas about how to handle these, based on your experience with categorizing distinct words?

Looking forward to seeing the PR.

droher commented 5 years ago

Great, aiming to get the PR up this coming weekend.

The concern I have around both the tag list and the rend="none" options is that they're not as obvious to users (like me) who expect the text properties of the XML to contain only the text of Ulysses. If there are already examples like this in the XML, then adding one more wouldn't be a problem. Maybe the solution is just calling those non-rendering tags out more explicitly in the doc?

sk3853 commented 5 years ago

Hi guys, with regard to Joyceans: I'm curious what variables your algorithm uses to distinguish Joycean compounds from nonstandard-compounds, which was my biggest difficulty when going through this manually. I think that it would be most efficient and consistent to tag these words as example, if you're confident that he coined the terms. Once I finish up the rest of the project, I'm going to run through it again and confirm that my Joycean words aren't just nonstandard compounds and vice versa. Perhaps it would be worthwhile to consider changing those tags so there isn't as much overlap. I don't think it would be a bad idea to include multiple interpretations of words like "buttocksmothered," but there are so many word choices to interpret that I think that kind of project would make more sense further down the road.

droher commented 5 years ago

Hi @sk3853, could you give an example of "Joycean compounds from nonstandard-compounds"? I thought the distinction was between standard compounds (word exists with a hyphen in the OED) and nonstandard.

The algorithm isn't doing any distinguishing like that yet. It first finds the set of words in Ulysses that are not in a list of English words; then, within each of those words, it finds every split where both substrings are in the list. Then I sort the results by the geometric mean of the lengths of the original word and each word of the substring pair.
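
For readers who want to try the idea, the steps described above can be sketched roughly as follows. This is not droher's actual script, only a minimal reconstruction from the description: the function names are hypothetical, and the `min_part` cutoff (ignoring very short substrings) is an added assumption that the comment does not mention.

```python
from math import prod


def split_candidates(word, vocabulary, min_part=3):
    """Yield (left, right) splits of `word` where both halves are known words.

    `min_part` is an assumption of this sketch: it skips splits with
    very short halves, which tend to be false positives.
    """
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in vocabulary and right in vocabulary:
            yield left, right


def score(word, left, right):
    """Geometric mean of the lengths of the word and of its two parts."""
    lengths = (len(word), len(left), len(right))
    return prod(lengths) ** (1 / len(lengths))


def guess_compounds(text_words, vocabulary):
    """Return (score, word, left, right) tuples, highest score first."""
    # Step 1: keep only words that are not in the English word list.
    unknown = {w.lower() for w in text_words} - vocabulary
    # Step 2: find every split of those words into two known words.
    guesses = [
        (score(word, left, right), word, left, right)
        for word in unknown
        for left, right in split_candidates(word, vocabulary)
    ]
    # Step 3: sort so that the longest, most balanced splits come first.
    return sorted(guesses, reverse=True)
```

Calling `guess_compounds(words, vocabulary)` with `vocabulary` as a set of lowercase English words returns every candidate split, so a word with two readings appears twice in the list.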

Before I put the PR up, I'm going to cross-reference the list against Wiktionary to distinguish between standard and non-standard compounds, and also go through each word manually to weed out false positives.

JonathanReeve commented 5 years ago

"If there are already examples like this in the XML, then adding one more wouldn't be a problem - maybe the solution is just calling those non-rendering tags out more explicitly in the doc?"

Good idea. There are already quite a few of these styles of tags, and they render all kinds of artifacts, like latitudes and longitudes for <place> tags. In some cases, I have XSLT that hides them, but there really should be a list of these somewhere, or some other kind of logic to hide them.
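
One possible shape for that "other kind of logic", sketched in Python rather than XSLT: walk the tree and skip the content of any element on an explicit list. The `HIDDEN` set and the function names here are hypothetical, and the project's real list of non-rendering tags is not given in this thread, so only `choice` from the example above is included.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of elements whose content is editorial, not Joyce's text.
HIDDEN = {"choice"}


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def reading_text(element, hidden=HIDDEN):
    """Collect text from `element`, skipping the content of hidden elements."""
    parts = [element.text or ""]
    for child in element:
        if local_name(child.tag) not in hidden:
            parts.append(reading_text(child, hidden))
        # Tail text follows the child's closing tag, so it belongs to the
        # parent's reading text even when the child itself is hidden.
        parts.append(child.tail or "")
    return "".join(parts)
```

With the encoding proposed earlier in the thread, `reading_text` applied to a paragraph containing the `buttocksmothered` example keeps the word itself and drops both `<reg>` readings.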

sk3853 commented 5 years ago

Hi, sorry: when I wrote that, I meant that I had difficulty distinguishing Joycean compound words such as buttocksmothered (distinct type="Joycean") from words that are maybe better categorized as nonstandard compounds, such as newsboards (distinct type="nonstandard-compound"). Newsboards wouldn't be a Joycean, since it's a basic combination of two words, but it's also not in the OED as news-boards. The compound category was an easy one to figure out, since the OED came up with the hyphenated word as a suggestion whenever I entered one that was nonhyphenated in Joyce. Your additions are welcome! Just know that the four distinct types we have going are compound, nonstandard-compound, Joycean, and archaism. Sorry for the slow replies; my schedule has been crazy recently.
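
Since overlap between these categories is a concern, a small consistency check might help before and after a bulk addition. The sketch below tallies `<distinct>` elements by their `type` attribute and lists any value outside the four types named above. The function names are hypothetical, and the exact spelling of the type values in the project's XML is assumed from this comment.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# The four types named in the thread; exact spelling is an assumption.
KNOWN_TYPES = {"compound", "nonstandard-compound", "Joycean", "archaism"}


def tally_distinct_types(root):
    """Count <distinct> elements by their type attribute."""
    counts = Counter()
    for el in root.iter():
        # Compare local names so the check works with or without a namespace.
        if el.tag.rsplit("}", 1)[-1] == "distinct":
            counts[el.get("type", "(missing)")] += 1
    return counts


def unexpected_types(counts, known=KNOWN_TYPES):
    """Return type values that are not among the known categories."""
    return sorted(t for t in counts if t not in known)
```

Running `unexpected_types(tally_distinct_types(root))` on a parsed chapter would surface misspelled or missing type values.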

On Tue, Oct 1, 2019 at 7:09 a.m., David Roher <notifications@github.com> wrote:

Hi @sk3853, could you give an example of "Joycean compounds from nonstandard-compounds"? I thought the distinction was between standard compounds (word exists with a hyphen in the OED) and nonstandard.

The algorithm isn't doing any distinguishing like that yet. It first finds the set of words in Ulysses that are not in a list of English words, and within words, finds instances where two substring pairs are in the list. Then I sort the list by the geometric mean of the lengths of the original word and each word of the substring pair.

Before I put the PR up, I'm going to cross-reference the list against Wiktionary to distinguish between standard and non-standard compounds, but I'm going to go through each word manually to weed out false positives.

workshub[bot] commented 3 years ago

A user started working on this issue via WorksHub.

workshub[bot] commented 2 years ago

@imrobintomar started working on this issue via WorksHub.

JonathanReeve commented 2 years ago

Hi, @imrobintomar! Glad to see that you've started work on this issue. Let me know if you have any questions along the way!

JonathanReeve commented 2 years ago

@imrobintomar, could you say what you had in mind for this issue? And do you have any questions?

workshub[bot] commented 2 years ago

@Avrnikh-iziki started working on this issue via WorksHub.

JonathanReeve commented 2 years ago

@imrobintomar, have you started work on this issue? I don't see anything in your GitHub account about this yet. Please let me know ASAP.

JonathanReeve commented 2 years ago

Hi @Avrnikh-iziki! Could you tell me what you had in mind for this issue?

workshub[bot] commented 2 years ago

@Brucedevnairobi started working on this issue via WorksHub.

workshub[bot] commented 1 year ago

@draconid719 started working on this issue via WorksHub.

workshub[bot] commented 1 year ago

@Natalia-Mikhieieva started working on this issue via WorksHub.

JonathanReeve commented 1 year ago

@Natalia-Mikhieieva, what did you have in mind for this issue?