open-editions / corpus-joyce-ulysses-tei

James Joyce's novel Ulysses in TEI XML. Work-in-progress.

Identifying place names in Ulysses #27

Open cderven opened 7 years ago

cderven commented 7 years ago

Tagging the corpus to identify place names might form an interesting parallel to some of the work that you’ve been doing to date. Would others see it as sufficiently interesting to identify toponyms, provide some basic geocoding, and then possibly link the locations to external sources like GeoNames?

I’m thinking of the TEI convention of using <placeName> to identify toponyms and <place> to hold data about locations, then linking the two by id. So, taking an example from Wandering Rocks:

<listPlace>
...
  <place xml:id="wrL19" type="route">
    <placeName>Aldborough House</placeName>
    <location>
      <settlement>Dublin</settlement>
      <country>Ireland</country>
      <geo>53.39979659999999 -6.2435338</geo>
    </location>
  </place>
</listPlace>

and

<p> <lb n="100083"/> Near <placeName ref="#wrL19">Aldborough house</placeName> Father Conmee thought of that spendthrift 
<lb n="100084"/>nobleman. And now it was an office or something. </p>

<place> can take a type attribute, which in this example is "route". Is this useful?

Using <placeName> would allow toponym identification, and <place> would contain the data used for geocoding.
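To make the linking concrete, here is a minimal Python sketch of how a script might follow each ref back to its <place> entry and read off the coordinates. It is only an illustration under some assumptions: the fragment is a cut-down version of the example above without the TEI namespace, the function name `resolve_mentions` is made up, and <geo> is read as "latitude longitude".

```python
import xml.etree.ElementTree as ET

# xml:id lives in the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative fragment only: a real TEI file would also declare the TEI namespace.
SAMPLE = """
<text>
  <listPlace>
    <place xml:id="wrL19" type="route">
      <placeName>Aldborough House</placeName>
      <location>
        <geo>53.39979659999999 -6.2435338</geo>
      </location>
    </place>
  </listPlace>
  <p><lb n="100083"/>Near <placeName ref="#wrL19">Aldborough house</placeName>
  Father Conmee thought of that spendthrift nobleman.</p>
</text>
"""


def resolve_mentions(xml_text):
    """Return (mention text, place id, (lat, lon)) for each <placeName ref="#...">."""
    root = ET.fromstring(xml_text)

    # Index the gazetteer entries by xml:id, reading <geo> as "lat lon".
    # A comma separator is tolerated as well.
    coords = {}
    for place in root.iter("place"):
        geo = place.find("location/geo")
        if geo is not None and geo.text:
            lat, lon = (float(v) for v in geo.text.replace(",", " ").split())
            coords[place.get(XML_ID)] = (lat, lon)

    # Only in-text mentions carry a ref; the <placeName> inside <place> does not.
    resolved = []
    for mention in root.iter("placeName"):
        ref = mention.get("ref")
        if ref and ref.startswith("#"):
            place_id = ref[1:]
            resolved.append((mention.text, place_id, coords.get(place_id)))
    return resolved


print(resolve_mentions(SAMPLE))
```

A mention whose ref points at a missing id comes back with `None` for its coordinates, which could double as a quick consistency check on the encoding.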

Ronan alerted me also to this very interesting project: https://muziejus.github.io/wandering-rocks/.

yellwork commented 7 years ago

Sorry it took me a few days to chime in here, Caleb, but this is a terrific idea. It would be incredible to get some geotagging into the edition. Have you a sense of anyone who might have compiled a dossier of the place names (if not the <placeName>s) that are mentioned in the book? As always, I’m thinking how great it would be if we could automate some of this labour – or take advantage of existing scholarship on precisely this topic. I know the Gifford and Slote annotations, for example, highlight and locate a good number of the place names…

cderven commented 7 years ago

I've taken longer to respond to you, Ronan! I haven't come across any full gazetteers for Ulysses. There are several potential options here, though.

  1. Existing annotations like Gifford's or Slote's are definitely viable tools.
  2. Different geotagging tools (Named-Entity Recognition software, etc.) certainly allow partial automation of the process. There are issues around accuracy and precision, and a wide divergence of opinion about their appropriateness for literary corpora, but I think they're a useful first step.
  3. Crowdsourcing, which I see has been mentioned in Issue #9.

From my experience with past projects, some combination of automated and manual processes seems to work. I've encoded place names in Wandering Rocks using the convention above, and I would be happy to merge them with the file in the repository, if that's not jumping the gun. I think there may also be room for a discussion about how you want to model geo elements in the corpus.
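As a rough picture of the automated half of that combination, here is a small Python sketch that proposes candidate spans for a human to confirm. To be clear, this is plain gazetteer lookup with a regular expression, not Named-Entity Recognition, and the seed list and the name `find_candidates` are hypothetical.

```python
import re

# Hypothetical seed list; in practice it might come from annotations or NER output.
GAZETTEER = ["Aldborough House", "Sandymount Strand", "Dublin"]


def find_candidates(text, gazetteer):
    """Return (start, end, matched text) spans to be checked by hand."""
    # Longest names first, so a multi-word name wins over a name nested inside it.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,  # the text has "Aldborough house", the list "Aldborough House"
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


line = "Near Aldborough house Father Conmee thought of that spendthrift nobleman."
for start, end, match in find_candidates(line, GAZETTEER):
    print(start, end, match)
```

The character offsets are kept so that a reviewer, or a later script, can wrap each confirmed span in <placeName> without searching for it again.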

yellwork commented 7 years ago

These all sound like very promising suggestions, Caleb. (He says four weeks later.) Slote/Gifford certainly catch a lot, but the challenge would be wading through the annotations to find the location-specific ones. I’ve no issue with us using named-entity recognition software and geotagging place names. One question I’d have is how the mentions of place names are distinguished. I doubt that everything is just flattened, so that the ‘Dublin’ in ‘The Rocky Road to Dublin’ sung in ‘Nestor’ is treated the same as ‘the Ards of Down’ just recalled by Deasy or the mention of ‘Sandymount Strand’ as the setting of ‘Proteus’, right? (Or is that the work of separate encoding?)

I’d certainly be keen to see your encoded place names be merged into the WR file in the repository. Fire ahead!

cderven commented 7 years ago

There's certainly a need for a strong typology of place to disambiguate these different types of mentions, Ronan. With the small piece of work that I did, I used the model developed for the Literary Atlas of Europe.

I'll upload the work that I've done (probably next week), which may be a good starting point for a discussion of typology, modelling, etc.
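Just to give the typology question something to point at, here is a hedged Python sketch that groups mentions by a type attribute on <placeName>. The fragment is constructed for illustration rather than taken from the edition, and the type values ("title", "recalled", "setting") are made up: they are not the Literary Atlas of Europe categories.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Constructed fragment, not quoted from the edition. The type values are
# hypothetical labels for the three kinds of mention discussed above.
SAMPLE = """
<div>
  <seg><placeName type="title">Dublin</placeName></seg>
  <seg><placeName type="recalled">Ards of Down</placeName></seg>
  <seg><placeName type="setting">Sandymount Strand</placeName></seg>
</div>
"""


def mentions_by_type(xml_text):
    """Group the text of each <placeName> under its type attribute."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for mention in root.iter("placeName"):
        # Mentions without a type are collected under "untyped" for review.
        grouped[mention.get("type", "untyped")].append(mention.text)
    return dict(grouped)


print(mentions_by_type(SAMPLE))
```

Whatever categories we settle on, keeping them as attribute values on the mention, rather than on the <place> entry, would let the same location be a setting in one episode and a passing reference in another.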

JonathanReeve commented 7 years ago

Hi Caleb,

This idea is great. You might be able to use Moacir's geolocations from his Wandering Rocks project, which look like they're up here. He says that he's noticed a few problems in Gifford that he's corrected in the data there.

Feel free to push to GitHub as you're working--no need to upload it all at once. Then send a pull request (instructions here) whenever you're at a stopping point.

I think this will be a great contribution to this edition.