kaleissin / CALS

Conlang Atlas/Archive of Language Structures

Refine the Sources of the Wordlists and think of linking to the Concepticon #39

Closed LinguList closed 8 years ago

LinguList commented 8 years ago

The current form used for the concept lists does not correctly reflect how they were created in the history of linguistics. We're currently building a resource in which different concept lists are linked to a meta-list of concepts, with the application at http://concepticon.clld.org and the data (which exceeds what the current application shows by far) at https://github.com/clld/concepticon-data. We still haven't managed to check the whole list of Buck 1949, but an OCRed version with all concepts has been prepared from the book and will be added once we have found time to check it. It's not online yet, but I'll gladly push it before I have managed to link it properly, if needed.

Anyway: you may find the Concepticon resource useful for the wordlist management.

kaleissin commented 8 years ago

The history as-is is as it was represented on Wikipedia ages ago, complete with now-deleted references... I don't have the necessary access to get hold of locked-in papers anymore.

I have checked IDS/WOLD with Buck, I own a hardcopy of the book. Many eyes makes all bugs shallow though, as I see I must merge b4.48 and i5.57, "egg". For just the concepts, you could start off with downloading the full csv of "maximum buck" from CALS and check that, much less pain than the OCRed version I would think. Especially as I already have added links to the concept sets in the dev-version. D'you have a user on CALS, for the badge? =)
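A duplicate such as the "egg" pair mentioned above can be surfaced mechanically. The following is only a minimal sketch: the `(concept_id, gloss)` tuple layout and the function name `find_duplicate_glosses` are assumptions for illustration, not the schema of the CALS CSV export.

```python
from collections import defaultdict

# Illustrative rows only: the IDs and glosses are the ones quoted in this thread.
# The (concept_id, gloss) layout is an assumption, not the CALS CSV schema.
rows = [
    ("b4.48", "egg"),
    ("i5.57", "egg"),
    ("b9.98", "try,test"),
]

def find_duplicate_glosses(rows):
    """Group concept IDs by trimmed, case-folded gloss; keep groups with more than one ID."""
    by_gloss = defaultdict(list)
    for concept_id, gloss in rows:
        by_gloss[gloss.strip().casefold()].append(concept_id)
    return {gloss: ids for gloss, ids in by_gloss.items() if len(ids) > 1}

duplicates = find_duplicate_glosses(rows)
print(duplicates)  # {'egg': ['b4.48', 'i5.57']}
```

Matching on the gloss string alone only yields merge candidates; whether two entries really denote the same concept still needs a human check.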

Would you happen to know:

  1. Are the existing concept set ids stable? I can't add them if they are subject to change.
  2. Why is "tree" twice in IDS? 1.42 == 8.60. It moved many other things from Buck so why not that? WOLD only has 8.60.
  3. What happened to the original source of WOLD? The copy I have has lots of entries numbered (x)x.999x(x), containing among other things "capybara". They are gone from current WOLD overview, but you can still link to them directly: http://wold.clld.org/meaning/3-9991
  4. Where does the 207-version Swadesh-list on wikipedia and wiktionary come from?

LinguList commented 8 years ago

But is your Buck version literal? I mean, you write:

b9.98 try,test

but Buck 1949 page 652 says:

try (= Make Trial of, Test)

It was for this reason that I started OCRing, and I actually never OCRed the main part, but just the index, which, as I saw now, also turns out to differ from the main part (in the index it says: able, be, but the number 9.95 refers to "can, may").

So the whole point of the Concepticon is to have the sources we link to in their original form, meaning that if we say we link to Buck 1949, we have a literal gloss as it appears in the opus.
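One way to keep this distinction visible in data is to store the literal gloss next to any working label. This is only a sketch under assumptions: the class `ConceptEntry` and its field names are hypothetical and do not reflect the Concepticon or CALS data model. The two glosses are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConceptEntry:
    """One entry of a concept list (hypothetical structure for illustration)."""
    source: str         # the cited work, e.g. "Buck 1949"
    number: str         # the source's own numbering
    literal_gloss: str  # the gloss exactly as printed in the source
    label: str          # a normalized working label, which may differ

    def is_literal(self) -> bool:
        """True only if the working label reproduces the printed gloss exactly."""
        return self.label == self.literal_gloss

entry = ConceptEntry(
    source="Buck 1949",
    number="9.98",
    literal_gloss="try (= Make Trial of, Test)",
    label="try,test",
)
print(entry.is_literal())  # False
```

Keeping both fields lets a list use short labels internally while still making it checkable which entries deviate from the printed source.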

I was considering working on a link to CALS (we currently try to link all lists we can get), especially because of your resource on Buck (1949), which is interesting for us, since we only have IDS and WOLD there. But when I saw that I couldn't tell which parts are actually literal, I shelved this and went back to the OCR of the index.

We discussed CALS here.

Regarding your questions:

  1. yes, they are stable, this is the whole idea behind it.
  2. no idea, you might want to ask the people who are currently working on an update which should appear any time now
  3. no idea, sorry
  4. this is interesting: if you follow up our sources in the Concepticon, you'll see that this list was first proposed by Bernard Comrie, who is also one of the founders of IDS. He just merged Swadesh 100 and Swadesh 200 into the 207-concept list, and it is posted online as a fieldwork guide (for those who want to start basic work on their lexicon, posted as part of a questionnaire: https://www.eva.mpg.de/lingua/tools-at-lingboard/questionnaire/linguaQ.php). The Wiktionary list seems to have its earliest version dating back to 2003, but it's not clear whether its authors were inspired by Comrie or not, since they changed the concept labels. Yet they commit the same error in mixing the concepts, since Swadesh had two different concepts for "child" in his 1952 and his 1955 lists, once meaning "child, descendant" and once meaning "child, young human"; similarly with "burn", which is transitive in one list and intransitive in the other.
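The mixing error described in point 4 can be illustrated with a small sketch. The `(label, sense)` pairs below paraphrase the senses named above; which sense belongs to which of Swadesh's lists is deliberately not asserted, so the lists are simply called `list_a` and `list_b`. Both function names are hypothetical.

```python
# Hypothetical (label, sense) pairs paraphrasing the senses named in the thread.
# No claim is made about which sense belongs to the 1952 or the 1955 list.
list_a = [("child", "descendant"), ("burn", "transitive")]
list_b = [("child", "young human"), ("burn", "intransitive")]

def merge_by_label(*lists):
    """Naive merge: one entry per label, so differing senses collapse into one."""
    merged = {}
    for concept_list in lists:
        for label, sense in concept_list:
            merged.setdefault(label, sense)  # first sense wins, later ones are lost
    return merged

def merge_by_label_and_sense(*lists):
    """Sense-aware merge: keep every distinct (label, sense) pair, in order."""
    merged = []
    for concept_list in lists:
        for pair in concept_list:
            if pair not in merged:
                merged.append(pair)
    return merged

naive = merge_by_label(list_a, list_b)
careful = merge_by_label_and_sense(list_a, list_b)
print(len(naive), len(careful))  # 2 4
```

The naive merge silently drops two of the four concepts, which is the kind of conflation that keeping literal glosses per source is meant to prevent.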

Maybe it's best to establish a CALS concept list independent of its predecessors. Once you link this list to the Concepticon, you will have automatic access to all resources, like Swadesh 1952, Swadesh 1955, Wiktionary, IDS, WOLD, and many more interesting concept lists for specific language families. We would gladly put it into our collection and then link from the Concepticon back to CALS.

LinguList commented 8 years ago

Forgot to add this:

Here's what we note in the Concepticon resource regarding the Wiktionary list, but it was posted online before I discovered the Comrie list, and the note on the list on the right of the page may be refined in the future.

kaleissin commented 8 years ago

IIRC, what I did was merge WOLD and IDS first, then hand-check with Buck.

kaleissin commented 8 years ago

Are y'all aware of the ULD2? http://www.uld3.org/uld2/uld2.html I haven't removed duplicates or merged it in, and AFAIK its purpose is to have a useful set of words for conversation in the world as it exists today. Another list from the conlanging world is dublex, https://web.archive.org/web/20051122060219/http://www.langmaker.com/db/rsc_dublexcompounds.htm, which aims for a balance of maximum compositionality and minimum length of the resulting compounds.

LinguList commented 8 years ago

Thanks for those links, didn't know of them before!

kaleissin commented 8 years ago

What I'm interested in for CALS is more the concepts themselves, and which lists have which concepts, than the literal representation of the concepts in the lists themselves. Time to refactor, I guess.
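The "which lists have which concepts" view amounts to an inverted index from concept to lists. A minimal sketch, assuming a flat `(list_name, number, concept)` layout that is not the actual CALS model; the "tree" numbers are the ones mentioned earlier in this thread.

```python
from collections import defaultdict

# (list_name, number, concept) triples. The "tree" numbers are those quoted
# earlier in the thread; the triple layout itself is an assumption.
entries = [
    ("IDS", "1.42", "tree"),
    ("IDS", "8.60", "tree"),
    ("WOLD", "8.60", "tree"),
]

def index_by_concept(entries):
    """Map each concept to the lists containing it and the numbers used there."""
    index = defaultdict(lambda: defaultdict(list))
    for list_name, number, concept in entries:
        index[concept][list_name].append(number)
    # Convert nested defaultdicts to plain dicts for predictable printing.
    return {concept: dict(lists) for concept, lists in index.items()}

index = index_by_concept(entries)
print(index["tree"])  # {'IDS': ['1.42', '8.60'], 'WOLD': ['8.60']}
```

A concept that appears more than once within a single list, like "tree" in IDS here, shows up directly as a list with several numbers.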

LinguList commented 8 years ago

Well, I understand your point, but this interest in the concepts themselves is the reason why we now have all this mess: people claim they USE a certain concept list, but in fact, due to misspellings or concept labels that were not recorded faithfully, they confuse what is actually meant. There are dozens if not hundreds of examples of this mess, including wrong translations across languages, sloppiness, misunderstandings, etc. This is the reason why we actually launched the Concepticon project: to make a first attempt to clean up this mess. And we came to the conclusion that reflecting the sources as accurately as possible is the only way to acquire a solid basis. You have my contacts now, so whenever you plan on mapping things or want to use our data and encounter problems in finding the right information, don't hesitate to contact us.