What you do is: link all varieties to the same glottocode, but please give us geolocations for all locations.
They sure have them, right?
Yes, geolocations are available, and I have a file from which I can extract them. How should I share them with you? Do you have any preferences about the data format, or the location where the file should be uploaded?
In the `raw` folder, right?
Maybe rather `etc/`. Isn't the "raw" data from someone else?
I derived the "raw" data from data I got from Simonetta, who herself derived this data from the ALT.
So as I see this: Simonetta is the one with direct access to the online version of the ALT. Here, I found the file from which all the data in the cldf folder is currently converted. But no geolocations were there. So this was a dump of the online data, at least it looks like it. If there are geolocations, they will need to be converted to plain CSV that can be included in the cldfbench script, so I think this qualifies as `raw`, if we assume the semantics that `raw` is what we must standardize and `etc` is additional information edited by us.
@arubehn you asked "which format", and I'd explicitly ask for what you got here, not what you can convert it to, since we want to make this conversion step transparent with cldfbench.
But in case I misunderstood where these files come from, @arubehn, I'd ask that we start afresh from a new dump of the original data provided by Simonetta, which is then placed in the `raw` folder.
Sure, we can do that. Yesterday I realized that there was an error (one that might turn out to be quite severe) at some point in the data transformation, since some forms are mapped to the wrong variety - so at this point, there is no way around re-running everything (and making sure we really have clean data).
I will upload the raw data as Simonetta gave it to me. Phonetic transcriptions are all stored in simple text files, where one file corresponds to one concept. The files are named accordingly (e.g. `bischero_IPA.fon`) and just list phonetic transcriptions by site. It is important to note here that we only want to retain the most frequently given form per site - in most cases, there are slightly different realizations of the same lemma at the same site (depending on the informant).
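To make that selection step explicit, here is a minimal sketch (not the actual conversion code) of how the most frequent form per site could be picked. It assumes a hypothetical `site<TAB>transcription` layout per line; the real `.fon` files may well be laid out differently.

```python
# Hypothetical sketch of "keep the most frequent form per site".
# Assumption (not confirmed against the raw data): each line of a *_IPA.fon
# file holds "site<TAB>transcription"; adjust the parsing to the real layout.
from collections import Counter, defaultdict


def most_frequent_forms(path):
    forms_by_site = defaultdict(Counter)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t", 1)
            if len(parts) != 2:
                continue  # skip empty or malformed lines
            site, form = parts
            forms_by_site[site][form] += 1
    # for every site, keep only the transcription that was given most often
    return {site: counts.most_common(1)[0][0]
            for site, counts in forms_by_site.items()}


if __name__ == "__main__":
    print(most_frequent_forms("raw/bischero_IPA.fon"))
```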
As for the geolocations, they are given in a `.kml` file, together with a bunch of other information.
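For what it's worth, here is a minimal sketch of how coordinates could be pulled out of such a file into plain CSV with the standard library alone, assuming the usual `Placemark`/`Point`/`coordinates` structure; the file names are placeholders, and the real KML may nest things differently.

```python
# Hedged sketch: extract site names and coordinates from a KML file using only
# the standard library. Assumes the common <Placemark><name>/<Point><coordinates>
# structure; the actual ALT file may organise the data differently.
import csv
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def kml_to_rows(path):
    rows = []
    root = ET.parse(path).getroot()
    for placemark in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext("kml:Point/kml:coordinates",
                                    default="", namespaces=KML_NS)
        if not coords.strip():
            continue  # Placemark without a point geometry
        lon, lat = coords.strip().split(",")[:2]  # KML order is lon,lat[,alt]
        rows.append({"Name": name.strip(), "Latitude": lat, "Longitude": lon})
    return rows


if __name__ == "__main__":
    # placeholder file names, not the actual paths in this repository
    with open("raw/alt_sites.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "Latitude", "Longitude"])
        writer.writeheader()
        writer.writerows(kml_to_rows("raw/alt_sites.kml"))
```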
I have now added geolocations and Glottolog data to the language varieties - for extracting the coordinates, I wrote a simple Python script, `read_geodata.py`, that is currently located in the top-level directory. @LinguList and @xrotwang, feel free to move or adjust that file in order to conform to your intended workflows (and ideally tell me what the best practice would be, so I can learn it myself).
If I use helper scripts like this one, I put them in `raw`. The idea is that they are used only once, so it is good to have them, but they don't need to be in the main folder. The same also applies to the shell script, which I left there so that you can find it ;-)
So I just move both helper scripts to `raw/`?
Linking this dataset to Glottolog might be a bit pointless, given that all "languages" are closely related dialects; but if we want to do it nonetheless, I think Fiorentino would be the best candidate, since it essentially covers Tuscan varieties. I am aware of some similar Lexibank datasets where many of the covered varieties are too close to each other to have distinct Glottocodes; one that comes to mind spontaneously is `leeainu`, where every variety is attributed to one of only two Glottocodes (Hokkaido Ainu and Sakhalin Ainu). What would you suggest - should we add Glottolog references (and if so, what would be the "correct" procedure for that)?