indexphonemica / data


Where do I start? #20

Open sibkhatru72 opened 4 years ago

sibkhatru72 commented 4 years ago

I'd like to contribute to Index Phonemica by filling in phonological inventories (I'm mainly interested in Western North American languages). I more or less understand the data format, but I have some questions.

1) Do I have to include all allophonic rules from the source or only the major ones? E.g., do I include rules like "/k/ is somewhat backed before back vowels (but does not merge with /q/)"?

2) What do I do when the source has, e.g., č without specifying whether it is postalveolar /tʃ/ or alveolo-palatal /tɕ/ or palatal /c/? Looking here (http://www.indexphonemica.net/doculects/xara1244-1) and comparing this to the source I get the impression that a "palatal" ñ is by default interpreted as /ȵ/, and "palatal" c as a postalveolar /tʃ/, when there is no contrast between the three POAs.

By the way, does the Sinological tradition for alveolo-palatals have a symbol for an alveolo-palatal lateral fricative?

defseg commented 4 years ago

1) The more allophonic rules, the better, but only major ones is fine. If the difference between allophones is subtle enough that you need diacritics to capture it (e.g. k > k̠ / _V[back]), if it's something that's probably too common cross-linguistically to be commonly noted, or if it's otherwise minor, it's OK to leave it off. The amount of allophonic information provided varies so much by source that the absence of an allophonic rule in an entry can't be taken to imply much about the language's phonology. (Absence of evidence is not evidence of absence.)

OTOH, in that specific example, the existence of a phonetic contrast between back velars and uvulars is interesting, so I'd include it. One of IPHON's major long-term design goals is to be readily searchable, and "do any languages have phonetic contrasts between back velars and uvulars?" is a search query that one might want to run.

(The current search interface is best seen as a first draft; there are a lot of search queries that should be easily expressible but aren't yet, and the format is designed to work with the ideal search interface and query language, not the one that currently exists. The format isn't perfect either, though - the phonotactics section in particular is really bad. Ideally it'd be possible to specify all the clusters and have it work out all the World Phonotactics Database features from that (possibly with hinting for cases like Hiw, where there's a /w/ but it patterns as a fricative), but I haven't worked out a good way to do that yet. At any rate, it depends on having good segment featuralization that can integrate with our toolchain, which we currently don't have.)

Rules that are clearly morphophonological can be left out entirely.

2) There's a bar operator for cases where phonemes are underspecified, so I'd do tʃ|tɕ|c. (But I'm not entirely clear yet on the proper use of the IPA c series... do any languages have a phonetic contrast between palatoalveolars, true palatals, and palatovelars?)
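To make the bar notation concrete, here is a minimal Python sketch of how an underspecified segment such as tʃ|tɕ|c could be expanded into its candidate readings. The function names are hypothetical and are not taken from the IPHON toolchain; the only thing assumed from the thread is that `|` separates the alternatives.

```python
def expand_underspecified(segment: str) -> list[str]:
    """Split a bar-joined segment such as 'tʃ|tɕ|c' into its candidate readings.

    Hypothetical helper: assumes only that '|' separates the alternatives.
    """
    return [alt for alt in segment.split("|") if alt]


def could_be(segment: str, candidate: str) -> bool:
    """Return True if `candidate` is one of the readings allowed by `segment`."""
    return candidate in expand_underspecified(segment)


# An illustrative palatal series written with the bar operator.
palatal_series = ["c|tɕ", "ɟ|dʑ", "ç", "ɲ|ȵ", "j"]
for seg in palatal_series:
    print(seg, "->", expand_underspecified(seg))
```

A search layer built on something like this could match an entry either when a query segment is the only reading or when it is one of several possible readings, which keeps underspecified entries findable without forcing a guess at data-entry time.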

The Xârâcùù entry isn't really correct - the palatal series should be c|tɕ ɟ|dʑ ç ɲ|ȵ j. Non-Sinological IPA doesn't distinguish between palatoalveolars and palatals, so these could be either (strictly speaking the anterior coronal series should also be t̪|t etc.), but they can't be postalveolars because /ʃ/ is in its own column. I haven't always been as consistent as I should be, and there are probably many errors and oversights in the database - feel free to fix them if you see them.

3) Looking over the extant documentation here, I notice that there isn't anywhere near as much on getting set up and contributing as there should be. I'll add better documentation soon (hopefully over the weekend), but let me know if you have any problems or questions. (Even if you think they're dumb questions, or if they're about the technical side of things - what I'd like is to make it easy for people to submit inventories even if they don't have much technological experience.)

sibkhatru72 commented 4 years ago

The dumbest question about the technical side of things - where do I send/upload the .ini files, especially if I'm too lazy to learn the GitHub system of branches and pull requests?

defseg commented 4 years ago

The fact that we use Git isn't ideal from a usability perspective... email to indexphonemica аt gmаil and I'll take care of the Git stuff. There should probably be a web form with automatic validation at some point.

Since I haven't finished the documentation updates yet: the Python validation scripts (checking to make sure that the entries are well-formed) require Python 3, but nothing else. Run commit.py abcd1234-1 to validate doculects/abcd1234-1.ini - this should be done before submitting things, unless you just can't get Python to work right.

There's also a convenience script add.py - I usually use this like add.py abcd1234 -n Test --simple, to add a new entry for a doculect called Test with the glottocode abcd1234. The other main command-line option is -b <bibkey>, which will automatically load source information from a Glottolog bibkey, but using this requires installing and setting up pyglottolog (and changing line 96 of add.py since I haven't made the glottolog directory not be hard-coded yet). (Another thing that should probably happen at some point - some later point than a web form, which now that it's occurred to me I think should take a pretty high priority - is a sqlite frontend to pyglottolog, which will be a lot easier to work with, and a lot faster.)

sibkhatru72 commented 4 years ago

I've tried commit.py on various doculects; it always gives the same result, e.g.:

C:\Users\user>py Desktop\data-master\commit.py xara1244-1
Traceback (most recent call last):
  File "Desktop\data-master\commit.py", line 154, in <module>
    validate(doculect) # if it's invalid, this will throw an exception
  File "Desktop\data-master\commit.py", line 84, in validate
    raise IncorrectSectionsError(INCORRECT_SECTIONS_ERROR_MSG)
__main__.IncorrectSectionsError: Sections are incorrect (should be core, source, notes, (phonotactics,) phonemes, allophonic_rules)

defseg commented 4 years ago

Ah, sorry, that's one of the sharp edges I haven't sanded down yet... commit.py needs to be run from the data directory, C:\Users\user\Desktop\data-master\. When you tell it commit.py xara1244-1, it'll go looking for doculects\xara1244-1.ini relative to the current directory - in this case it's trying to find C:\Users\user\doculects\xara1244-1.ini, which doesn't exist.
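The path behaviour described here can be reproduced in a few lines. This is only an illustrative sketch: `doculect_path` is a hypothetical helper, not a function from commit.py, and `PureWindowsPath` is used so that the Windows paths from this thread print the same way on any operating system.

```python
from pathlib import PureWindowsPath


def doculect_path(working_dir: str, doculect_id: str) -> PureWindowsPath:
    """Mimic the lookup described above: doculects/<id>.ini under the working directory.

    Hypothetical helper for illustration; commit.py itself may be structured differently.
    """
    return PureWindowsPath(working_dir) / "doculects" / f"{doculect_id}.ini"


# Running from the home directory looks in the wrong place...
from_home = doculect_path(r"C:\Users\user", "xara1244-1")
# ...while running from the data directory finds the file.
from_data_dir = doculect_path(r"C:\Users\user\Desktop\data-master", "xara1244-1")

print(from_home)      # C:\Users\user\doculects\xara1244-1.ini
print(from_data_dir)  # C:\Users\user\Desktop\data-master\doculects\xara1244-1.ini
```

In practice this means changing into the data directory first (for example with `cd Desktop\data-master`) and then running `py commit.py xara1244-1` from there.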

I've just pushed a new version that has a better error message for that (and for regular file not found, which also used to throw that error).
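For readers curious what such a check might look like, below is a rough sketch that reports a missing file separately from a file with the wrong sections. It is not the code in commit.py: the section names come from the error message quoted earlier in this thread, while the function name, the use of `configparser`, and the decision to ignore section order are all assumptions.

```python
import configparser
from pathlib import Path

# Section names taken from the error message quoted in this thread.
REQUIRED_SECTIONS = ["core", "source", "notes", "phonemes", "allophonic_rules"]
OPTIONAL_SECTIONS = {"phonotactics"}


class IncorrectSectionsError(Exception):
    """Raised when a doculect file has missing or unexpected sections."""


def check_sections(path) -> list[str]:
    """Hypothetical check: distinguish 'file not found' from 'wrong sections'."""
    path = Path(path)
    if not path.is_file():
        # A distinct error for the "ran it from the wrong directory" case.
        raise FileNotFoundError(
            f"No doculect file at {path.resolve()} (are you in the data directory?)"
        )
    # Assumption: entries may contain bare keys, '%' signs, and repeated keys.
    parser = configparser.ConfigParser(
        allow_no_value=True, interpolation=None, strict=False
    )
    parser.read(path, encoding="utf-8")
    found = parser.sections()
    missing = [s for s in REQUIRED_SECTIONS if s not in found]
    unexpected = [
        s for s in found
        if s not in REQUIRED_SECTIONS and s not in OPTIONAL_SECTIONS
    ]
    if missing or unexpected:
        raise IncorrectSectionsError(
            f"{path.name}: missing sections {missing}, unexpected sections {unexpected}"
        )
    return found
```

Calling `check_sections("doculects/xara1244-1.ini")` from the wrong directory would then raise `FileNotFoundError` with the resolved path in the message, rather than a misleading complaint about sections.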

sibkhatru72 commented 4 years ago

Thanks, now it works.

sibkhatru72 commented 4 years ago

Should I include URLs for papers behind a paywall?

defseg commented 3 years ago

@sibkhatru72 URL, DOI, or Glottocode - ideally all three, but one should be sufficient. (Many papers don't even have DOIs.)

(I've been pretty busy lately and haven't had the time to work on this or check notifications, but I'll get back to it soon)

sibkhatru72 commented 3 years ago

There is no line for DOI in the template. This probably should be fixed.

defseg commented 3 years ago

@sibkhatru72 Fixed. At some point there should probably be a better format for references, but developing one is extremely out of scope and there are no good standards that I know of. (Official citation formats are too tedious.)

The current official stance is that author/title/year is the absolute minimum and ideally there'd be either a DOI or a Glottocode; in cases where neither exists (which are common), a URL is acceptable.
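The stance above could be expressed as a small check that separates hard requirements from recommendations. This is a sketch only: the field names (`author`, `title`, `year`, `doi`, `glottocode`, `url`) and the function are hypothetical, and the real template keys may differ.

```python
def review_source(source: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a source entry under the policy described above.

    Field names are assumptions for illustration, not the actual template keys.
    """
    errors = []
    warnings = []

    # Author, title and year are the absolute minimum.
    for field in ("author", "title", "year"):
        if not source.get(field):
            errors.append(f"missing required field: {field}")

    # A DOI or Glottocode is preferred; a URL is the fallback when neither exists.
    has_identifier = bool(source.get("doi") or source.get("glottocode"))
    if not has_identifier:
        if source.get("url"):
            warnings.append("no DOI or Glottocode; relying on URL")
        else:
            warnings.append("no DOI, Glottocode, or URL given")

    return errors, warnings


example = {"author": "A. Author", "title": "A Grammar", "year": "1999", "url": "https://example.org/grammar"}
print(review_source(example))
```

Keeping the identifier checks as warnings rather than errors reflects that many sources have neither a DOI nor a Glottocode, so rejecting them outright would block valid entries.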

Probably at some point I should figure out what Glottolog does and try to get upstream of them so that every IPHON source has a Glottocode. But formal release is waiting on better featuralization.