Closed bambooforest closed 9 years ago
PH source data for fan has errors in the "specific dialect" column that are probably at fault here. See lines 1070-1072 of the data file.
GM source data for xbr has an errant entry in the "specific dialect" column on line 465 of the GM_SEA file. It happens to come on the very last phoneme of the inventory, which is why it yields only 1 phoneme.
thm occurs twice in the PH data. I can't discern an immediate cause; will look into it.
On Sat, Apr 18, 2015 at 5:38 PM, Steven Moran notifications@github.com wrote:
some weird things come out of the aggregation script that i'm discovering via the R code on multivariate variables i emailed you. for example:
filter(multivariate, coronal==FALSE)
returns:
fan ksf mky sgw thm xbr
filter(final.data, LanguageCode=="ksf") --> 7 phonemes
filter(final.data, LanguageCode=="fan") --> 1 phoneme
filter(final.data, LanguageCode=="mky") --> 1 phoneme
filter(final.data, LanguageCode=="sgw") --> 3 phonemes
filter(final.data, LanguageCode=="thm") --> 1 phoneme
filter(final.data, LanguageCode=="xbr") --> 1 phoneme
table(final.data$LanguageCode)
shows the phoneme counts are off...
— Reply to this email directly or view it on GitHub https://github.com/phoible/phoible/issues/78.
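A rough sketch of how the per-language counts could be screened automatically instead of filtering one code at a time. The toy data frame, the `Phoneme` column name, and the `min.phonemes` threshold are assumptions for illustration, not the real `final.data` layout; base R is used here rather than dplyr.

```r
# Toy stand-in for final.data (assumption: one row per phoneme per language).
final.data <- data.frame(
  LanguageCode = c("ksf", "ksf", "ksf", "fan", "xbr", "aaa", "aaa", "aaa", "aaa"),
  Phoneme      = c("p", "t", "k", "w", "a", "p", "t", "k", "a"),
  stringsAsFactors = FALSE
)

# Count rows (phonemes) per language code.
counts <- table(final.data$LanguageCode)

# Hypothetical threshold: any code with fewer phonemes than this is suspect.
min.phonemes <- 2
suspicious <- names(counts)[as.vector(counts) < min.phonemes]

print(suspicious)
```

On the real data the threshold would presumably be higher, since even the smallest known inventories have more than a handful of phonemes.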
Any idea about the provenance of those spurious "dialects" on lines 1070-1072 in the Fan inventory? harhar, ŋɡb, nym
I looked through the introduction of the source document and there's no obvious mention of a dialect name. OK to just delete those cells?
xbr problem addressed in #70, commit 116b87caaeb494222cbc11dce652780f7ab4346e
Yes, OK to delete those cells. Looks like some copy and paste error.
Specific Dialect Phonemes Questions harhar ŋɡ ŋɡb kp nym ŋmɡb updated from ŋɡb w
from when I was correcting Dr. Green's transcriptions.
A quick table(final.data$SpecificDialect) shows some things I could look at in more detail and correct / normalize:
"Dialect", "Harar", "harhar", "macrolanguage", "nj", "northern", "Northern", "nym", "ŋɡb", "One possibility", "Ordinary"
I removed "One possibility". "harhar", "nym", and "ŋɡb" came from fan, which is now addressed in #70. "nj" came from xbr, which is also fixed in #70. "macrolanguage" seems like potentially good information to keep. "Ordinary" could conceivably be a dialect name, as could "Northern" or "Harar". We should probably check the primary sources for those three. Thanks for volunteering!
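For the normalization side, a minimal sketch of one way to surface values that differ only in capitalization (like "northern" vs "Northern"). The `dialects` vector is a hand-typed stand-in for `final.data$SpecificDialect`, and grouping by lowercase form is just one possible heuristic, not what the aggregation script does.

```r
# Stand-in for final.data$SpecificDialect (assumption: character column with NAs).
dialects <- c("Harar", "harhar", "macrolanguage", "northern",
              "Northern", "Ordinary", NA, "Northern")

# Distinct non-missing values.
values <- unique(dialects[!is.na(dialects)])

# Group the distinct values by their lowercase form; any group with more
# than one member contains spellings that differ only in case.
by.lower <- split(values, tolower(values))
case.variants <- by.lower[lengths(by.lower) > 1]

print(case.variants)
```

This only catches case differences; something like "harhar" vs "Harar" would still need a manual look or a fuzzy string comparison.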
Will do!
Ordinary updated (Ordinary Kreol is the dialect) on my local fork on a branch. Harar is OK. Northern is OK. One thing we should consider is pulling in canonical language names from the Glottolog along with the Glottolog codes.
after the latest PR #70
thm phonemes (PH) are still being dropped in aggregation
but the other problems seem to be fixed, i.e.:
filter(final.data, LanguageCode=="ksf")
filter(final.data, LanguageCode=="fan")
filter(final.data, LanguageCode=="mky")
filter(final.data, LanguageCode=="sgw")
filter(final.data, LanguageCode=="thm")
filter(final.data, LanguageCode=="xbr")
I think the dropping of [thm] is because of the multiple dialects of [thm] listed in the PH raw data. I think the problem will go away when we have proper unique doculect identifiers, but I'm working on a temporary fix in the aggregation code in the meantime.
So multiple inventories with the same language code (and name) but different dialect name? Perhaps all three together are a unique identifier (at least within a given raw data input source?). I can add the InventoryID to Glottolog IDs mapping file to the repo, if that's helpful? At least until we figure out how to pull that stuff directly from the Glottolog repo.
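A quick sketch of the "all three together" idea, in case it helps as a stopgap. The column names `LanguageName` and `Phoneme`, the placeholder names and dialect labels, and the `DoculectKey` column are all hypothetical; this is not the temporary fix in the aggregation code, just an illustration of a composite key.

```r
# Toy raw data (assumption: placeholder names and dialect labels).
raw <- data.frame(
  LanguageCode    = c("thm", "thm", "thm", "thm", "fan", "fan"),
  LanguageName    = c("NameA", "NameA", "NameA", "NameA", "NameB", "NameB"),
  SpecificDialect = c("dialect1", "dialect1", "dialect2", "dialect2", NA, NA),
  Phoneme         = c("p", "t", "p", "k", "p", "t"),
  stringsAsFactors = FALSE
)

# Composite key: code + name + dialect. paste() turns NA into the
# string "NA", so inventories with no dialect still get a stable key.
raw$DoculectKey <- paste(raw$LanguageCode, raw$LanguageName,
                         raw$SpecificDialect, sep = "|")

# Number of distinct keys per language code; more than one means the
# code alone does not identify a single inventory.
keys.per.code <- tapply(raw$DoculectKey, raw$LanguageCode,
                        function(k) length(unique(k)))
multi <- names(keys.per.code)[as.vector(keys.per.code) > 1]

print(multi)
```

Grouping on `DoculectKey` rather than `LanguageCode` would then keep the two inventories apart during aggregation.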
note to self: ISO still showing up in the aggregated results:
filter(final.data, LanguageCode == "ISO")
I am simply not seeing that. When I run "ISO" %in% final.data$LanguageCode it says FALSE.
When was the last time you ran the full aggregation script to re-generate the .RData file? Maybe you're working from an .RData file that was generated with an outdated version of the agg script.
My bad. You're right.
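One way to catch this kind of thing earlier might be a timestamp check before loading. A minimal sketch, assuming the data file and the script are both on disk; `is.stale` is a hypothetical helper, and the temp files below only stand in for the real .RData file and aggregation script.

```r
# Hypothetical helper: TRUE if the data file is older than the script
# that is supposed to generate it.
is.stale <- function(data.file, script.file) {
  file.mtime(data.file) < file.mtime(script.file)
}

# Demo with temp files standing in for the real paths.
data.file   <- tempfile(fileext = ".RData")
script.file <- tempfile(fileext = ".R")
file.create(data.file, script.file)

# Pretend the data was generated an hour before the script last changed.
now <- Sys.time()
Sys.setFileTime(data.file, now - 3600)
Sys.setFileTime(script.file, now)

stale <- is.stale(data.file, script.file)
print(stale)
```

Modification times are only a heuristic (a fresh checkout can reset them), so this is a reminder to re-run the script rather than proof either way.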
fixed in #101.