phoible / dev

PHOIBLE data and development.
https://phoible.org/
GNU General Public License v3.0

aggregating script dropping data? #78

Closed bambooforest closed 9 years ago

bambooforest commented 9 years ago

some weird things come out of the aggregation script that i'm discovering via the R code on multivariate variables i emailed you. for example:

filter(multivariate, coronal==FALSE)

returns:

fan ksf mky sgw thm xbr

filter(final.data, LanguageCode=="ksf") --> 7 phonemes
filter(final.data, LanguageCode=="fan") --> 1 phoneme
filter(final.data, LanguageCode=="mky") --> 1 phoneme
filter(final.data, LanguageCode=="sgw") --> 3 phonemes
filter(final.data, LanguageCode=="thm") --> 1 phoneme
filter(final.data, LanguageCode=="xbr") --> 1 phoneme

table(final.data$LanguageCode)

shows the phoneme counts are off...
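
A minimal sketch of how the short inventories could be pulled straight out of that table, assuming final.data has one row per phoneme and a LanguageCode column as in the calls above. The toy data frame, the placeholder codes, and the cutoff of 10 are illustrative assumptions, not PHOIBLE values.

```r
# Toy stand-in for final.data: one row per phoneme (all values are made up).
final.data <- data.frame(
  LanguageCode = c(rep("aaa", 12), "fan", rep("sgw", 3)),
  Phoneme = paste0("p", 1:16),
  stringsAsFactors = FALSE
)

# Count rows (phonemes) per language code.
counts <- table(final.data$LanguageCode)

# Flag codes whose inventories look implausibly small.
# The cutoff of 10 is an arbitrary illustrative threshold.
suspect <- names(counts)[as.vector(counts) < 10]
print(suspect)
```

Any code flagged this way still needs a look at the raw source, since a very small count may also be a legitimate partial inventory.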

drammock commented 9 years ago

PH source data for fan has errors in the "specific dialect" column that are probably at fault here. See lines 1070-1072 of the data file.

GM source data for xbr has an errant entry in the "specific dialect" column on line 465 of the GM_SEA file. It happens to come on the very last phoneme of the inventory, which is why it yields only 1 phoneme.

thm occurs twice in the PH data. I can't discern an immediate cause; will look into it.

drammock commented 9 years ago

Any idea about the provenance of those spurious "dialects" on lines 1070-1072 in the Fan inventory?

harhar, ŋɡb, nym

I looked through the introduction of the source document and there's no obvious mention of a dialect name. OK to just delete those cells?

drammock commented 9 years ago

xbr problem addressed in #70, commit 116b87caaeb494222cbc11dce652780f7ab4346e

bambooforest commented 9 years ago

Yes, OK to delete those cells. Looks like some copy and paste error.

| Specific Dialect | Phonemes | Questions |
| --- | --- | --- |
| harhar | ŋɡ | |
| ŋɡb | kp | |
| nym | ŋmɡb | updated from ŋɡb |
| | w | |

from when I was correcting Dr. Green's transcriptions.

bambooforest commented 9 years ago

A quick table(final.data$SpecificDialect) shows some things I could look at in more detail and correct / normalize:

Dialect
Harar
harhar
macrolanguage
nj
northern
Northern
nym
ŋɡb
One possibility
Ordinary'

drammock commented 9 years ago

I removed "One possibility". "harhar", "nym", and "ŋɡb" came from fan, which is now addressed in #70. "nj" came from xbr which is also fixed in #70. "macrolanguage" seems like potentially good information to keep. "Ordinary" could conceivably be a dialect name, as could "Northern" or "Harar". We should probably check the primary sources for those three. Thanks for volunteering!

bambooforest commented 9 years ago

Will do!

bambooforest commented 9 years ago

Ordinary updated (Ordinary Kreol is the dialect) on my local fork on a branch. Harar is OK. Northern is OK. One thing we should consider is pulling in canonical language names from Glottolog along with the Glottolog codes.

bambooforest commented 9 years ago

after the latest PR #70

thm phonemes (PH) are still being dropped in aggregation

but the other problems seem to be fixed, i.e.:

filter(final.data, LanguageCode=="ksf")
filter(final.data, LanguageCode=="fan")
filter(final.data, LanguageCode=="mky")
filter(final.data, LanguageCode=="sgw")
filter(final.data, LanguageCode=="thm")
filter(final.data, LanguageCode=="xbr")

drammock commented 9 years ago

I think the dropping of [thm] is because of the multiple dialects of [thm] listed in the PH raw data. I think the problem will go away when we have proper unique doculect identifiers, but I'm working on a temporary fix in the aggregation code in the meantime.
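
For illustration only, here is a rough sketch of how language codes listed under more than one dialect could be found in a raw source. It is not the aggregation script's actual logic; the data frame, the placeholder codes, and the dialect labels are all made up, and only the column names LanguageCode and SpecificDialect follow the ones used in this thread.

```r
# Toy raw data: the placeholder code "aaa" appears under two dialects.
raw <- data.frame(
  LanguageCode = c("aaa", "aaa", "aaa", "aaa", "bbb", "bbb"),
  SpecificDialect = c("A", "A", "B", "B", "", ""),
  Phoneme = c("p", "t", "p", "k", "m", "n"),
  stringsAsFactors = FALSE
)

# Distinct (code, dialect) pairs, i.e. candidate doculects.
pairs <- unique(raw[, c("LanguageCode", "SpecificDialect")])

# Codes that map to more than one dialect cannot be keyed by code alone.
n.dialects <- table(pairs$LanguageCode)
ambiguous <- names(n.dialects)[as.vector(n.dialects) > 1]
print(ambiguous)
```

Any step that groups or de-duplicates rows by LanguageCode alone would treat the two "aaa" dialects as one inventory, which is the kind of collision a unique doculect identifier would avoid.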

bambooforest commented 9 years ago

So multiple inventories with the same language code (and name) but different dialect name? Perhaps all three together are a unique identifier (at least within a given raw data input source?). I can add the InventoryID to Glottolog IDs mapping file to the repo, if that's helpful? At least until we figure out how to pull that stuff directly from the Glottolog repo.
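
A small sketch of that idea, assuming each inventory can be described by a source label plus language code, language name, and dialect name. The Source and LanguageName columns, the DoculectID column, and every value below are hypothetical stand-ins, not the real raw-data layout.

```r
# Toy inventory list; every value here is made up for illustration.
inventories <- data.frame(
  Source = c("PH", "PH", "PH", "GM"),
  LanguageCode = c("aaa", "aaa", "bbb", "aaa"),
  LanguageName = c("NameA", "NameA", "NameB", "NameA"),
  SpecificDialect = c("north", "south", "", ""),
  stringsAsFactors = FALSE
)

# Composite key: source + code + name + dialect.
full.key <- paste(inventories$Source, inventories$LanguageCode,
                  inventories$LanguageName, inventories$SpecificDialect,
                  sep = "|")

# The key works as an identifier only if it never repeats.
is.unique <- !any(duplicated(full.key))

# Temporary doculect ID: position of each key among the distinct keys.
inventories$DoculectID <- match(full.key, unique(full.key))
print(inventories)
```

If is.unique comes back FALSE on the real data, the three fields are not enough and a dedicated ID such as the InventoryID mapping would still be needed.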

bambooforest commented 9 years ago

note to self: ISO still showing up in the aggregated results:

filter(final.data, LanguageCode == "ISO")

drammock commented 9 years ago

I am simply not seeing that. When I run "ISO" %in% final.data$LanguageCode it says FALSE.

drammock commented 9 years ago

when was the last time you ran the full aggregation script to re-generate the .Rdata file? Maybe you're working from an .RData file that was generated with an outdated version of the agg script.
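
One quick way to check for that, sketched with base R file timestamps. The helper name is_stale and both paths are hypothetical; temporary files stand in for the real aggregation script and .RData file.

```r
# Returns TRUE when the saved data file is older than the script that builds it.
# Both arguments are file paths; the names here are illustrative only.
is_stale <- function(data.path, script.path) {
  file.info(data.path)$mtime < file.info(script.path)$mtime
}

# Demonstration with temporary files standing in for the real ones.
data.path <- tempfile(fileext = ".RData")
script.path <- tempfile(fileext = ".R")
x <- 1
save(x, file = data.path)
writeLines("# placeholder script", script.path)

# Pretend the data file was generated an hour before the script last changed.
Sys.setFileTime(data.path, Sys.time() - 3600)
print(is_stale(data.path, script.path))
```

A TRUE result only says the script changed after the data was saved; it does not say whether that change affects the output.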

bambooforest commented 9 years ago

My bad. You're right.

drammock commented 9 years ago

fixed in #101.