langcog / childes-db

A SQL interface for the CHILDES child language corpora

for some corpora, individual kids aren't an option #35

Open ebergelson opened 6 years ago

ebergelson commented 6 years ago

hey, just fumbling through a few of the clinical-mor corpora, I noticed that several corpora have substructure (e.g. a special-population sub-folder and a controls sub-folder, with individuals within each) that the db doesn't make available in the visualizer. This might just be a function of needing to tamp down on nested structure options (which is maybe something childes could reorganize?), but here are the first few I noticed:

Ambrose: HL and NH, and then kids within those, but on the shiny the only option is 'all'
Conti: has 4 sub-corpora, at least one of which has named kids within
ENNI: has SLI & TD, and then individuals within each
Ellis-Weismer: shiny shows individual kids but not the main controls/late talkers sub-folder structure

smeylan commented 6 years ago

Noted! I don't think we currently have a way of representing directory-substructure-as-metadata, but as you note a fair number of corpora are organized in this way. Some thoughts below on how to add this to the database.

In the Ellis-Weismer, ENNI, and Ambrose corpora the substructure has some signature in the XML header (the group attribute of the target child in all cases; see below), but this field contains a wide variety of shorthand names that I think we'd need to expand with respect to the specific study to be useful ("late" -> "late talker", etc.). So we could add a group_raw and a group column to the participants table, holding the short and long-form values respectively. To get the long-form, human-useful names, we could specify a mapping from short forms to long forms (a JSON in the form of {corpus:{short : long }}).
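A minimal sketch of how that could look, assuming Python's standard-library ElementTree for reading the header. The function names, the `GROUP_NAMES` dict, and its keys are hypothetical: only "late" -> "late talker" comes from the paragraph above, "LT" mapping to the same long form is a guess, and the sample header is a cut-down stand-in for a real transcript.

```python
import xml.etree.ElementTree as ET

# TalkBank namespace, as declared on the <CHAT> root element.
NS = {"tb": "http://www.talkbank.org/ns/talkbank"}

# Hypothetical {corpus: {short: long}} mapping; entries are illustrative guesses.
GROUP_NAMES = {
    "EllisWeismer": {"late": "late talker", "LT": "late talker"},
}


def target_child_group(xml_text):
    """Return the raw `group` attribute of the Target_Child, or None."""
    root = ET.fromstring(xml_text)
    for participant in root.findall("tb:Participants/tb:participant", NS):
        if participant.get("role") == "Target_Child":
            return participant.get("group")
    return None


def expand_group(corpus, group_raw, mapping=GROUP_NAMES):
    """Return the long-form name, falling back to the raw value if unmapped."""
    if group_raw is None:
        return None
    return mapping.get(corpus, {}).get(group_raw, group_raw)


# Cut-down stand-in for a transcript header (not a complete CHAT file).
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <Participants>
    <participant id="CHI" role="Target_Child" group="LT"/>
    <participant id="INV" role="Investigator"/>
  </Participants>
</CHAT>"""

group_raw = target_child_group(SAMPLE)           # value for a group_raw column
group = expand_group("EllisWeismer", group_raw)  # value for a group column
```

Falling back to the raw value means a corpus with no mapping yet would still get something in group rather than a null, which may or may not be what we want.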

The Conti dataset has a yucky structure and doesn't have any indication of structure in the XML and would have to be hard coded. I can email Brian and ask if there's a logic to this.

To see whether other datasets have similar issues, and to make sure that we are at least representing all parts of the directory structure, we could take every part of the file path during the extraction process and check that it is represented in some field of the database. If a part of the path isn't in the database, we should raise some sort of warning or error. Once the group field is added to the database, we should be able to see the remaining set of corpora with nonstandard structures.
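A rough sketch of that check, assuming each transcript's database values are available as a plain dict during extraction. The function name and the record keys (`corpus`, `target_child`) are hypothetical, and exact string matching is an assumption; the real fields may need normalizing first.

```python
import warnings
from pathlib import PurePosixPath


def unrepresented_path_parts(filename, record):
    """Return the directory parts of `filename` that match no value in `record`."""
    values = {str(v) for v in record.values() if v is not None}
    directories = PurePosixPath(filename).parts[:-1]  # drop the file name itself
    return [part for part in directories if part not in values]


# Hypothetical record for one transcript; the keys are illustrative only.
record = {"corpus": "EllisWeismer", "target_child": "11057"}
missing = unrepresented_path_parts("EllisWeismer/30ec/controls/11057.xml", record)
if missing:
    warnings.warn(f"path parts not represented in the database: {missing}")
```

Here the `30ec` and `controls` directories would be flagged, since neither appears in any field of the record.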

<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://www.talkbank.org/ns/talkbank"
      xsi:schemaLocation="http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd"
      Media="11043" Mediatypes="audio missing"
      PID="11312/c-00006200-1"
      Version="2.6.1"
      Lang="eng"
      Corpus="EllisWeismer42ec"
      Date="1984-01-01">
  <Participants>
    <participant
      id="CHI"
      role="Target_Child"
      language="eng"
      age="P3Y6M"
      sex="female"
      group="LT"

    />
    <participant
      id="INV"
      role="Investigator"
      language="eng"

    />

amsan7 commented 6 years ago

EastAsian works similarly, although currently we don't retain any of its sub-collection information (Indonesian, Korean, Thai, etc.), just the constituent corpora. Unfortunately the programmatic way to only extract corpora for EastAsian did not generalize to Clinical-MOR, which is why they're all merged in there. This can be fixed. But we would still not record sub-collection information, unless we created a column for it in the corpus table.

As for all of the different kinds of sub-directories inside individual corpora, we only make note of them in the filename field in the transcript table. So it is still possible to differentiate these groups using childesr. Adding new columns to handle the varying sub-directory depth in each corpus is difficult in a MySQL db, so it would take a bit longer to alter the schema and have the shiny apps make use of these fields. For example, there may have to be more than one group field per transcript (e.g. EllisWeismer/30ec/controls/11057.xml).
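As a sketch of the filename workaround, assuming the filename values have already been pulled out of the transcript table as strings: the helper below is hypothetical (it is not part of childesr, which is an R package) and returns however many sub-directories sit between the corpus directory and the file, so it copes with varying depth.

```python
from pathlib import PurePosixPath


def subdirectory_groups(filename, corpus):
    """Return the sub-directories between the `corpus` directory and the file."""
    parts = PurePosixPath(filename).parts
    if corpus not in parts:
        return []  # corpus directory not found in this path
    start = parts.index(corpus) + 1
    return list(parts[start:-1])


groups = subdirectory_groups("EllisWeismer/30ec/controls/11057.xml", "EllisWeismer")
```

For the example path this gives two levels, `30ec` and `controls`, which is the "more than one group field per transcript" case.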

smeylan commented 5 years ago

2019 check-in: use the filename field from childesr for now; we don't have bandwidth to handle the general case

smeylan commented 2 years ago

This is fundamentally the same problem as #61