sillsdev / cog

Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques.
http://sillsdev.github.io/cog/
MIT License

Data Input/Storage Recommendation #47

Open Steve-Miller opened 8 years ago

Steve-Miller commented 8 years ago

Reading through some of the Cog documentation, I gather that Cog was meant to interface with WordSurv. WordSurv to Cog is certainly one obvious data path. If I'm reading the documentation correctly, there is also speculation that Cog data might be stored in WordSurv at some future date.

There are three issues with this: 1) I am a linguist working with a recording that is a dozen years old. I didn't start with WordSurv; I started with Cog. I'm not sure, but I get the feeling others might take the same route. 2) The phonetic transcription UX in both Cog and WordSurv is less than optimal, particularly compared with software such as ELAN. 3) If I remember correctly, WordSurv stores data in MS Access, a proprietary format.

I recommend taking the approach that the SayMore team took to address these issues. SayMore stores its data in ELAN's .eaf format, which lets users transcribe in either program. That is, the same data can be edited using ELAN or SayMore as a front end.

I can even right-click on the data file in SayMore, and this pops up a menu offering to "Open in Program Associated with this File..." This defaults to ELAN, but I get the impression that I could change it to something else if I preferred. So, for example, if a user prefers Praat for some reason, he could use that instead of ELAN.

If Cog used this approach, it would: 1) Give users a refined, open source input method for audio or video phonetic transcription, if desired; 2) Store Cog data in an established, open format that is already used in software data exchange. This should be better than storing data in a unique format that no one else yet knows about.

I recognize that this would require some reworking of Cog's word input UX and data storage mechanisms. I don't have the code, so I don't know how extensive such a rework would be (even if I have suspicions). Even so, I expect the payoff would be significant enough to justify the expense.

ddaspit commented 8 years ago

As you have found, Cog was never intended to be used for phonetic transcription. Cog was designed to specifically focus on comparison and analysis of word lists. The assumption is that the word lists have been captured and transcribed in another application and then imported into Cog. Cog was not intended to replace or duplicate the functionality of tools like WordSurv, Excel, or ELAN.

As you mentioned, one possible future direction for Cog is to access WordSurv data directly. WordSurv is specifically designed for capturing word lists, so it seems like a complementary fit for Cog and WordSurv to interact in this way. ELAN is designed for a different purpose: annotating audio and video data. Although it can be used to transcribe word lists, it is not specifically designed to do so; it is more of a general-purpose tool. The same is true of ELAN's XML format. It is a format for capturing annotations of video and audio data, not word lists. In this regard, it would probably be an inappropriate format for Cog to use natively to store word lists.

Having said that, I do think it makes sense to add EAF import to Cog. No one has requested this feature before, so I'm not sure how much ELAN is used by surveyors, but I could certainly see a surveyor using ELAN for word list transcription. Are you using ELAN to do phonetic transcription of word lists?

The tricky thing with implementing EAF import is the generic nature of ELAN. There is probably no standard way of formatting the ELAN annotations when capturing word list transcriptions; each user could do it differently. The EAF import would need to be flexible enough to deal with the various ways that a user could have set up their annotations.
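To give a rough idea of what I mean by flexible: the import would probably first need to enumerate whatever tiers the file happens to contain and let the user choose which tier holds the transcriptions and which holds the glosses, rather than assuming fixed tier names. A minimal sketch of that tier-discovery step (Python with the standard library, purely for illustration; this is not how Cog is implemented, and none of these names come from actual Cog code):

```python
# Sketch: list the tiers present in an ELAN .eaf file so a user (or an
# import dialog) could map them onto Cog's word-form and meaning fields.
# Illustrative only, not actual Cog code.
import sys
import xml.etree.ElementTree as ET

def list_eaf_tiers(eaf_path):
    """Return (tier_id, linguistic_type, parent_tier) for each tier in the file."""
    root = ET.parse(eaf_path).getroot()
    return [(tier.get("TIER_ID"),
             tier.get("LINGUISTIC_TYPE_REF"),
             tier.get("PARENT_REF"))          # None for top-level tiers
            for tier in root.findall("TIER")]

if __name__ == "__main__":
    # e.g. a word-list file might show a top-level transcription tier
    # with a dependent translation/gloss tier underneath it
    for tier_info in list_eaf_tiers(sys.argv[1]):
        print(tier_info)
```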

Steve-Miller commented 8 years ago

I have indeed used ELAN together with SayMore for phonetic transcription of a word list. As a user, I didn't have to worry about the proper .eaf format. With a click or two, SayMore opened the file in ELAN with the proper format already set up. This eventually became the beginning of my lexicon in FLEx. I'm envisioning the same workflow for Cog.

I doubt I would have thought of using ELAN's .eaf format as a data store on my own. I'm just following behind JohnH, who used it in SayMore, and used it quite well. The philosophy is: "SayMore borrows ELAN’s file format, so that you can do the basics in SayMore (transcription and translation) and then just double-click on the file to do further work in ELAN" if you wish. (http://saymore.palaso.org/news-about-saymore/) Since I've effectively used SayMore + ELAN to transcribe a word list, and since Cog, like SayMore, does not intend to have its own data store, I thought ELAN would be a natural fit for Cog.

In my mind, this is not just fitting WordSurv together with Cog. This is looking at the larger "ecosystem" (to use a buzzword) already established between SayMore, ELAN, and FLEx. I would think having a workflow from WordSurv/Cog all the way into FLEx would be of interest to SIL. I know it's of interest to me. I once wrote about this to Beth, but I found today that Ryan Pennington has already written up and published a paper on it: https://www.academia.edu/6474779/Producing_time-aligned_interlinear_texts_Towards_a_SayMore_FLEx_ELAN_workflow. (This was tough for me to get to, even with an Academia account, so you might have to work at it.)

While storing word list transcriptions natively in a specific .eaf format seems like the most elegant solution to me, and entirely appropriate given everything SayMore has done, an .eaf import is my second choice.

Steve-Miller commented 8 years ago

FWIW, this is what SayMore says about the .eaf structure it expects, copied from the help file:


ELAN allows a richly nested and flexible set of tiers, which may be different for each media file. When SayMore uses an ELAN file as the basis for creating a media file's annotation file, it expects certain tiers to exist in that ELAN file. Others may be present, but they will be ignored by SayMore.

If you have an existing ELAN file and would like to associate it with a media file in SayMore, the following must be true:

- There is a Transcription tier which has a type for which Time-alignable is selected.

- If you already have a tier for translation of those phrases, it must:

  - be a child of Transcription

  - be named Phrase Free Translation

  - have a type for which the stereotype is Symbolic Association.

To use an existing ELAN file as the basis for a SayMore annotation file, select Copy an existing ELAN file on the Start Annotating tab. Then you can work with oral translation annotations and careful speech annotations, and transcription and free translation annotations.

If necessary, open the file in ELAN and work with transcription and free translations there. Be careful not to remove or rename the 'Transcription' and 'Free Translation' tiers, or add any additional tiers.
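To make that concrete, here is a hand-written skeleton of roughly how those two tiers relate inside an .eaf file. The values are made up, the header and most attributes are omitted, and the linguistic type names are my guess, so a file SayMore actually writes will differ in its details; the point is the time-aligned Transcription tier with a symbolic-association Phrase Free Translation child:

```xml
<!-- Hand-written illustration, not SayMore output; header/attributes omitted -->
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
  </TIME_ORDER>
  <!-- Time-alignable tier holding the phonetic transcription -->
  <TIER TIER_ID="Transcription" LINGUISTIC_TYPE_REF="Transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>ˈwɑtɚ</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <!-- Symbolic-association child tier holding the gloss/meaning -->
  <TIER TIER_ID="Phrase Free Translation" PARENT_REF="Transcription"
        LINGUISTIC_TYPE_REF="Phrase Free Translation">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>water</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="Transcription" TIME_ALIGNABLE="true"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="Phrase Free Translation"
                   CONSTRAINTS="Symbolic_Association" TIME_ALIGNABLE="false"/>
</ANNOTATION_DOCUMENT>
```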

ddaspit commented 8 years ago

This information is definitely helpful. Cog could follow the same tier format as SayMore. The "Transcription" tier would be used for the IPA transcription and the "Phrase Free Translation" tier would be used for the meaning. Does that make sense? Obviously, SayMore is used for any kind of recorded session, not just word lists, so Cog would have to add the additional requirements that each word is in a separate annotation and that each meaning in the "Phrase Free Translation" tier is unique. Importing this format wouldn't be hard. I would have to think a lot more about how to use the EAF file directly instead of importing. Cog would still need its own project file, since it contains a lot of other information besides the word lists.
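For what it's worth, a first pass at the import could look something like the sketch below (again Python with the standard library, purely as an illustration rather than actual Cog code; the tier names are just the SayMore defaults described above and would need to be user-configurable):

```python
# Sketch of turning a SayMore-style .eaf file into word-list entries:
# each alignable annotation on the transcription tier is one word form,
# and its symbolic-association child on the gloss tier is the meaning.
# Illustrative only; none of this is actual Cog code.
import sys
import xml.etree.ElementTree as ET

def import_word_list(eaf_path,
                     transcription_tier="Transcription",
                     gloss_tier="Phrase Free Translation"):
    root = ET.parse(eaf_path).getroot()

    # Word-form annotations, keyed by annotation id.
    forms = {}
    for tier in root.findall("TIER"):
        if tier.get("TIER_ID") == transcription_tier:
            for ann in tier.findall("ANNOTATION/ALIGNABLE_ANNOTATION"):
                forms[ann.get("ANNOTATION_ID")] = ann.findtext("ANNOTATION_VALUE", "")

    # Glosses point back at the word forms via ANNOTATION_REF.
    entries = {}  # meaning -> transcription; Cog needs each meaning to be unique
    for tier in root.findall("TIER"):
        if tier.get("TIER_ID") == gloss_tier:
            for ann in tier.findall("ANNOTATION/REF_ANNOTATION"):
                meaning = ann.findtext("ANNOTATION_VALUE", "").strip()
                form = forms.get(ann.get("ANNOTATION_REF"))
                if not meaning or form is None:
                    continue
                if meaning in entries:
                    raise ValueError(f"duplicate meaning in word list: {meaning!r}")
                entries[meaning] = form
    return entries

if __name__ == "__main__":
    for meaning, form in import_word_list(sys.argv[1]).items():
        print(f"{meaning}\t{form}")
```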

Steve-Miller commented 8 years ago

Yes, when I used SayMore/ELAN to transcribe the word list, the phonetic transcription went into the Transcription tier of ELAN, and the meaning/gloss went into the Phrase Free Translation tier.

If I know JohnH, he separated the data tier from the UX tier. SayMore is a Palaso project, isn't it? One suggestion is to use as much of that code as you can.

I nearly emailed you the word list audio recording I annotated and the SayMore project file yesterday. The file size stopped me. (I pay by the MB here, plus I'm not sure if there's a size restriction.) I think what I will do instead is transcribe a one-word "word list" in SayMore/ELAN. If I can do it quickly, I'll either attach it in another comment here or email it to you. Then you can see it for yourself, without trying to figure out how to do it.

I emailed you a message a few minutes ago about a surveyor's perspective of the interaction between WordSurv and Cog. In short, the idea never crossed their minds until I asked them about it. It was a surprise to me, too, when I found it in Cog's help file. As far as I can see now, I don't think anyone will use WordSurv for a data store, nor do I expect people to use Cog and WordSurv together.

Steve-Miller commented 8 years ago

Incidentally, SayMore can read an Audacity file. I started in Audacity, chopping up the word list into individual sound recordings. Really useful. Something more to think about.

Steve-Miller commented 8 years ago

Okay, so we're in luck. I found a recording that has three words in it. It's not a true word list, but it gives you the idea. I annotated this really quickly and gave it some glosses/meanings.

I zipped everything up into a .zip file. It has the .wav file, the SayMore project file, the ELAN .eaf file, and a couple of other files of note. If you install SayMore and ELAN, you should be able to unzip this under the SayMore directory. Unfortunately, Git choked on it, so I'll email it to you.