lingdb / Sound-Comparisons

Exploring phonetic diversity across language families —
http://www.soundcomparisons.com

Change praat script to split up sound files to *not* overwrite transcriptions with blanks. #347

Open PaulHeggarty opened 8 years ago

PaulHeggarty commented 8 years ago

Hans-Jörg, issue created here for the record, one to discuss first before you take any action, but we think this may explain the mysterious losses of transcriptions that we've been noticing every now and then.
Transcription records should perhaps only be sent out for Germanic and Mapudungun, because these are the only families for which transcriptions are actually entered in the TextGrids rather than in Excel sheets. Currently, if one re-runs the script, for example to repair missing sound files, it may delete all transcriptions loaded from Excel sheets.

Bibiko commented 8 years ago

The logic behind the SQL script generated by the extractWAVFiles Praat script is the following:
• it will insert a new record with all column data out of the TextGrid file for a word index if the word index doesn't exist in the given table
• it will update an existing row and all its columns based on the unique key computed by the columns StudyIx, FamilyIx, IxElicitation, IxMorphologicalInstance, AlternativePhoneticRealisationIx, AlternativeLexemIx, and LanguageIx with all column data out of the TextGrid file.

In other words, if one re-runs the script and runs the generated SQL statements, all existing data in the database will be overwritten by the content of the underlying TextGrid file. For example, if the TextGrid file has no data for the transcription field, an empty string will be stored in the database even if a transcription was saved in the database earlier. Thus one has to be careful which TextGrid is used, and none of the fields StudyIx, FamilyIx, IxElicitation, IxMorphologicalInstance, AlternativePhoneticRealisationIx, AlternativeLexemIx, and LanguageIx may be changed; otherwise, after an update, "ghost records" could perturb the web script [@runjak ?].
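To make the overwrite concrete, here is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for the real MariaDB one. The seven key columns are the ones listed above; the table name `Transcriptions`, the column name `Phonetic`, and the key values are assumptions for illustration only, and SQLite's `INSERT OR REPLACE` merely imitates the insert-or-update behaviour described, not the actual generated SQL.

```python
import sqlite3

# Key columns as listed in the comment above. The table name "Transcriptions"
# and the "Phonetic" column are hypothetical stand-ins.
KEY = ["StudyIx", "FamilyIx", "IxElicitation", "IxMorphologicalInstance",
       "AlternativePhoneticRealisationIx", "AlternativeLexemIx", "LanguageIx"]
KEY_LIST = ", ".join(KEY)
KEY_MATCH = " AND ".join(f"{col} = ?" for col in KEY)


def make_db():
    """Create an in-memory table whose primary key is the seven key columns."""
    conn = sqlite3.connect(":memory:")
    key_defs = ", ".join(f"{col} INTEGER NOT NULL" for col in KEY)
    conn.execute(
        f"CREATE TABLE Transcriptions ({key_defs}, "
        f"Phonetic TEXT NOT NULL DEFAULT '', PRIMARY KEY ({KEY_LIST}))"
    )
    return conn


def run_generated_sql(conn, key, phonetic):
    """Insert the row, or replace the whole row if the key already exists."""
    marks = ", ".join("?" * (len(KEY) + 1))
    conn.execute(
        f"INSERT OR REPLACE INTO Transcriptions ({KEY_LIST}, Phonetic) "
        f"VALUES ({marks})",
        (*key, phonetic),
    )


def phonetic_for(conn, key):
    """Return the stored transcription for a key, or None if there is no row."""
    row = conn.execute(
        f"SELECT Phonetic FROM Transcriptions WHERE {KEY_MATCH}", key
    ).fetchone()
    return None if row is None else row[0]


conn = make_db()
word = (1, 2, 10, 0, 0, 0, 42)        # made-up key values
run_generated_sql(conn, word, "")     # first run: dummy record, no transcription
conn.execute(                         # transcription arrives from elsewhere
    f"UPDATE Transcriptions SET Phonetic = ? WHERE {KEY_MATCH}",
    ("ʃtaɪn", *word),
)
before = phonetic_for(conn, word)
run_generated_sql(conn, word, "")     # re-run from a TextGrid without transcription
after = phonetic_for(conn, word)
print(repr(before), "->", repr(after))
```

In this sketch the second run replaces the stored transcription with the empty string, which matches the loss described above.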

The normal workflow is:
1) the large sound file will be aligned (words) in Praat mostly without any transcriptions
2) one runs the extractWAVFile script and gets the resulting SQL file
3) this SQL file will be uploaded [the database has now dummy records meaning some information are still missing]
4) transcriptions (and maybe other information) will be added to the aligned TextGrid file
5) one re-runs the extractWAVFile script and gets the resulting SQL file (now with transcriptions)
6) this new SQL will be uploaded

Correct? If so, everything should work.

One issue just came to mind: the database's encoding is set to UTF-8, so the generated SQL file must be encoded in UTF-8 as well. Praat's internal output encoding can vary with the OS used and with customized settings. Normally, if one tries to run an SQL file with an invalid encoding, the database throws an error message, but ... This leads to another topic: we started SoundComparisons on a MySQL server and later changed to MariaDB (for several reasons, such as Open Source); the docs tell us that there are no differences, but ... [@runjak ?].
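One way to rule out the encoding question before uploading would be to check that the generated SQL file is valid UTF-8. The following Python sketch is only an illustration; the helper name and the sample statement are made up, and it does not say anything about what Praat actually writes.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical statement containing a non-ASCII character.
statement = "UPDATE Transcriptions SET Phonetic = 'ö';"
good = statement.encode("utf-8")      # what the database expects
bad = statement.encode("latin-1")     # same text written in a legacy encoding

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
```

For a real file one would pass the result of reading it in binary mode, e.g. `pathlib.Path(...).read_bytes()`. A pure-ASCII file passes this check whatever the settings, so the check only becomes informative once transcriptions with IPA characters are present.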

It's a bit cumbersome to try to reproduce this behaviour. One has to know what was done, using which data, in what order, etc. At any rate, I'll try to find such a scenario.

runjak commented 8 years ago

Hi,

I agree with the observations by @Bibiko.

PaulHeggarty commented 8 years ago

We cannot have just one workflow, unfortunately. The problem is that in most cases, for most families, transcriptions are only ever entered in Excel, never in the TextGrids. That will be the norm in future.

So in fact the main purposes of the script are just:
1) Create the individual word by word sound files. And re-create them if necessary to fix problems.
2) Create dummy records the first time. This should never be re-done.

I.e. we do not often want the script to export transcriptions, because they are never normally entered in praat textgrids now. But we still need to be able to rerun the script to re-export sound files again in certain problematic cases. It will always be necessary to have this ability.

I suggest that if possible (which I presume it is, by changing the sql code), the sql output should be changed to a syntax that does create a new record if no corresponding record exists, but does not overwrite any existing record. That should fix the problem in practice in all current cases.

(This way, in the now very rare cases where transcriptions are entered into the TextGrid, and later corrected there too, we can manually run an SQL command, as I do from time to time, to delete all records for that language. The Praat script will then re-export the transcriptions from the TextGrid.)
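A minimal sketch of this proposal, again in Python with an in-memory SQLite database standing in for MariaDB. The table name `Transcriptions`, the column `Phonetic`, and the key values are assumptions. SQLite's `INSERT OR IGNORE` is used here; MariaDB has a similar `INSERT IGNORE` statement, but whether it fits the real schema would need checking.

```python
import sqlite3

# Key columns as named earlier in the thread; table and "Phonetic" are hypothetical.
KEY = ["StudyIx", "FamilyIx", "IxElicitation", "IxMorphologicalInstance",
       "AlternativePhoneticRealisationIx", "AlternativeLexemIx", "LanguageIx"]
KEY_LIST = ", ".join(KEY)
KEY_MATCH = " AND ".join(f"{col} = ?" for col in KEY)


def make_db():
    """Create an in-memory table whose primary key is the seven key columns."""
    conn = sqlite3.connect(":memory:")
    key_defs = ", ".join(f"{col} INTEGER NOT NULL" for col in KEY)
    conn.execute(
        f"CREATE TABLE Transcriptions ({key_defs}, "
        f"Phonetic TEXT NOT NULL DEFAULT '', PRIMARY KEY ({KEY_LIST}))"
    )
    return conn


def insert_if_absent(conn, key, phonetic):
    """Create the record if the key is new; leave an existing record untouched."""
    marks = ", ".join("?" * (len(KEY) + 1))
    conn.execute(
        f"INSERT OR IGNORE INTO Transcriptions ({KEY_LIST}, Phonetic) "
        f"VALUES ({marks})",
        (*key, phonetic),
    )


def delete_language(conn, language_ix):
    """Manual step: drop all records of one language so they can be re-exported."""
    conn.execute("DELETE FROM Transcriptions WHERE LanguageIx = ?", (language_ix,))


def phonetic_for(conn, key):
    """Return the stored transcription for a key, or None if there is no row."""
    row = conn.execute(
        f"SELECT Phonetic FROM Transcriptions WHERE {KEY_MATCH}", key
    ).fetchone()
    return None if row is None else row[0]


conn = make_db()
word = (1, 2, 10, 0, 0, 0, 42)            # made-up key values
insert_if_absent(conn, word, "")          # first run creates the dummy record
conn.execute(f"UPDATE Transcriptions SET Phonetic = ? WHERE {KEY_MATCH}",
             ("ʃtaɪn", *word))            # transcription loaded from Excel
insert_if_absent(conn, word, "")          # re-run: existing record is not touched
kept = phonetic_for(conn, word)

delete_language(conn, 42)                 # rare case: TextGrid is the source
insert_if_absent(conn, word, "ʃtain")     # re-export now creates a fresh record
reexported = phonetic_for(conn, word)
print(repr(kept), repr(reexported))
```

The trade-off in this sketch is the one described above: a re-run can never damage existing records, but corrections made in a TextGrid only reach the database after the records for that language have been deleted by hand.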

An alternative is to clone the existing script into two versions:
1) As now, segment sound files and export SQL records.
2) Re-export sound files only.

PaulHeggarty commented 8 years ago

Any advance on this topic? It still needs to be fixed, I think, and urgently, to stop creating lots more work recreating good transcriptions that the praat script deletes.

Bibiko commented 8 years ago

One option could be to modify the Praat script in such a way that only information from a non-empty TextGrid cell will be uploaded. The side effect of doing so would be that one cannot delete a data field this way (e.g. a wrongly entered lex2); that would have to be deleted manually. On the other hand, if we're "only" talking about transcriptions, then one could modify the script only for the particular field holding the transcription data, i.e. field is empty => no change in the database; field is non-empty => database will be updated.
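The transcription-only variant could look roughly like the following Python sketch, with an in-memory SQLite database standing in for MariaDB. The table name `Transcriptions`, the column `Phonetic`, and the key values are assumptions; the real change would be made in the SQL that the Praat script generates.

```python
import sqlite3

# Key columns as named earlier in the thread; table and "Phonetic" are hypothetical.
KEY = ["StudyIx", "FamilyIx", "IxElicitation", "IxMorphologicalInstance",
       "AlternativePhoneticRealisationIx", "AlternativeLexemIx", "LanguageIx"]
KEY_LIST = ", ".join(KEY)
KEY_MATCH = " AND ".join(f"{col} = ?" for col in KEY)


def make_db():
    """Create an in-memory table whose primary key is the seven key columns."""
    conn = sqlite3.connect(":memory:")
    key_defs = ", ".join(f"{col} INTEGER NOT NULL" for col in KEY)
    conn.execute(
        f"CREATE TABLE Transcriptions ({key_defs}, "
        f"Phonetic TEXT NOT NULL DEFAULT '', PRIMARY KEY ({KEY_LIST}))"
    )
    return conn


def apply_textgrid_row(conn, key, phonetic):
    """Create the row if missing; overwrite Phonetic only with non-empty text."""
    marks = ", ".join("?" * (len(KEY) + 1))
    conn.execute(
        f"INSERT OR IGNORE INTO Transcriptions ({KEY_LIST}, Phonetic) "
        f"VALUES ({marks})",
        (*key, phonetic),
    )
    if phonetic != "":
        # Field is non-empty => database will be updated.
        conn.execute(
            f"UPDATE Transcriptions SET Phonetic = ? WHERE {KEY_MATCH}",
            (phonetic, *key),
        )


def phonetic_for(conn, key):
    """Return the stored transcription for a key, or None if there is no row."""
    row = conn.execute(
        f"SELECT Phonetic FROM Transcriptions WHERE {KEY_MATCH}", key
    ).fetchone()
    return None if row is None else row[0]


conn = make_db()
word = (1, 2, 10, 0, 0, 0, 42)             # made-up key values
apply_textgrid_row(conn, word, "")         # first run: dummy record
conn.execute(f"UPDATE Transcriptions SET Phonetic = ? WHERE {KEY_MATCH}",
             ("ʃtaɪn", *word))             # transcription loaded from Excel
apply_textgrid_row(conn, word, "")         # empty field => no change
kept = phonetic_for(conn, word)
apply_textgrid_row(conn, word, "ʃtain")    # non-empty field => update
updated = phonetic_for(conn, word)
print(repr(kept), repr(updated))
```

As noted above, the cost of this rule is that an empty TextGrid field can no longer be used to clear a value in the database.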