lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

`init_profile` for multiple languages #261

FredericBlum closed 1 year ago

FredericBlum commented 2 years ago

Hello everybody,

I am wondering right now how I would specify lexibank.init_profile to distinguish between the different languages in my dataset. I have seen two structures so far in Lexibank sets: a) Individual ortho-profiles for each language, or b) a column specifying language handles for which a certain mapping applies. Can I create either of these, but preferably a), using init_profile?

I saw there are some hints in the code, but it is not clear to me what args.context is about or how I would phrase the command-line call to use it.

Looking forward to hearing your opinions on this.

LinguList commented 2 years ago

Our current workflow is (we used this on the Andean data, if I remember properly):

  1. create a master profile
  2. create language-specific profiles with a custom script from this master profile

We can discuss adding a command to lexibank that automates step 2; I think we may even have an issue on this.

We could, of course, also discuss initiating individual profiles for individual languages by modifying the init-profile command.

In my experience, however, this two-step workflow is easier, especially when there are more languages in a dataset.

LinguList commented 2 years ago

@Tarotis, once you are ready with some data, please get back to me, and we will write this up for inclusion in pylexibank.

LinguList commented 2 years ago

I may use this to teach you how to write lexibank commands.

FredericBlum commented 1 year ago

@LinguList As I have a first set of items ready for all languages and am only filling gaps right now, we can start working on the orthography. Do you have any example cases (+ commands) where you used this workflow that I could adapt to blumpanotacana?

FredericBlum commented 1 year ago

@LinguList Creating language-specific profiles by adding a lexibank-command would be a great next task to get some code review and feedback on package development. Maybe we can set this up for the second week of March?

LinguList commented 1 year ago

Let me check for the code now.

LinguList commented 1 year ago

Let me first tell you the strategy (which is important):

  1. create a single profile for the whole dataset, make sure it works, but ignore borderline cases or the fact that you may have some sounds that are not perfectly rendered in one language
  2. create the CLDF
  3. the cldf contains both the graphemes and the tokens (check forms.csv), both being in 1-1-relation, so you just need to make a lookup-table for each language, where you only extract those parts that occur in that very language
  4. use that to write the profile for each language and write it to file in etc/orthographies/language_id.tsv

LinguList commented 1 year ago

In a package, you would access the cldf code, not any single profile, create the language-specific profiles from that and write them to the files in etc or maybe a user-specified folder.
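
A minimal sketch of what such a script could look like, not the LSI code itself. It assumes that forms.csv has the columns `Language_ID`, `Graphemes` and `Segments`, that both are space-separated, and that the profile files use the headers `Grapheme`, `IPA` and `Frequency`. The function names are made up for illustration. Rows where graphemes and segments do not line up 1-1 are simply skipped here, so a real script would need to handle those cases.

```python
import csv
from collections import Counter, defaultdict
from pathlib import Path


def collect_mappings(rows):
    """Return {language_id: {grapheme: Counter(segment)}} from forms.csv-like rows."""
    mappings = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        graphemes = row["Graphemes"].split()
        segments = row["Segments"].split()
        # Assumption: graphemes and segments align 1-1; skip rows that do not.
        if len(graphemes) != len(segments):
            continue
        for grapheme, segment in zip(graphemes, segments):
            mappings[row["Language_ID"]][grapheme][segment] += 1
    return mappings


def write_profiles(mappings, out_dir):
    """Write one tab-separated profile per language into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for language_id, table in mappings.items():
        path = out_dir / f"{language_id}.tsv"
        with path.open("w", encoding="utf-8", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow(["Grapheme", "IPA", "Frequency"])
            for grapheme in sorted(table):
                # If a grapheme maps to several segments, keep the most frequent.
                segment, frequency = table[grapheme].most_common(1)[0]
                writer.writerow([grapheme, segment, frequency])


def profiles_from_cldf(forms_csv, out_dir="etc/orthographies"):
    """Read a forms.csv file and write the language-specific profiles."""
    with open(forms_csv, encoding="utf-8", newline="") as handle:
        write_profiles(collect_mappings(csv.DictReader(handle)), out_dir)
```

Calling `profiles_from_cldf("cldf/forms.csv")` from the dataset folder would then write one file per language; the path is only an example.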

LinguList commented 1 year ago

I applied this code for the first time in the lsi-project. I also shared it with Sandra, who used it in her Mixtecan study (at least at some point; she never gave me feedback on that). I'd start from the code in LSI, but note that some aspects of the code are specific to the dataset, so you don't want to use them.

LinguList commented 1 year ago

If you would prefer, as an alternative, an init_profile function that runs init-profile and writes multi-language profiles, this can of course also be done, but it may result in extra work, I think. Yet, of course, you can now check profiles for individual languages, so it would not be too bad. Let me know what you prefer.

FredericBlum commented 1 year ago

Thanks, the script you provided worked perfectly, with only very minor modifications necessary.