lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

adding virtual column programmatically #252

Closed martino-vic closed 2 years ago

martino-vic commented 2 years ago

Hi,

I'm on Windows 10, using Python 3.10, and I'm trying to add a virtual column through my conversion script, since there is only one language in my data. The FAQ mentions that virtual columns can be added manually, but I'm wondering if it's possible to do so through the conversion script as well. I tried to change line 52 in different ways, which resulted in the following errors:

"args.writer.add_language(Language_ID="blob", virtual=True)" -> TypeError: Language.init() got an unexpected keyword argument 'Language_ID' "args.writer.add_language(ID="blob", virtual=True)" -> TypeError: Language.init() got an unexpected keyword argument 'virtual'

LinguList commented 2 years ago

Language_ID= is definitely wrong: the table we are talking about is the Language table, so the keyword must be ID here.

But the second error shows that virtual cannot be passed as a keyword argument, so there must be another way to manipulate the language table or to add the language.

LinguList commented 2 years ago

If you check this example here:

https://github.com/cldf-datasets/jipa/blob/b620d3a7eb5219a3aa3ca8fc92015c437fa54cb1/cldfbench_jipa.py#L206-L210

You can also define your language object as a Python dictionary (for one language only), and add the "virtual" keyword there (of course, you'd only write "LanguageTable").

LinguList commented 2 years ago

Sorry, this is wrong. You need to change the column.

LinguList commented 2 years ago

Sorry again: this is not even about the language table, but only about the FormTable, where you'd have to supply your Language_ID. So in this case, the form table has all kinds of columns, and the metadata will specify that the language ID is virtual and the same for all of your data.
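For illustration, a minimal sketch of what such a table description could look like, written as a Python dictionary that mirrors the JSON metadata. The file name, the column list, and the use of "valueUrl" to carry the constant are assumptions made for this example; "blob" is the language ID from the original post, and the "virtual" property itself comes from the CSVW specification.

```python
import json

# Hypothetical, abbreviated table description for a single-language FormTable.
form_table = {
    "url": "forms.csv",  # assumed file name
    "tableSchema": {
        "columns": [
            {"name": "ID"},
            {"name": "Parameter_ID"},
            {"name": "Form"},
            # A virtual column has no cells in the CSV file. CSVW requires
            # virtual columns to come after all non-virtual columns.
            {
                "name": "Language_ID",
                "virtual": True,
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference",
                "valueUrl": "blob",  # assumed: same value for every row
            },
        ]
    },
}

print(json.dumps(form_table, indent=2))
```

The CSV file would then contain only the ID, Parameter_ID and Form columns, and a reader would have to take the language ID from the metadata.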

LinguList commented 2 years ago

But there is a problem with the add_form or add_form_with_segments function, as it checks whether a Language_ID has been passed and throws an error if it is missing.

Since we are only talking about one column here, and a virtual column would require substantial workarounds (as far as I can see), I'd suggest just adding the Language_ID and not using the virtual column for the FormTable.
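A minimal sketch of the non-virtual approach, using only the standard library rather than pylexibank, so the column names, the ID scheme, and the sample forms are assumptions for illustration. The point is simply that the same language ID is written on every row; in a conversion script the same constant would be passed as Language_ID on each call that adds a form.

```python
import csv
import io

LANGUAGE_ID = "blob"  # the single language of the dataset (ID from the original post)

# Hypothetical raw data: (concept ID, form) pairs.
raw = [("hand", "aba"), ("water", "ubu")]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["ID", "Language_ID", "Parameter_ID", "Form"],
    lineterminator="\n",
)
writer.writeheader()
for i, (concept, form) in enumerate(raw, start=1):
    writer.writerow(
        {
            "ID": f"{LANGUAGE_ID}-{concept}-{i}",  # assumed ID scheme
            "Language_ID": LANGUAGE_ID,  # spelled out on every row
            "Parameter_ID": concept,
            "Form": form,
        }
    )

forms_csv = buffer.getvalue()
print(forms_csv)
```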

xrotwang commented 2 years ago

While the "virtual column" mechanism is kind of cool, it isn't too well supported by the CLDF toolset (e.g. not in pylexibank, hence this issue). In particular, I wouldn't rely on many tools being able to infer essential things like reference values (aka foreign keys) from virtual columns, which always needs an additional "resolve-column-value" step.
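To make that extra step concrete, here is a rough sketch of what a consumer would have to do before it can follow a foreign key stored in a virtual column. The function name resolve_virtual_columns is hypothetical and not part of any CLDF tool, and the sketch treats "valueUrl" as a plain constant, whereas CSVW defines it as a URI template that would need proper expansion.

```python
def resolve_virtual_columns(rows, columns):
    """Return copies of the rows with constant virtual column values filled in."""
    constants = {
        col["name"]: col["valueUrl"]
        for col in columns
        if col.get("virtual") and "valueUrl" in col
    }
    return [{**row, **constants} for row in rows]


# Hypothetical column descriptions and rows as read from metadata and CSV.
columns = [
    {"name": "ID"},
    {"name": "Form"},
    {"name": "Language_ID", "virtual": True, "valueUrl": "blob"},
]
rows = [{"ID": "1", "Form": "aba"}, {"ID": "2", "Form": "ubu"}]

resolved = resolve_virtual_columns(rows, columns)
print(resolved[0]["Language_ID"])
```

Any tool that reads the CSV file without this step sees no Language_ID at all.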

So my recommendation would be to actually add a non-virtual language ID column.

martino-vic commented 2 years ago

Okay, I see, thanks for the help! It isn't a big deal to have the column non-virtual after all. I just checked and it's actually only 28KB to spell out the language ID 3500 times in my data frame, so yeah. I just thought that the virtual column was the preferred way to handle this type of data, but if it's not, that's no problem at all. Thanks once more for the help, I appreciate it!

xrotwang commented 2 years ago

I think virtual columns are in the 20% of the CSVW spec that you shouldn't use, because they're not in the 80% of the spec that tools implement and then claim CSVW compliance :)

martino-vic commented 2 years ago

Alright, I wasn't aware of that. It's great to know this for future work as well 👍

xrotwang commented 2 years ago

Yeah, I think we should spell this out more explicitly somewhere. The way it is now, it's nothing you could "know" - but rather something I always do when using software/standards: Stick with the 80% functionality implementing the core use cases.