lexibank / abvd

CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020.
https://abvd.eva.mpg.de
Creative Commons Attribution 4.0 International
2 stars 2 forks source link

makecldf fails #3

Closed SimonGreenhill closed 4 years ago

SimonGreenhill commented 4 years ago

... on this entry: word 57 = "?" here.

...which means that the following is passed to add_form:

{'Language_ID': '661', 'Parameter_ID': '57', 'Value': '?', 'Source': ['15258'], 'Cognacy': None, 'Comment': "er-jai 'be married (of woman)' p.78", 'Loan': False, 'Local_ID': '173141', 'Form': None}

... and then we fail with

Traceback (most recent call last):
  File "/Users/simon/projects/lexibank2018/env/bin/lexibank", line 11, in <module>
    load_entry_point('pylexibank', 'console_scripts', 'lexibank')()
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/__main__.py", line 139, in main
    sys.exit(parser.main())
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 110, in main
    catch_all=catch_all, parsed_args=args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 82, in main
    self.commands[args.command](args)
  File "/Users/simon/projects/lexibank2018/env/lib/python3.7/site-packages/clldutils-2.8.0-py3.7.egg/clldutils/clilib.py", line 35, in __call__
    return self.func(args)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/misc.py", line 149, in makecldf
    with_dataset(args, Dataset._install)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/commands/util.py", line 28, in with_dataset
    func(get_dataset(args, dataset.id), **vars(args))
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/dataset.py", line 437, in _install
    if self.cmd_install(**kw) == NOOP:
  File "/Users/simon/projects/lexibank2018/abvd/lexibank_abvd.py", line 39, in cmd_install
    source=[b for b in bibs if b.id in refs.get(wl.id, [])]
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/providers/abvd.py", line 231, in to_cldf
    Local_ID=entry.id,
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 222, in add_lexemes
    lexemes = self.add_forms_from_value(split_value=split_value, **kw)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 208, in add_forms_from_value
    kw_ = self.add_form(with_morphemes=with_morphemes, **kw_)
  File "/Users/simon/projects/lexibank2018/pylexibank/src/pylexibank/cldf.py", line 157, in add_form
    raise ValueError('language, concept, value, and form '
ValueError: language, concept, value, and form must be supplied

What's the best way to fix this? Should add_form catch this? or should this be caught before getting to add_form?

LinguList commented 4 years ago

This is the new behavior!

LinguList commented 4 years ago

It is there to catch explicitly these errors.

LinguList commented 4 years ago

Go to line 222 in pylexibank/providers/abvd.py and change add_lexemes to add_forms_from_value.

First fix, as add_lexemes is deprecated.

LinguList commented 4 years ago

Ah, and if you think that the ? is "normal" behavior, you can fix this also from within there, and add a line that checks if this form actually is a question makr or not. I.e., you check entry.name and see if it is a valid entry.

LinguList commented 4 years ago

Ah, I get it now, sorry for not reading properly before. There is something wrong, I guess, as the entry should not pass the clean_form command, which is called after splitting it, and yields an empty form, which is then NOT passed to add_form. And clean_form reacts per default specifically on ? as a character here.

LinguList commented 4 years ago

So the fix we need is in pylexibank code, dataset.py:

    def split_forms(self, item, value):
        if value in self.lexemes:  # pragma: no cover
            self.log.debug('overriding via lexemes.csv: %r -> %r' % (value, self.lexemes[value]))
        value = self.lexemes.get(value, value)
        return [self.clean_form(item, form)
                for form in split_text_with_context(value, separators='/,;')]

needs to be modified to:

    def split_forms(self, item, value):
        if value in self.lexemes:  # pragma: no cover
            self.log.debug('overriding via lexemes.csv: %r -> %r' % (value, self.lexemes[value]))
        value = self.lexemes.get(value, value)
        forms = [self.clean_form(item, form)
                for form in split_text_with_context(value, separators='/,;')]
        return [f for f in form if f]

Or similar. As this will only return forms that are not None, and this is crashing the code by now.

LinguList commented 4 years ago

I'd say, this is a good example, why it was good to modify the behavior of the add_lexemes to being more transparent.

LinguList commented 4 years ago

Just linked this as a bug in pylexibank.

SimonGreenhill commented 4 years ago

Fixed.