lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

lexibank.load fails for datasets with multiple SourceTables #224

Closed chrzyki closed 4 years ago

chrzyki commented 4 years ago

For example, https://github.com/lexibank/birchallchapacuran defines sources for both languages and forms. Calling cldfbench lexibank.load for a dataset like this results in:

  File "/venv/lib/python3.6/site-packages/pylexibank/db.py", line 369, in load
    t.name, col))
sqlite3.OperationalError: duplicate column name: Source

Is this because sources are supposed to be written only once per dataset in the Lexibank sqlite.db?

LinguList commented 4 years ago

This should already be caught at the level of lexibank.makecldf, as there are quite a few datasets where we have a Source for a language and then just use it as the source for the form. In fact, I reckon these make up 50% of all datasets with sources from multiple references.

xrotwang commented 4 years ago

I think this requires a bit more debugging. Which table is causing the problems? And which column exactly is about to be added? And does the db already hold other datasets? Supposedly, this line https://github.com/lexibank/pylexibank/blob/cd08b999271a58abc89d88e091b320418dffc9f4/src/pylexibank/db.py#L365 should guard against the issue. Why doesn't it?
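
One way to answer these questions is to look at the schema of the SQLite file directly. The following is only a minimal sketch using Python's standard sqlite3 module; the helper name columns_by_table and the table names in the demo are illustrative assumptions, not necessarily what pylexibank creates.

```python
import sqlite3


def columns_by_table(conn):
    """Map each table name in the database to its list of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {
        table: [row[1] for row in conn.execute('PRAGMA table_info("{}")'.format(table))]
        for table in tables
    }


# Demo on an in-memory database with hypothetical tables (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LanguageTable (ID TEXT, source TEXT)")
conn.execute("CREATE TABLE FormTable (ID TEXT, Source TEXT)")
schema = columns_by_table(conn)
for table, columns in schema.items():
    print(table, columns)
```

Pointing sqlite3.connect at the actual db file instead of ":memory:" would show which columns already exist before the next dataset is loaded.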

chrzyki commented 4 years ago

Can be reproduced with:

Then:

(lexibanksqlite) ~$ cldfbench lexibank.load _ --glottolog ~/Repositories/glottolog/glottolog --concepticon ~/Repositories/concepticon/concepticon-data/

abvd loads successfully, then:

Dataset "birchallchapacuran" at .virtualenvs/lexibanksqlite/src/lexibank-birchallchapacuran
Traceback (most recent call last):
  File ".virtualenvs/lexibanksqlite/bin/cldfbench", line 8, in <module>
    sys.exit(main())
  File ".virtualenvs/lexibanksqlite/lib/python3.8/site-packages/cldfbench/__main__.py", line 78, in main
    return args.main(args) or 0
  File ".virtualenvs/lexibanksqlite/lib/python3.8/site-packages/pylexibank/commands/load.py", line 17, in run
    with_datasets(args, db.load)
  File ".virtualenvs/lexibanksqlite/lib/python3.8/site-packages/cldfbench/cli_util.py", line 90, in with_datasets
    res.append(with_dataset(args, func, dataset=ds))
  File ".virtualenvs/lexibanksqlite/lib/python3.8/site-packages/cldfbench/cli_util.py", line 82, in with_dataset
    res = func(*arg, args)
  File ".virtualenvs/lexibanksqlite/lib/python3.8/site-packages/pylexibank/db.py", line 367, in load
    conn.execute(
sqlite3.OperationalError: duplicate column name: Source

Could the lowercase source column in abvd's LanguageTable be the problem?

chrzyki commented 4 years ago

Changing from source to Source in abvd's LanguageTable fixes this for me.

SimonGreenhill commented 4 years ago

Hmm, the abvd provider has lots of lowercase fields, so I suspect there might be more clashes.

xrotwang commented 4 years ago

So - considering that column names in SQL (and in SQLite in particular) are case insensitive - the check in https://github.com/lexibank/pylexibank/blob/cd08b999271a58abc89d88e091b320418dffc9f4/src/pylexibank/db.py#L365 should be case insensitive, too.
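
To illustrate the point, here is a minimal sketch with Python's standard sqlite3 module. It is not the actual pylexibank code: the helper add_column_if_missing and the table layout are assumptions made for the example. It first triggers the same error by adding Source to a table that already has source, then shows a guard that compares names case-insensitively.

```python
import sqlite3


def add_column_if_missing(conn, table, column, sql_type="TEXT"):
    """Add `column` to `table` unless a column with the same name,
    compared case-insensitively, already exists. Returns True if added."""
    existing = {
        row[1].lower()
        for row in conn.execute('PRAGMA table_info("{}")'.format(table))
    }
    if column.lower() in existing:
        return False
    conn.execute('ALTER TABLE "{}" ADD COLUMN "{}" {}'.format(table, column, sql_type))
    return True


# Hypothetical table with a lowercase column, as in abvd's LanguageTable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LanguageTable (ID TEXT, source TEXT)")

# A case-sensitive check would treat "Source" as new and run this statement,
# which SQLite rejects because it matches the existing "source" column.
try:
    conn.execute('ALTER TABLE LanguageTable ADD COLUMN "Source" TEXT')
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)
print(error)

# The case-insensitive guard skips the clashing column and adds a new one.
skipped = add_column_if_missing(conn, "LanguageTable", "Source")
added = add_column_if_missing(conn, "LanguageTable", "Family")
print(skipped, added)
```

Lowercasing both sides is enough for ASCII column names; whether skipping the column is the right behaviour, rather than mapping the new data onto the existing column, is a separate design decision.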