lexibank / uralex

UraLex basic vocabulary dataset
Creative Commons Attribution 4.0 International

Convert Citation_codes to BibTeX #1

Closed: xrotwang closed this issue 3 years ago

xrotwang commented 6 years ago

It would be nice to have the references in Citation_codes.tsv available as BibTeX.

lmaurits commented 6 years ago

@kasyrj are you okay with this change? If so I can actually do the conversion.

kasyrj commented 6 years ago

@lmaurits I'm okay with the change.

kasyrj commented 6 years ago

The current version of the data contains citation keys that include non-ASCII characters (Ä, ä, ö). Is this a problem for the BibTeX conversion? At least at some point, BibTeX had problems dealing with non-ASCII characters in its key strings. Would it be best to convert these to ASCII-only keys in the raw data before producing the BibTeX?

xrotwang commented 6 years ago

@kasyrj I'd recommend sticking to pure-ASCII for BibTeX citation keys.
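As an illustration of the ASCII-only recommendation, here is a minimal Python sketch that strips diacritics from citation keys and reports keys that would collide afterwards. The function names `ascii_key` and `find_collisions` and the example keys are hypothetical; nothing here is taken from the UraLex repository, and the thread does not say how the keys were actually converted.

```python
import unicodedata


def ascii_key(key):
    """Return an ASCII-only version of a citation key (illustrative helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", key)
    # Encoding with errors="ignore" drops the combining marks. Characters
    # without an ASCII base letter are dropped entirely, so review the output.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def find_collisions(keys):
    """Map each ASCII key to the original keys that collapse onto it.

    Only ASCII keys shared by more than one original key are returned.
    """
    seen = {}
    for key in keys:
        seen.setdefault(ascii_key(key), []).append(key)
    return {k: v for k, v in seen.items() if len(v) > 1}
```

Checking for collisions matters because the keys act as identifiers: two distinct keys that differ only in a diacritic would become the same key after conversion.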

kasyrj commented 6 years ago

Citation keys in the raw data are now restricted to ASCII characters. I also spotted some mistakes in the citation data, which I fixed.

lmaurits commented 6 years ago

Great, thanks! I threw together a hacky script last night to try to do the conversion, taking care of all the BibTeX pain (e.g. wrapping non-leading capitals in {}s). At some point fixing the remaining problems by hand will become quicker than perfecting the script, at which point I'll switch to that. I'll get the first pass at a conversion pushed to the repo this week, then ask Kaj to check over it and make sure I haven't mangled things too badly.
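The "wrapping non-leading capitals in {}s" step can be sketched roughly as follows. This is not the script mentioned above, only a minimal illustration; the function name `protect_capitals` and the example title are hypothetical. Many BibTeX styles lowercase titles apart from the first letter, and braces stop them from doing so.

```python
def protect_capitals(title):
    """Wrap every uppercase letter except the first character in braces."""
    out = []
    for i, ch in enumerate(title):
        # str.isupper() also covers non-ASCII capitals such as "Ä".
        if i > 0 and ch.isupper():
            out.append("{" + ch + "}")
        else:
            out.append(ch)
    return "".join(out)


# Example with a made-up title:
print(protect_capitals("A grammar of Finnish and Estonian"))
# prints: A grammar of {F}innish and {E}stonian
```

The sketch does not check whether a capital is already inside braces, so running it twice on the same string would double-wrap letters.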

lmaurits commented 6 years ago

My first conversion effort is now in the repo, but I may still do some minor tweaking either tonight or over the weekend.

xrotwang commented 6 years ago

@lmaurits what about turning this into a proper BibTeX file? The way it is encoded now would require inserting newlines somehow, to make it possible to feed the strings to a BibTeX parser, right?

lmaurits commented 6 years ago

Is the proposal to replace the existing .tsv with a .bib, or to maintain both in parallel?

Getting rid of the .tsv seems like a bad idea to me, as the Data.csv file has entries which are basically foreign keys into the first column of Citation_codes.tsv, and getting rid of it would prevent using csvtools or similar to operate on the .tsv files collectively as a database.
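To make the foreign-key relationship concrete, here is a minimal Python sketch that lists keys used in the data file but missing from the first column of the citation table. The column name passed as `ref_column` is hypothetical, and the sketch assumes that both files have a header row, that the data file is comma-separated, and that each cell holds at most one key. None of this is confirmed by the thread.

```python
import csv


def missing_citation_keys(data_file, citations_file, ref_column):
    """Return keys in data_file[ref_column] that are not citation keys.

    Both arguments are open text files (or file-like objects).
    """
    reader = csv.reader(citations_file, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to exist)
    # The first column of the citation table holds the primary keys.
    known = {row[0] for row in reader if row}

    used = set()
    for row in csv.DictReader(data_file):
        value = (row.get(ref_column) or "").strip()
        if value:  # rows without a reference are ignored
            used.add(value)
    return used - known
```

With real files this would be called along the lines of `missing_citation_keys(open("Data.csv", newline=""), open("Citation_codes.tsv", newline=""), "ref")`, where `"ref"` stands in for whatever the reference column is actually called.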

Having a separate .bib file duplicating the contents of the .tsv also seems like a bad idea, more specifically a maintenance nightmare. Unless there's a script for converting between the two, they will inevitably fall out of synch, and writing that script doesn't seem like a sensible use of anybody's time.

I admit the approach I've taken here is not perfect - as you note it will need a little bit of processing to actually use with a LaTeX toolchain. But it seemed like the best compromise to me (especially as *TeX usage is nowhere near as widespread in the linguistics world as elsewhere). I am very open to counter-arguments, though.

xrotwang commented 6 years ago

My proposal would be replacing the .tsv with .bib. The primary keys would still be there, although now as BibTeX citation keys - but that's also how CLDF handles references, so there's some tool support for this setup.

The biggest advantage of this change, I think, is that the references could be managed with a proper reference management tool like JabRef, or other BibTeX-aware tools. And while I agree that *TeX usage in the linguistics community isn't widespread, I don't see any alternative being more widespread :) Presumably, >90% of these reference entries could be replaced with Glottolog ref IDs anyway, because they are already in Glottolog.

kasyrj commented 6 years ago

I consulted with the UraLex crew regarding this, and for the time being we're not ready to do away with the Citation_codes.tsv file entirely. This is partially due to compatibility issues (e.g. how to represent "language expert"-type reference entries in BibTeX; how to represent the "type" field in Citation_codes.tsv in BibTeX), but also because we're on a schedule when it comes to releasing the data.

I've now tested and manually revised Luke's BibTeX conversion. The raw data folder now also contains a Citations.bib file, which contains all the BibTeX entries from Citation_codes.tsv (but not the expert entries), as well as a rudimentary script with which the .bib file can be regenerated from the .tsv whenever the raw data gets updated.
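For readers wondering what such a .tsv-to-.bib step might look like, here is a minimal Python sketch. It is not the script in the repository. It assumes that the .tsv has a header row and that one column, whose name is passed as the hypothetical `bibtex_column` argument, holds a complete single-line BibTeX entry for each non-expert reference and is empty for expert entries.

```python
import csv


def tsv_to_bib(tsv_file, bibtex_column):
    """Collect the BibTeX entries from a TSV file into one .bib string."""
    entries = []
    # QUOTE_NONE keeps quote characters inside BibTeX entries literal.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        entry = (row.get(bibtex_column) or "").strip()
        # Rows without an entry (e.g. expert entries) are skipped.
        if entry.startswith("@"):
            entries.append(entry)
    # One entry per block, separated by blank lines.
    return "\n\n".join(entries) + ("\n" if entries else "")
```

Generating the .bib from the .tsv in one direction only keeps the .tsv as the single source of truth, which addresses the concern raised earlier about the two files drifting apart.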

Hopefully this is sufficient.