xrotwang closed this issue 3 years ago
@kasyrj are you okay with this change? If so I can actually do the conversion.
@lmaurits I'm okay with the change.
The current version of the data contains citation keys that include non-ASCII characters (Ä,ä,ö). Is this a problem for BibTeX conversion (which at least at some point had problems in dealing with non-ASCII characters in its key strings)? Would it actually be best to convert these to ASCII-only keys in the raw data before producing the BibTeX?
@kasyrj I'd recommend sticking to pure-ASCII for BibTeX citation keys.
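For illustration, one way to derive ASCII-only keys from the existing ones is to decompose accented letters and drop the combining marks. This is only a sketch: `ascii_key` is a hypothetical helper name and the example key is made up, not taken from the data.

```python
import unicodedata


def ascii_key(key: str) -> str:
    """Return an ASCII-only version of a citation key (hypothetical helper)."""
    # NFKD splits e.g. "ä" into "a" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", key)
    # Drop the combining marks and any other non-ASCII character.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Made-up example key, not from the dataset.
print(ascii_key("Häkkinen2004"))
```

Letters without a decomposition (for example "ø") are dropped entirely by this approach, and two different keys can collapse to the same ASCII form, so the result should be checked for duplicates before it replaces the originals.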
Citation keys in the raw data are now restricted to ASCII characters. I also spotted some mistakes in the citation data, which I fixed.
Great, thanks! I threw together a hacky script last night to try to do the conversion, taking care of all the BibTeX pain (e.g. wrapping non-leading capitals in {}s). At some point fixing the remaining problems by hand will become quicker than perfecting the script, at which point I'll switch to that. I'll get the first pass at a conversion pushed to the repo this week, then ask Kaj to check over it and make sure I haven't mangled things too badly.
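The capital-wrapping step mentioned above could look roughly like the following. This is a minimal sketch, not the actual conversion script; `protect_capitals` is a hypothetical name and the title is a made-up example.

```python
def protect_capitals(title: str) -> str:
    """Wrap every uppercase letter except the first character in braces."""
    out = []
    for i, ch in enumerate(title):
        if i > 0 and ch.isupper():
            # Braces stop BibTeX styles from lowercasing this letter.
            out.append("{" + ch + "}")
        else:
            out.append(ch)
    return "".join(out)


# Made-up example title.
print(protect_capitals("The Uralic Languages"))
```

The sketch does not look at existing braces, so a title that already contains protected capitals would end up with doubled braces and would need handling by hand.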
My first conversion effort is now in the repo, but I may still do some minor tweaking either tonight or over the weekend.
@lmaurits what about turning this into a proper BibTeX file? The way it is encoded now would require inserting newlines somehow, to make it possible to feed the strings to a BibTeX parser, right?
Is the proposal to replace the existing .tsv with a .bib, or to maintain both in parallel?
Getting rid of the .tsv seems like a bad idea to me, as the Data.csv file has entries which are basically foreign keys into the first column of Citation_codes.tsv, and getting rid of it would prevent using csvtools or similar to operate on the .tsv files collectively as a database.
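To make the foreign-key relationship concrete, here is a rough sketch of a check that every citation key used in the data file also appears in the first column of the citation file. The column name `ref`, the assumption that the .tsv has a header row, and the sample rows are all hypothetical; the real files may be laid out differently.

```python
import csv
import io


def dangling_keys(data_file, citations_file, key_column):
    """Return citation keys used in the data but absent from the citation table."""
    citations = csv.reader(citations_file, delimiter="\t")
    next(citations, None)  # assumption: the first row is a header
    known = {row[0] for row in citations if row}
    data = csv.DictReader(data_file)
    used = {row[key_column] for row in data if row[key_column]}
    return sorted(used - known)


# Made-up sample content standing in for Data.csv and Citation_codes.tsv.
data = io.StringIO("lang,ref\nfin,Smith2000\nest,Jones1999\nhun,\n")
citations = io.StringIO("code\ttype\nSmith2000\tbook\n")
print(dangling_keys(data, citations, "ref"))
```

The same check would work against a .bib file if the set of known keys were read from its citation keys instead of from the first .tsv column.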
Having a separate .bib file duplicating the contents of the .tsv also seems like a bad idea, more specifically a maintenance nightmare. Unless there's a script for converting between the two, they will inevitably fall out of sync, and writing that script doesn't seem like a sensible use of anybody's time.
I admit the approach I've taken here is not perfect - as you note it will need a little bit of processing to actually use with a LaTeX toolchain. But it seemed like the best compromise to me (especially as *TeX usage is nowhere near as widespread in the linguistics world as elsewhere). I am very open to counter-arguments, though.
My proposal would be to replace the .tsv with a .bib. The primary keys would still be there, although now as BibTeX citation keys - but that's also how CLDF handles references, so there's some tool support for this setup.
The biggest advantage of this change, I think, is that the references could be managed with a proper reference management tool like JabRef, or other BibTeX-aware tools. And while I agree that *TeX usage in the linguistics community isn't widespread, I don't see any alternative being more widespread :) Presumably, >90% of these reference entries could be replaced with Glottolog ref IDs anyway, because they are already in Glottolog.
I consulted with the UraLex crew regarding this, and we're not ready to do away with the Citation_codes.tsv file entirely for the time being. This is partially due to compatibility issues (e.g. how to represent "language expert"-type reference entries in bibtex; how to represent the "type" field in Citation_codes.tsv in bibtex), but also because we're on a schedule when it comes to releasing the data.
I've now tested and revised Luke's bibtex conversion manually. The raw data folder now also contains a Citations.bib file, which contains all the bibtex entries from Citation_codes.tsv (but not the expert entries), as well as a rudimentary script with which the .bib file can be produced from the tsv whenever the raw data gets updated.
Hopefully this is sufficient.
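For readers who want a picture of what such a regeneration step involves, a rough sketch follows. It is not the script in the repository: the column name `bibtex`, the header row, and the sample rows are assumptions made for illustration, and rows without a BibTeX string (such as expert entries) are simply skipped.

```python
import csv
import io


def tsv_to_bib(tsv_file, bibtex_column="bibtex"):
    """Collect the BibTeX strings from a TSV file into .bib file content."""
    # QUOTE_NONE keeps quote characters inside BibTeX strings untouched.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        entry = (row.get(bibtex_column) or "").strip()
        if entry.startswith("@"):  # skip rows with no BibTeX entry
            entries.append(entry)
    # One entry per block, separated by blank lines.
    return "\n\n".join(entries) + "\n" if entries else ""


# Made-up sample rows, not from Citation_codes.tsv.
sample = (
    "code\ttype\tbibtex\n"
    "Smith2000\tbook\t@book{Smith2000, title = {An Example}}\n"
    "Expert1\texpert\t\n"
)
print(tsv_to_bib(io.StringIO(sample)), end="")
```

Writing the returned string to Citations.bib after each update of the raw data would keep the two files aligned, as long as the .tsv remains the only file edited by hand.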
It would be nice to have the references in Citation_codes.tsv available as BibTeX.