biblatex support & other remarks

njbart commented 10 years ago

Brilliant, thank you so much. For the most part, AnyStyle already works extremely well compared with any other parser I tried so far.

Just a few observations and remarks:

"45(3):23-45" is not separated, and parsed as "volume={3}" and "issue={23-45}"
Sometimes, "ed." is chopped off of names, e.g. "Alfred" -> "Alfr"
Frequently, surrounding quotes and final commas are not removed from titles.
Frequently, periods are chopped off of initials, e.g. "author = {Doe, John R}"
Names containing initials without periods are inverted, e.g. "author = {JR, Doe}"
"Transl." is not removed from a translator's name.
"Accessed" is not removed from an "Accessed" field (also, this field should be named "Urldate" instead, at least for biblatex, see below)
The (field) labels being used seem to be based mostly on CSL variables or Zotero field names. I'd suggest introducing a few more, e.g.,
- "Series Title" = CSL "collection-title",
- "Series Number" = CSL "collection-number",
- "Book Author" = CSL "container-author",
- "No of Volumes" = CSL "number-of-volumes",
- "Report number" = CSL "number", and
- CSL "original-date";
- maybe also the (hopefully) soon-to-be-introduced "Volume Title" = CSL "volume-title",
- maybe an option to label authors as corporate authors.
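
For the first observation, here is a minimal sketch of how a locator such as "45(3):23-45" could be split, assuming the intended reading is volume 45, issue 3, pages 23-45. It is Python for illustration only and not AnyStyle's actual code; the `LOCATOR` pattern and the `split_locator` name are hypothetical.

```python
import re

# Hypothetical pattern: volume, optional issue in parentheses,
# optional page or page range after a colon.
LOCATOR = re.compile(
    r"^\s*(?P<volume>\d+)\s*"
    r"(?:\((?P<issue>[^)]+)\))?\s*"
    r"(?::\s*(?P<pages>\d+(?:\s*[-\u2013]\s*\d+)?))?\s*$"
)


def split_locator(text):
    """Return a dict of the parts that were found, or None if nothing matches."""
    match = LOCATOR.match(text)
    if match is None:
        return None
    # Drop the optional groups that did not take part in the match.
    return {name: value for name, value in match.groupdict().items() if value is not None}
```

Calling `split_locator("45(3):23-45")` yields separate volume, issue, and pages entries; a string that does not look like a locator gives `None`, so the caller can fall back to whatever the parser produced.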

I'd also suggest adding biblatex as a separate output format. While some might want to continue using classical bibtex, I see huge advantages in using biblatex and its much more comprehensive data model, and it'd be nice if AnyStyle could output the biblatex format directly. The most important differences between bibtex and biblatex include:

Improved handling of dates:
- "date = {YYYY-MM-DD}" instead of separate year, month, day fields
- date ranges in the format "date = {YYYY-MM-DD/YYYY-MM-DD}"
- "urldate" (instead of "accessed"; also I'm not sure whether there's any bibtex variant that would accept "accessed")
- "origdate"
An "online" entry type
An "institution" field for report and thesis entries, and an "organization" field for manual and online entry types ("authority", again, is not recognized by biblatex, and probably by no other bibtex variant either)
An "incollection" entry type (AnyStyle uses this, though output for classical bibtex should probably return to "inbook" here)
A "maintitle" field, which will have to be used if AnyStyle starts using CSL "volume-title" (mapping is a bit complicated here, but I'd be happy to help here, as with all biblatex questions).

inukshuk commented 10 years ago

Thanks for this!

I opened a few separate issues so that we will not forget about them and use this thread for the biblatex-related points. I would not mind changing the bibtex export to biblatex for good. We could still retain fields like year for backwards compatibility – do you see any problems with this?

A few remarks regarding the 'authority/institution/organization' field: I would prefer to retain just a single label for these, because otherwise it is very difficult to keep the training data consistent – that's why we only use 'authority' at the moment. Having said that, it should be possible to apply further differentiation after the tagging process. For instance: we could change 'authority' to 'institution' if the type classifier thinks the item is a thesis or report; and change it to 'organization' for manual or online.
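
A rough sketch of that post-tagging step might look like the following. It is Python for illustration rather than AnyStyle's own code, and it assumes each parsed item is a plain dict with a `type` key; `AUTHORITY_FIELD_BY_TYPE` and `rename_authority` are hypothetical names.

```python
# Assumed mapping from entry type to the field name biblatex expects.
AUTHORITY_FIELD_BY_TYPE = {
    "thesis": "institution",
    "report": "institution",
    "manual": "organization",
    "online": "organization",
}


def rename_authority(entry):
    """Return a copy of entry with 'authority' renamed according to its type."""
    result = dict(entry)
    if "authority" in result:
        target = AUTHORITY_FIELD_BY_TYPE.get(result.get("type"))
        if target is not None:
            result[target] = result.pop("authority")
    return result
```

Items whose type is not in the mapping keep the 'authority' label untouched, so the training data can stay on a single label while only the export differs.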

I have always been confused about the 'inbook' 'incollection' distinction. If we mainly want to support biblatex is it OK to just stick with 'incollection'?

Regarding the 'online' entry type: any pointers on how we could detect those? The classifier runs at the very end of the parse process and makes use of the labelled and normalized data. That is, we could say that: if there is an 'accessed/urldate' field the type should be 'online'.

njbart commented 10 years ago

I would not mind changing the bibtex export to biblatex for good. We could still retain fields like year for backwards compatibility – do you see any problems with this?

I would not mind either. However, users of classical bibtex might be disappointed since complete backwards compatibility will not be possible:

Classical bibtex uses, e.g., year = {2014}, month=mar#"15". While biblatex will parse year = {2014}, month = mar, it won't parse the day in year = {2014}, month = mar, day={1}. bibtex on the other hand won't understand year = {2014}, month = {03}, day = {24}. So there's nothing that will work for both bibtex and biblatex.

In biblatex, using the date field exclusively for any kind of date, be it year, year/month, year/month/day as well as date ranges is by far the easiest solution, and that's what I'd prefer AnyStyle to be doing, too. bibtex, of course, cannot use this either, nor does it understand biblatex’s date ranges.
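
To make the single-field approach concrete, here is a small sketch of composing a biblatex-style date value from separate parts. It is illustrative Python, not part of AnyStyle; the function names are hypothetical, and it assumes the month and day have already been normalized to numbers.

```python
def biblatex_date(year, month=None, day=None):
    """Format date parts as YYYY, YYYY-MM, or YYYY-MM-DD."""
    parts = [f"{int(year):04d}"]
    if month is not None:
        parts.append(f"{int(month):02d}")
        # A day is only meaningful when the month is known.
        if day is not None:
            parts.append(f"{int(day):02d}")
    return "-".join(parts)


def biblatex_date_range(start, end):
    """Join two formatted dates with a slash to form a date range."""
    return f"{start}/{end}"
```

For example, `biblatex_date(2014, 3, 24)` gives `2014-03-24`, and passing two such values to `biblatex_date_range` gives the `YYYY-MM-DD/YYYY-MM-DD` form.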

Also, bibtex will not understand the "online" entry type, and quite a few other details.

Hence I am afraid two different export modules are needed if you want to support biblatex and bibtex equally well.

Still, my vote is yes to using biblatex only. In case there is strong enough demand, I guess you could always add a dedicated export function for classical bibtex.

A few remarks regarding the 'authority/institution/organization' field: I would prefer to retain just a single label for these, because otherwise it is very difficult to keep the training data consistent – that's why we only use 'authority' at the moment. Having said that, it should be possible to apply further differentiation after the tagging process. For instance: we could change 'authority' to 'institution' if the type classifier thinks the item is a thesis or report; and change it to 'organization' for manual or online.

Fine. My point was only that "authority" is not a biblatex field, and not a field in any bibtex variant I am aware of either, so it should not be exported under that field name.

I have always been confused about the 'inbook' 'incollection' distinction. If we mainly want to support biblatex is it OK to just stick with 'incollection'?

Sorry, got confused, too. bibtex does have both "inbook" and "incollection"; it only uses them somewhat differently from biblatex. "inbook" in fact is the one that does not have a "booktitle" in bibtex and thus is rarely, if ever, used. Which means that "incollection" is usually fine as far as bibtex is concerned.

In biblatex, "incollection" is for a chapter in an edited volume, while "inbook" is for a chapter in an authored volume (think "Hamlet" in Shakespeare, Collected Works), though chapter author and book author do not have to be the same (think Steve Scholar, "Introduction", in Shakespeare, Collected Works). The presence of both author and bookauthor, with or without editor; or title and booktitle without editor would indicate an "inbook" type (as would the combo bookauthor/editor/title/booktitle; probably the only way to assign AnyStyle labels unambiguously to an inbook where author=bookauthor appears only once, but there's also an editor.)
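
Spelled out as code, the field combinations above could be checked roughly like this. This is an illustrative Python sketch, not AnyStyle's classifier; it assumes the item is already known to be a chapter-like entry held in a dict, and falling back to "incollection" is my assumption. The name `chapter_type` is hypothetical.

```python
def chapter_type(fields):
    """Guess 'inbook' or 'incollection' from which fields are present."""
    present = {name for name, value in fields.items() if value}
    # Both author and bookauthor, with or without an editor.
    if {"author", "bookauthor"} <= present:
        return "inbook"
    # Title and booktitle without an editor.
    if {"title", "booktitle"} <= present and "editor" not in present:
        return "inbook"
    # The bookauthor/editor/title/booktitle combination.
    if {"bookauthor", "editor", "title", "booktitle"} <= present:
        return "inbook"
    return "incollection"
```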

Regarding the 'online' entry type: any pointers on how we could detect those? The classifier runs at the very end of the parse process and makes use of the labelled and normalized data. That is, we could say that: if there is an 'accessed/urldate' field the type should be 'online'.

All other entry types might contain "url" and "accessed/urldate", too. But I guess anything that is currently identified as "misc" while also containing a "url" field could rather safely be assumed to be an "online" entry.
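
As a sketch of that rule, again in illustrative Python with a hypothetical function name and assuming items are plain dicts with a `type` key:

```python
def refine_online(entry):
    """Return a copy of entry, retyped as 'online' if it is 'misc' with a url."""
    result = dict(entry)
    if result.get("type") == "misc" and result.get("url"):
        result["type"] = "online"
    return result
```

Entries of any other type keep their type even when they carry a "url" or an access date.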

njbart commented 10 years ago

Update on book/collection: biblatex has both, bibtex has book only. Hence, anything that looks like a book but has an editor and no author should be exported as a biblatex "collection" entry. (CSL, like bibtex, does not make that distinction.)
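
In code, that export rule might be as small as the following. It is illustrative Python only; `book_type` is a hypothetical name, and the sketch assumes the item has already been recognized as book-like.

```python
def book_type(fields):
    """Return 'collection' for an edited volume without an author, else 'book'."""
    if fields.get("editor") and not fields.get("author"):
        return "collection"
    return "book"
```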

inukshuk commented 10 years ago

Just a quick update on this: I've addressed many of the points now and pushed those changes to anystyle.io as well. We have now collected a lot of data from users which will have to be reviewed – when we do that, we can add new labels as well (like original date).

inukshuk / anystyle

biblatex support & other remarks #8