Closed gmantele closed 1 day ago
@gmantele, we would love to do this! We need to have the author list broken apart and the author names broken into family name and given name in order to mint DOIs (see #299).
We thought about parsing the author lists that people traditionally enter into the Circular bodies, but we have hesitated to do that due to the subtleties of names across cultures (e.g., multiple last names, last names with prepositions, etc.).
The alternative is to allow users to enter that data themselves when they are composing their Circulars, perhaps by allowing them to select other GCN users as co-authors, and perhaps with some auto-completion.
So... this seems complicated. Do you have any ideas?
Thanks for your quick answer.
We are doing this automatic parsing of the author lists, and I can confirm that it is not trivial. We generally do some parsing, and then the names are always checked (and fixed if necessary) by humans. This is clearly not reliable. Hence this GitHub Issue.
I have never submitted a Circular, so I kind of hoped that you collected the author names in some kind of HTML form and then formatted the Circular body the way it is now. But apparently you do not; I assume the Circular submission form is a free text area in which authors write the author list and the actual body the way they appear now.
My idea for an improvement would then be to update the submission form so that authors can be entered individually (with or without separate fields for the first name and last name). But I also assume the HTML form has been designed the way it currently is on purpose, hasn't it? If so, this idea would probably not be a valid option for you.
Currently, the form has a subject field and a body field, and that's it. By convention, the author list is the first paragraph in the body. That's because originally GCN Circulars were just emails. We don't have to be limited by that now, but as you can imagine we want to keep the data entry burden as low as possible for users given that GCN Circulars are supposed to be rapid communications! So we don't want to design a user interface that requires a user to manually enter the first name, last name, and affiliation of every single author into a table. And we also want to integrate with ORCID. This is mostly a UX design problem.
That's indeed what I assumed and it makes sense.
Then, unfortunately I don't have a good solution for the moment :(
How difficult would it be to change only the request that the authors be written like so:
first_name: John last_name: Doe affiliation: ABCD, first_name: Jane last_name: Doe affiliation: EFG
that would at least give you something to dependably pattern match. It IS a request on the user to type additional words, but in the same way they do currently in the first paragraph. Then we could parse it on the back end as we create it and add an array of formatted authors to the circular if the pattern matcher gets any matches. Basically your circular would only be cite-able if you conform to the standard properly.
I do not know if the draw of being able to be cited would encourage people to use the proper pattern, but it might be worth the tradeoff. And it's still pattern matching and parsing, which isn't ideal since people would have to be precise and we are relying on humans to do so. Could the value of citation and ease of processing be enough to encourage conforming to a standard if widely publicized enough?
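To make the back-end side of this concrete, here is a minimal sketch of such a pattern matcher in Python. It assumes the labeled syntax exactly as written in the example above; the names `AUTHOR_RE` and `parse_labeled_authors` are hypothetical and are not part of the GCN code base. An empty result would mean the Circular does not conform and so gets no formatted author array.

```python
import re

# Hypothetical pattern for the labeled convention proposed above:
#   first_name: John last_name: Doe affiliation: ABCD, first_name: Jane ...
# An author entry ends at ", first_name:" or at the end of the paragraph,
# so commas inside an affiliation do not split an entry.
AUTHOR_RE = re.compile(
    r"first_name:\s*(?P<first_name>.+?)\s+"
    r"last_name:\s*(?P<last_name>.+?)\s+"
    r"affiliation:\s*(?P<affiliation>.+?)\s*"
    r"(?=,\s*first_name:|$)",
    re.DOTALL,
)


def parse_labeled_authors(paragraph):
    """Return one dict per matched author; an empty list means no match."""
    return [match.groupdict() for match in AUTHOR_RE.finditer(paragraph)]


example = (
    "first_name: John last_name: Doe affiliation: ABCD, "
    "first_name: Jane last_name: Doe affiliation: EFG"
)
authors = parse_labeled_authors(example)
```

Because the labels delimit each field, multi-word names such as "van der Berg" need no special handling, but any typo in a label makes that author silently disappear from the result, which is the precision problem mentioned above.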
It looks like a good idea to me. And the syntax could be even simpler: a comma to separate first and last name, affiliations in parentheses (separated by a semicolon or comma if there are several), and finally a semicolon to separate authors:
John, Doe (ABCD) ; Jane, Doe (EFG)
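A parser for this compact syntax could be sketched as follows. This is only an illustration under assumptions: the function and field names are hypothetical, the two name parts are taken in the order shown in the example (first name, then last name), and affiliations are split on both commas and semicolons, so an affiliation that itself contains a comma would be split incorrectly.

```python
import re

# Split on semicolons that are NOT inside parentheses, since a semicolon
# may also separate several affiliations of one author.
AUTHOR_SEPARATOR = re.compile(r";(?![^()]*\))")

# One entry: "<first name>, <last name> (<affiliations>)", affiliations optional.
ENTRY_RE = re.compile(
    r"^\s*(?P<given_name>[^,()]+?)\s*,\s*(?P<family_name>[^,()]+?)\s*"
    r"(?:\((?P<affiliations>[^()]*)\))?\s*$"
)


def parse_compact_authors(paragraph):
    """Parse 'John, Doe (ABCD) ; Jane, Doe (EFG)' into a list of dicts."""
    authors = []
    for entry in AUTHOR_SEPARATOR.split(paragraph):
        if not entry.strip():
            continue  # tolerate a trailing semicolon
        match = ENTRY_RE.match(entry)
        if match is None:
            raise ValueError(f"entry does not follow the syntax: {entry!r}")
        raw = match.group("affiliations") or ""
        authors.append({
            "given_name": match.group("given_name"),
            "family_name": match.group("family_name"),
            "affiliations": [a.strip() for a in re.split(r"[;,]", raw) if a.strip()],
        })
    return authors


parsed = parse_compact_authors("John, Doe (ABCD) ; Jane, Doe (EFG)")
```

Raising on a malformed entry, rather than skipping it, is a design choice for the sketch: it makes it obvious at submission time that the author list would not be citable.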
It is still structured, while being simple to write and human-readable. However, one probably has to think about some corner cases:
Duplicate of #893.
I met @Courey during ADASS XXXIII (Tucson, Arizona), who suggested that I submit a GitHub issue with the following idea for improving the JSON format.
The current JSON serialization of GCN circulars is pretty basic and is really close to the email/text serialization. It makes it quite hard to automatically parse, especially for the authors list. All the authors are given in the first paragraph(s), with two different syntaxes:
Let's take the example of the circular 35197:
It would be really great if authors could be split into individual items and put in a separate field of type array. This improved JSON format could look like the following:
One could also dream of improving each author item by splitting the first name and last name, and why not by distinguishing between an individual author and a collaboration:
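As a purely illustrative sketch of such a structure (not the actual GCN schema), the Python below builds a Circular record with a separate `authors` array. Every field name here (`authors`, `type`, `given_name`, `family_name`, `name`) and the helper functions are assumptions made for the example, and the author values are placeholders rather than data from a real Circular.

```python
import json


def person(given_name, family_name):
    """Hypothetical author item for an individual."""
    return {"type": "person", "given_name": given_name, "family_name": family_name}


def collaboration(name):
    """Hypothetical author item for a collaboration or team."""
    return {"type": "collaboration", "name": name}


def with_authors(circular, authors):
    """Return a copy of a Circular record with an added 'authors' array."""
    enriched = dict(circular)
    enriched["authors"] = list(authors)
    return enriched


# Placeholder record: only fields discussed in this thread are shown.
circular = {"subject": "Example subject", "body": "Example body text."}
enriched = with_authors(
    circular,
    [person("John", "Doe"), person("Jane", "Doe"), collaboration("Example Collaboration")],
)
serialized = json.dumps(enriched, indent=2)
```

A `type` discriminator lets a consumer handle collaborations without guessing whether a string is a person's name, which is the distinction suggested above.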
Maybe the submitter field could also be improved the same way.