Array of authors in JSON serialization of circulars

gmantele commented 1 year ago

I met @Courey during ADASS XXXIII (Tucson, Arizona), who suggested that I submit a GitHub issue with the following idea for improving the JSON format.

The current JSON serialization of GCN circulars is pretty basic and is really close to the email/text serialization. This makes it quite hard to parse automatically, especially for the author list. All the authors are given in the first paragraph(s), with two different syntaxes:

comma separated list of authors with the affiliation for all of them at the end (e.g. 35191)
comma separated list of authors with individual affiliation between parentheses (e.g. 35197).

Let's take the example of the circular 35197:

{
    "subject": "GRB 231127A: AstroSat CZTI detection",
    "submittedHow": "web",
    "createdOn": 1701084481618,
    "circularId": 35197,
    "submitter": "Gaurav Waratkar at IIT Bombay <gauravwaratkar@iitb.ac.in>",
    "body": "P. K. Navaneeth (IUCAA), G. Waratkar (IITB), A. Vibhute (IUCAA), V. Bhalerao (IITB), D. Bhattacharya (Ashoka University/IUCAA), A. R. Rao (IUCAA/TIFR), and S. Vadawale (PRL) report on behalf of the AstroSat CZTI collaboration:\n\nAnalysis of AstroSat CZTI data with the CIFT framework (Sharma et al., 2021, JApA, 42, 73) showed the detection of a long-duration GRB 231127A which was also detected by CALET (Trigger Num. 1385097164).\n\nThe source was clearly detected in the CZT detectors in the 20-200 keV energy range. The light curve peaks at 2023-11-27 05:14:21.75 UTC. The measured peak count rate associated with the burst is 433 (+159, -60) counts/s above the background in the combined data of all quadrants, with a total of 571 (+157, -130) counts. The local mean background count rate was 325 (+7, -11) counts/s. Using cumulative rates, we measure a T90 of 3.0 (+1.4, -0.8) s.\n\nThe source was also clearly detected in the CsI anticoincidence (Veto) detector in the 100-500 keV energy range. The light curve peaks at 2023-11-27 05:14:21.69 UTC. The measured peak count rate associated with the burst is 822 (+78, -76) counts/s above the background in the combined data of all quadrants, with a total of 2145 (+310, -342) counts. The local mean background count rate was 1452 (+11, -12) counts/s. We measure a T90 of 5 (+2, -3) s from the cumulative Veto light curve. We note that this T90 measurement has higher uncertainty due to the intrinsic 1 s binning of Veto data. \n\nCZTI is built by a TIFR-led consortium of institutes across India, including VSSC, URSC, IUCAA, SAC, and PRL. The Indian Space Research Organisation funded, managed, and facilitated the project.\n\nCZTI GRB detections are reported regularly on the payload site at:\nhttp://astrosat.iucaa.in/czti/?q=grb\n"
}
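To give an idea of what parsing the second syntax involves today, here is a minimal sketch in Python. The function name `parse_author_paragraph` and the output keys are hypothetical and not part of any GCN tooling. It assumes the author list is the first paragraph of the body and that every author is followed by an affiliation in parentheses, so it silently misses the first syntax and the trailing collaboration name.

```python
import re

# Hypothetical helper, not part of GCN: one "Name (Affiliation)" pair per match.
AUTHOR_WITH_AFFILIATION = re.compile(r"([^,()]+?)\s*\(([^()]+)\)")


def parse_author_paragraph(body: str) -> list[dict[str, str]]:
    """Naively extract authors from the first paragraph of a circular body."""
    # Assumption: the author list is the first paragraph of the body.
    first_paragraph = body.split("\n\n", 1)[0]
    authors = []
    for match in AUTHOR_WITH_AFFILIATION.finditer(first_paragraph):
        # Drop surrounding whitespace and the "and" before the last author.
        name = re.sub(r"^and\s+", "", match.group(1).strip())
        authors.append({"name": name, "institute": match.group(2).strip()})
    return authors


body = (
    "P. K. Navaneeth (IUCAA), G. Waratkar (IITB), and S. Vadawale (PRL) "
    "report on behalf of the AstroSat CZTI collaboration:\n\nAnalysis ..."
)
for author in parse_author_paragraph(body):
    print(author)
```

Even on this well-behaved example the collaboration is lost, which is part of why a dedicated field would help.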

It would be really great if authors could be split into individual items and put in a separate field of type array. This improved JSON format could look like the following:

{
    "subject": "GRB 231127A: AstroSat CZTI detection",
    "submittedHow": "web",
    "createdOn": 1701084481618,
    "circularId": 35197,
    "submitter": "Gaurav Waratkar at IIT Bombay <gauravwaratkar@iitb.ac.in>",
    "authors": [
        { "name": "P. K. Navaneeth", "institute": "IUCAA" },
        { "name": "G. Waratkar", "institute": "IITB" },
        { "name": "A. Vibhute", "institute": "IUCAA" },
        { "name": "V. Bhalerao", "institute": "IITB" },
        { "name": "D. Bhattacharya", "institute": "Ashoka University/IUCAA" },
        { "name": "A. R. Rao", "institute": "IUCAA/TIFR" },
        { "name": "S. Vadawale", "institute": "PRL" },
        { "name": "AstroSat CZTI collaboration" } 
    ],
    "body": "..."
}

One could go further and split each author item into first name and last name, and perhaps also distinguish an individual author from a collaboration:

{
    "subject": "GRB 231127A: AstroSat CZTI detection",
    "submittedHow": "web",
    "createdOn": 1701084481618,
    "circularId": 35197,
    "submitter": "Gaurav Waratkar at IIT Bombay <gauravwaratkar@iitb.ac.in>",
    "authors": [
        { "firstname": "P. K.", "lastname": "Navaneeth", "institute": "IUCAA" },
        { "firstname": "G.", "lastname": "Waratkar", "institute": "IITB" },
        { "firstname": "A.", "lastname": "Vibhute", "institute": "IUCAA" },
        { "firstname": "V.", "lastname": "Bhalerao", "institute": "IITB" },
        { "firstname": "D.", "lastname": "Bhattacharya", "institute": "Ashoka University/IUCAA" },
        { "firstname": "A. R.", "lastname": "Rao", "institute": "IUCAA/TIFR" },
        { "firstname": "S.", "lastname": "Vadawale", "institute": "PRL" },
        { "collaboration": "AstroSat CZTI collaboration" } 
    ],
    "body": "..."
}
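On the consumer side, the structure above could be modelled roughly as follows. This is only a sketch under the assumption that the field names are exactly those of the example (`firstname`, `lastname`, `institute`, `collaboration`); the class and function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Union


@dataclass
class Person:
    firstname: str
    lastname: str
    institute: str


@dataclass
class Collaboration:
    collaboration: str


Author = Union[Person, Collaboration]


def author_from_dict(item: dict) -> Author:
    """Pick the author type from the keys present in the JSON item."""
    if "collaboration" in item:
        return Collaboration(collaboration=item["collaboration"])
    return Person(item["firstname"], item["lastname"], item["institute"])


def authors_to_json(authors: list[Author]) -> str:
    """Serialize authors back to the JSON array shown in the example."""
    return json.dumps([asdict(author) for author in authors])


raw = '[{"firstname": "G.", "lastname": "Waratkar", "institute": "IITB"}, {"collaboration": "AstroSat CZTI collaboration"}]'
authors = [author_from_dict(item) for item in json.loads(raw)]
print(authors_to_json(authors))
```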

Maybe the submitter field could also be improved the same way.

lpsinger commented 1 year ago

@gmantele, we would love to do this! We need to have the author list broken apart and the author names broken into family name and given name in order to mint DOIs (see #299).

We thought about parsing the author lists that people traditionally enter into the Circular bodies, but we have hesitated to do that due to the subtleties of names across cultures (i.e., multiple last names, last names with prepositions, etc.).
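As a small illustration of the problem (a sketch only; `naive_split` is a hypothetical helper, not something in our code base), the obvious rule "the last word is the family name" already breaks on family names with prepositions and on multi-part family names:

```python
def naive_split(full_name: str) -> tuple[str, str]:
    """Treat everything before the last space as the given name."""
    given, _, family = full_name.rpartition(" ")
    return given, family


# The family name is "van Beethoven", but the preposition ends up in the given name.
print(naive_split("Ludwig van Beethoven"))
# The family name is "García Márquez", but only "Márquez" is kept.
print(naive_split("Gabriel García Márquez"))
```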

The alternative is to allow users to enter that data themselves when they are composing their Circulars, perhaps by allowing them to select other GCN users as co-authors, and perhaps with some auto-completion.

So... this seems complicated. Do you have any ideas?

gmantele commented 1 year ago

Thanks for your quick answer.

We already do this automatic parsing of the author lists, and I can confirm that it is not trivial. We generally do some parsing, and then the names are always checked (and fixed if necessary) by humans. This is clearly not reliable, hence this GitHub issue.

I have never submitted a circular, so I had hoped that you collected the author names through some kind of HTML form and then formatted the circular body as it is now. But apparently you do not; I assume the circular submission form is a free-text area in which submitters write the author list and the actual body the way they appear now.

My idea for an improvement would then be to update the submission form so that authors can be entered individually (with or without separate fields for first name and last name). But I also assume the HTML form was designed the way it currently is on purpose, wasn't it? If so, this idea would probably not be a valid option for you.

lpsinger commented 1 year ago

Currently, the form has a subject field and a body field, and that's it. By convention, the author list is the first paragraph in the body. That's because originally GCN Circulars were just emails. We don't have to be limited by that now, but as you can imagine we want to keep the data entry burden as low as possible for users given that GCN Circulars are supposed to be rapid communications! So we don't want to design a user interface that requires a user to manually enter the first name, last name, and affiliation of every single author into a table. And we also want to integrate with ORCID. This is mostly a UX design problem.

gmantele commented 1 year ago

That's indeed what I assumed and it makes sense.

Then, unfortunately I don't have a good solution for the moment :(

Courey commented 1 year ago

How difficult would it be to change only the convention, and ask that the authors be written like so:

first_name: John last_name: Doe affiliation: ABCD, first_name: Jane last_name: Doe affiliation: EFG

That would at least give you something to dependably pattern match. It IS a request on the user to type additional words, but in the same place they do currently, in the first paragraph. Then we could parse it on the back end as we create the circular and add an array of formatted authors to it if the pattern matcher gets any matches. Basically, your circular would only be citable if you conform to the standard properly.

I do not know if the draw of being able to be cited would encourage people to use the proper pattern, but it might be worth the tradeoff. And it's still pattern matching and parsing, which isn't ideal since people would have to be precise and we are relying on humans to do so. Could the value of citation and ease of processing be enough to encourage conforming to a standard if widely publicized enough?
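A rough sketch of what that back-end matching could look like, in Python for brevity. The function name and the output keys are made up for illustration, and it assumes the three labels always appear in the order shown above, with a comma between authors.

```python
import re

# Assumed label order: first_name, last_name, affiliation; authors separated by commas.
LABELLED_AUTHOR = re.compile(
    r"first_name:\s*(.+?)\s+last_name:\s*(.+?)\s+affiliation:\s*(.+?)\s*"
    r"(?:,\s*(?=first_name:)|$)"
)


def parse_labelled_authors(paragraph: str) -> list[dict[str, str]]:
    """Return one dict per matched author, or an empty list if nothing matches."""
    return [
        {"firstname": first, "lastname": last, "institute": affiliation}
        for first, last, affiliation in LABELLED_AUTHOR.findall(paragraph)
    ]


line = (
    "first_name: John last_name: Doe affiliation: ABCD, "
    "first_name: Jane last_name: Doe affiliation: EFG"
)
print(parse_labelled_authors(line))
```

An empty result would mean the circular is stored as it is today, without an authors array.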

gmantele commented 12 months ago

It looks like a good idea to me. The syntax could even be simpler (a comma as the separator between first and last name; affiliations in parentheses, separated by a semicolon or comma if there are several; and finally a semicolon to separate authors):

John, Doe (ABCD) ; Jane, Doe (EFG)

It is still structured while being simple to write and human-readable. However, one probably has to think about some corner cases:

collaboration (no first name, no last name, no affiliation; how to list the authors who are in the collaboration?),
same affiliation for multiple authors (solutions: 1/ affiliation references (but this becomes almost as complicated as in LaTeX), 2/ duplication for each author (redundant, but quite easy to do))
...
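For what it is worth, here is a minimal Python sketch of a parser for this simpler syntax. It rests on assumptions that are still open questions above: an entry without a comma is treated as a collaboration, a shared affiliation is simply duplicated for each author, and the function name and output keys (including the `institutes` list) are hypothetical.

```python
import re

# Split on semicolons that are not inside parentheses.
AUTHOR_SEPARATOR = re.compile(r";(?![^()]*\))")
# "names" is everything before the optional "(affiliations)" group.
ENTRY = re.compile(r"^(?P<names>[^()]+?)\s*(?:\((?P<affils>[^()]*)\))?$")


def parse_simple_authors(line: str) -> list[dict]:
    authors = []
    for entry in AUTHOR_SEPARATOR.split(line):
        entry = entry.strip()
        if not entry:
            continue
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unparseable author entry: {entry!r}")
        names = [part.strip() for part in match["names"].split(",")]
        affils = [a.strip() for a in re.split(r"[;,]", match["affils"] or "") if a.strip()]
        if len(names) == 2:
            authors.append({"firstname": names[0], "lastname": names[1], "institutes": affils})
        elif len(names) == 1:
            # Assumption: no comma means a collaboration rather than a person.
            authors.append({"collaboration": names[0]})
        else:
            raise ValueError(f"too many commas in author entry: {entry!r}")
    return authors


print(parse_simple_authors("John, Doe (ABCD) ; Jane, Doe (EFG; HIJ) ; Some collaboration"))
```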

lpsinger commented 1 day ago

Duplicate of #893.

nasa-gcn / gcn.nasa.gov

Array of authors in JSON serialization of circulars #1707