inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Split/group authors #7

Closed rmzelle closed 10 years ago

rmzelle commented 10 years ago

For

  1. Abe, S., A. Furuya, T. Saito, and K. Takayama. November 1962. Method of producing L-malic acid by fermentation. U.S. patent 3,063,910.

all the individual name parts end up in their own bubble, whereas the JSON output just contains a single string:

"author":"1. Abe, S., A. Furuya, T. Saito, and K. Takayama"

Wouldn't it make more sense to identify entire names in both cases? I.e. "Abe, S." instead of "Abe," and "S.," after parsing, and "author":{"Abe, S.", "A. Furuya", etc.} in the JSON?

inukshuk commented 10 years ago

We do actually parse the names into their constituent parts, but the simple JSON output does not reflect that right now. I actually wanted to promote the CiteProc/JSON format instead – if you use it, you will see that the names are returned like this:

"author": [
  {"family":"Abe","given":"S."},
  {"family":"Furuya","given":"A."},
  {"family":"Saito","given":"T."},
  {"family":"Takayama","given":"K."}
]

It's also been suggested to use bibJSON instead of the simple JSON output; do you think this is a good idea or would you prefer to promote the CiteProc format?
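To make the difference between the two outputs concrete, here is a minimal Python sketch that turns the single author string from the simple JSON output into the CiteProc-style list shown above. It is not AnyStyle's actual name normalizer (AnyStyle is written in Ruby); `split_authors` is a hypothetical helper, and its heuristics (a lone word followed by initials is an inverted name, anything else is "given family") are assumptions that only cover simple Western-style names like the ones in this example.

```python
import re


def is_initials(text):
    """True for strings made only of initials, such as "S." or "J. R."."""
    return re.fullmatch(r"(?:[A-Z]\.\s*)+", text) is not None


def split_authors(author_field):
    """Split a raw author string into CSL-JSON-style name dicts (heuristic)."""
    # Drop a leading list number such as "1. " that leaked into the field.
    text = re.sub(r"^\s*\d+\.\s+", "", author_field)
    # Treat "and" / "&" as just another separator between names.
    text = re.sub(r",?\s+(?:and|&)\s+", ", ", text)
    parts = [p.strip() for p in text.split(",") if p.strip()]

    names = []
    i = 0
    while i < len(parts):
        part = parts[i]
        inverted = (
            i + 1 < len(parts)
            and " " not in part
            and not is_initials(part)
            and is_initials(parts[i + 1])
        )
        if inverted:
            # Inverted form: "Abe, S." -> family "Abe", given "S."
            names.append({"family": part, "given": parts[i + 1]})
            i += 2
        else:
            # Natural order: "A. Furuya" -> given "A.", family "Furuya"
            given, _, family = part.rpartition(" ")
            names.append({"family": family, "given": given})
            i += 1
    return names
```

Calling `split_authors("1. Abe, S., A. Furuya, T. Saito, and K. Takayama")` yields the four family/given pairs listed above; names with particles, suffixes, or multi-word family names would need a real name parser.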

rmzelle commented 10 years ago

I don't know enough about bibJSON, but the CiteProc format would obviously be preferred if you chain the anystyle-parser output to another CSL-based tool (e.g., one of my wishes is a tool that parses references and recommends closely matching CSL styles).

inukshuk commented 10 years ago

I was thinking of adding a formatted reference list as an output option (with a CSL style selector).

For the comparison/prediction tool we could use the API – we'd just need a good method of comparison, but then this should be very easy. I'll see if I can come up with a quick prototype!

rmzelle commented 10 years ago

(Mendeley's CSL editor [http://editor.citationstyles.org/visualEditor/] currently requires users to reformat a fixed set of metadata in the desired format, which the tool then compares to prerendered output of all independent CSL styles. It would obviously be much more user-friendly if users could just copy and paste existing references in the desired format and get recommendations based on those.)
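One possible "method of comparison" for such a recommendation tool is plain string similarity: parse the pasted reference, render the parsed metadata in every candidate style, and rank the styles by how closely each rendering matches what the user pasted. The Python sketch below covers only the ranking step and uses the standard library's `difflib.SequenceMatcher`. The function name `rank_styles` and the shape of the `prerendered` mapping are assumptions for illustration; the thread does not say which comparison method was eventually used.

```python
from difflib import SequenceMatcher


def rank_styles(pasted_reference, prerendered):
    """Rank styles by similarity between the pasted text and each rendering.

    `prerendered` maps a style identifier to the same metadata rendered in
    that style (producing these renderings is out of scope for this sketch).
    Returns (style_id, score) pairs, best match first; scores lie in [0, 1].
    """
    scored = [
        (style_id, SequenceMatcher(None, pasted_reference, rendered).ratio())
        for style_id, rendered in prerendered.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Character-level similarity is sensitive to punctuation and ordering, which is what distinguishes most citation styles, but a real tool would probably need to compare several references and weigh formatting such as italics as well.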

rmzelle commented 10 years ago

We do actually parse the names into their constituent parts, but the simple JSON output right now does not reflect that.

And you use the simple JSON output to render the bubbles?

inukshuk commented 10 years ago

Not exactly. The parsing process goes something like this: tokenize -> combine tokens into labelled segments/groups -> normalize each segment. With bibliographic data the labelling step is typically the difficult one: it is relatively easy to parse names or dates, for example, once you are fairly certain that the string in question actually represents names or dates.
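The three stages can be sketched in a few lines of Python. This is only an illustration of the pipeline's shape, not AnyStyle's code: the labelling step here is a toy rule ("author until the first month or year, then date, then title") standing in for the machine learning model, and every function name is hypothetical.

```python
import re
from itertools import groupby

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def tokenize(reference):
    """Stage 1: split the reference into whitespace-separated tokens."""
    return reference.split()


def is_date_token(token):
    word = token.strip(".,")
    return word in MONTHS or re.fullmatch(r"\d{4}", word) is not None


def label(tokens):
    """Stage 2a: one label per token (toy rule in place of a trained model)."""
    labels, state = [], "author"
    for token in tokens:
        if state != "title" and is_date_token(token):
            state = "date"
        elif state == "date":
            state = "title"
        labels.append(state)
    return labels


def segment(tokens, labels):
    """Stage 2b: merge runs of equally labelled tokens into segments."""
    pairs = zip(labels, tokens)
    return [
        (name, " ".join(token for _, token in group))
        for name, group in groupby(pairs, key=lambda pair: pair[0])
    ]


def normalize(name, text):
    """Stage 3: label-specific clean-up, applied once the label is known."""
    text = text.rstrip(" .,")
    if name == "date":
        match = re.search(r"\d{4}", text)
        return match.group(0) if match else text
    return text


def parse(reference):
    tokens = tokenize(reference)
    return {name: normalize(name, text) for name, text in segment(tokens, label(tokens))}
```

Note that the "and" token stays inside the author segment after stage 2; only a name-specific normalizer in stage 3 would split the segment into individual names and discard it.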

So the rationale behind the editor is that we use a machine learning algorithm to find the segments and then put those into the editor for review. Then, when you click save, we apply the normalizer algorithms on the segments.

With the simple JSON format my idea was to return a result that stays close to your input, with very little normalization – that's why this output will typically be much closer to what is rendered in the bubbles. I doubt that it is a very useful output – I guess most people would use either BibTeX or CiteProc, depending on what they want to do with the data.

rmzelle commented 10 years ago

I asked because "and" shows up among the "author" bubbles, which is a bit strange.