inukshuk / anystyle-cli

AnyStyle Command Line Interface
BSD 2-Clause "Simplified" License

Train Chinese citation model #7

Open sati-bodhi opened 5 years ago

sati-bodhi commented 5 years ago

Is it possible to train anystyle to recognise citations and bibliographic entries formatted in Chinese?

For example:

舒仁輝,《東都事略與宋史比較研究》。北京:商務,2007年。

translates to:

@book{舒仁輝DongDuShiLueYuSongShiBiJiaoYanJiu2007,
  language = {zh},
  location = {{北京}},
  title = {東都事略與宋史比較研究},
  publisher = {{商務}},
  date = {2007},
  author = {{舒仁輝}}
}

What would the training file look like?

inukshuk commented 5 years ago

To train the parser model we would add training data like this:

<sequence>
  <author>舒仁輝,</author><title>《東都事略與宋史比較研究》。</title><location>北京:</location><publisher>商務,</publisher><date>2007年。</date>
</sequence>

This would go into a data set (like the core set used by the default model). To improve the results further, we may want to add Chinese-specific information to the feature/normalizer code (for example, words that strongly indicate certain segments, such as `年` being a strong indicator for a date).
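As a rough illustration of how such training entries could be produced in bulk, here is a minimal Ruby sketch that renders labeled segments as one `<sequence>` element in the format shown above. The helper `to_sequence` is hypothetical and not part of AnyStyle; it only builds the text, and the resulting sequences would still need to be placed inside whatever root element the data set file uses.

```ruby
# Hypothetical helper (not part of AnyStyle): renders labeled segments
# as a single <sequence> element like the hand-written example above.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

def to_sequence(segments)
  body = segments.map do |label, text|
    # Escape characters that are special in XML text nodes.
    escaped = text.gsub(/[&<>]/, XML_ESCAPES)
    "<#{label}>#{escaped}</#{label}>"
  end.join
  "<sequence>\n  #{body}\n</sequence>"
end

# The bibliography entry from this thread, labeled by hand.
segments = [
  [:author,    '舒仁輝,'],
  [:title,     '《東都事略與宋史比較研究》。'],
  [:location,  '北京:'],
  [:publisher, '商務,'],
  [:date,      '2007年。']
]

puts to_sequence(segments)
```

Note that each segment keeps its trailing punctuation, as in the hand-written example.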

Most importantly, however, we need to add tokenization rules. The parser currently splits each reference into tokens using whitespace, but I assume this won't work for Chinese? Taking your example, we'd have to split on `,`, `。`, and `:`. Can you tell me a reliable or common way to split Chinese sentences?

sati-bodhi commented 5 years ago

We rely primarily on punctuation:

- 《》 for book titles
- 〈〉 for article titles
- () to enclose the (location: publisher, year) information in a footnote. (The example above, without parentheses, is in bibliography format.)

Different segments of the bibliographic data are normally separated either by the marks above or by commas: `,` and `、`. The `、` usually indicates continuation, so a work with two authors could be cited as: author1、author2,《title》(location:publisher, year). (This is not always the case, though, because there is no consensus.)

Note that Chinese punctuation marks are not usually followed by a space, and they can be either full-width (`，`) or half-width (`,`).
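To make the punctuation-based splitting concrete, here is a minimal Ruby sketch. It is not AnyStyle's actual tokenizer; the method name `split_reference` and the exact separator set are assumptions for illustration. It cuts a reference after each comma, colon, full stop, or `、`, accepting both half-width and full-width forms, and keeps the separator attached to the preceding token.

```ruby
# Assumed separator set: half-width and full-width comma and colon,
# the ideographic full stop (U+3002) and the ideographic comma (U+3001).
SEPARATORS = ",\uFF0C:\uFF1A\u3002\u3001"

# One token = a run of non-separator characters plus any separators
# that immediately follow it.
TOKEN = /[^#{SEPARATORS}]+[#{SEPARATORS}]*/

# Hypothetical helper, not AnyStyle's tokenizer.
def split_reference(text)
  # Strip whitespace, since a space may or may not follow a separator.
  text.scan(TOKEN).map(&:strip).reject(&:empty?)
end

p split_reference('舒仁輝,《東都事略與宋史比較研究》。北京:商務,2007年。')
```

For the example entry this yields five tokens: author, title, location, publisher, and date, each with its trailing punctuation. Brackets such as 《》 stay inside their token, so they remain available as a hint for labeling.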

There are also certain words, like ‘年’, that strongly indicate what the preceding string means; similar marker words exist for author, editor, and translator.

收入 is similar to "in" in English citations: it indicates that what follows is the book title or series title to which the preceding article or chapter title belongs.
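Marker words like these could feed into the parser as extra features. The following Ruby sketch only illustrates the idea; `marker_feature` is a hypothetical function, AnyStyle's real feature classes are organised differently, and only the two markers named in this thread (年 and 收入) are covered.

```ruby
# Hypothetical feature function: maps a token to a coarse hint
# based on the marker words mentioned in this thread.
def marker_feature(token)
  case token
  when /\A\d{1,4}年/ then :date       # digits followed by 年, e.g. "2007年。"
  when /\A收入/      then :container  # 收入 introduces the containing work
  else :none
  end
end

%w[2007年。 收入《title》 商務,].each do |token|
  puts "#{token} -> #{marker_feature(token)}"
end
```

A trained model would weigh such hints together with its other features rather than treat them as hard rules.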

If you really want to split Chinese strings into semantically sound words, you'll need to make use of a word segmentation system such as this.

I don't think this is strictly necessary as far as citation parsing is concerned.

sati-bodhi commented 5 years ago

Ruby has its own Chinese word-tokenizer over here.

sati-bodhi commented 5 years ago

Could you create a fork with the relevant changes so that I can build on it from there?

inukshuk commented 5 years ago

Yes, I will try to add your information above to the tokenizer this week!

inukshuk commented 5 years ago

I've added preliminary support for this to anystyle; it's already available on master if you want to play around with it. I will make a release including these changes in the next couple of days and, if you like, post more detailed instructions for you here.

I've trained your example above and this seems to be working fine:

[Screenshot from 2018-09-14 13-32-00]

But we will have to add more examples to the training set in order to see how viable this approach is.

PS: never mind the language being detected as pl in the screenshot above; the language detection is an optional Gem that delivers weird results sometimes!

sati-bodhi commented 5 years ago

I've created a generic citation style for humanities Chinese to be used with Zotero. If you are familiar with Zotero, you can use this to create bibliographic samples for testing and/or training. Just change the file extension to .csl and install. I can send you a sample for each item type if you like.

humanities-chinese.txt

sati-bodhi commented 5 years ago

野口善敬書目.txt

Hey, I've got another set of sample Japanese bibliography data (similar to, but slightly different from, the previous style) which I need to parse. Can you provide instructions for the training? Thanks!