Open sati-bodhi opened 5 years ago
To train the parser model we would add training data like this:

```xml
<sequence>
  <author>舒仁輝,</author><title>《東都事略與宋史比較研究》。</title><location>北京:</location><publisher>商務,</publisher><date>2007年。</date>
</sequence>
```
This would be added to a data-set (like the core set used by the default model). To improve the results further, we may want to add Chinese-specific information to the feature/normalizer code (for example, for words which strongly indicate certain segments, such as 年 being a strong indicator for date).
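As a rough sketch of what such a Chinese-specific feature could look like (the helper name `looks_like_date?` is hypothetical, not part of anystyle's actual feature code), a token containing digits followed by 年 is a strong date signal:

```ruby
# Hypothetical date-indicator feature: digits followed by 年 ("year")
# strongly suggest a date token such as "2007年。".
def looks_like_date?(token)
  token.match?(/\d+\s*年/)
end
```

A real feature extractor would emit this as one boolean feature among many for the sequence-labeling model, rather than deciding the label on its own.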
Most importantly, however, we need to add tokenization rules. The parser currently splits each reference into tokens using whitespace, but I assume this won't work for Chinese? Taking your example, we'd have to split on `,`, `。`, and `:` -- can you tell me a reliable or common way Chinese sentences are split?
We use punctuation, primarily:

- `《》` for book-title
- `〈〉` for article-title
- `()` to separate `(location: publisher, year)` information in a footnote (the example above, without parentheses, belongs to a bibliography)

Different segments of the bibliographic data would normally be separated either by the above or by commas: `,` and `、`. `、` is usually an indicator of continuation, so a work with two authors could be cited as:

author1、author2,《title》(location: publisher, year)

(But this is not always the case, due to a lack of consensus.)
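The splitting rules above could be sketched roughly like this in Ruby (a hypothetical stand-alone tokenizer, not anystyle's actual implementation): split *after* separators such as `,`, `、`, `。`, `;`, and `:`, and *before* opening marks such as `《`, `〈`, and `(`, so that title quotes stay attached to their titles:

```ruby
# Split after trailing separators, and before opening title/parenthesis marks,
# so punctuation stays attached to the token it terminates (as in the
# training example: "舒仁輝," / "《東都事略與宋史比較研究》。" / ...).
# Both full-width and half-width variants are listed where they differ.
BOUNDARY = /(?<=[,,、。;;::])|(?=[《〈((])/

def tokenize(reference)
  reference.split(BOUNDARY).map(&:strip).reject(&:empty?)
end
```

For the example reference this yields `["舒仁輝,", "《東都事略與宋史比較研究》。", "北京:", "商務,", "2007年。"]`, matching the segment boundaries in the training sample above.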
Note that Chinese punctuation is not usually followed by a space, and can be either full-width (`,`) or half-width (`,`).
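Because both widths occur in the wild, a normalizer could fold full-width forms into their half-width equivalents before matching. A minimal sketch (the `normalize_width` helper is hypothetical), relying on the fact that the full-width block U+FF01–U+FF5E mirrors ASCII `!`–`~`:

```ruby
# Fold full-width ASCII-range characters (U+FF01..U+FF5E) to half-width.
# CJK-only marks such as 、 and 。 have no ASCII counterpart and pass through.
def normalize_width(str)
  str.tr("!-~", "!-~")
end
```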
There are also certain words, like 年, that strongly indicate what the preceding string means:

- 著 and/or 撰 for author
- 編 for editor
- 譯 for translator
- 收入 is similar to "in" in English citations: it indicates that what follows is a book-title or series-title to which the preceding article- or chapter-title belongs.
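The indicator words above map naturally onto segment labels; a minimal Ruby sketch (the `ROLE_INDICATORS` table and `role_for` helper are hypothetical, not anystyle's actual normalizer code):

```ruby
# Hypothetical mapping from the indicator words listed above to segment roles.
ROLE_INDICATORS = {
  "著"   => :author,     # "written by"
  "撰"   => :author,     # alternative "written by"
  "編"   => :editor,
  "譯"   => :translator,
  "收入" => :container   # like "in": what follows is the containing title
}.freeze

# Return the role suggested by the first matching indicator, or nil.
def role_for(token)
  ROLE_INDICATORS.each { |marker, role| return role if token.include?(marker) }
  nil
end
```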
If you really want to split Chinese strings into semantically sound words, you'll need to make use of a word segmentation system such as this. I think this is not absolutely necessary as far as citation parsing is concerned.
Ruby has its own Chinese word-tokenizer over here.
Could you build me a fork with the relevant changes so that I can build on it from there?
Yes, I will try to add your information above to the tokenizer this week!
I've added preliminary support for this to anystyle; it's already available on `master` if you want to play around with it. I will make a release including these changes in the next couple of days and, if you like, post more detailed instructions for you here.
I've trained your example above and this seems to be working fine:
But we will have to add more examples to the training set in order to see how viable this approach is.
PS: never mind the language being detected as `pl` in the screenshot above; the language detection is an optional gem that delivers weird results sometimes!
I've created a generic citation style for humanities Chinese to be used with Zotero. If you are familiar with Zotero, you can use this to create bibliographic samples for testing and/or training. Just change the file extension to `.csl` and install. I can send you a sample for each item type if you like.
Hey, I've got another set of sample Japanese bibliography data (similar, but slightly different from the previous style) which I need to parse. Can you provide instructions for the training? Thanks!
Is it possible to train anystyle to recognise citations and bibliographic entries formatted in Chinese? For example:

translates to:

What would the training file look like?