cliuxinxin closed 2 years ago
Having support for Chinese would be amazing and although I only speak about ten words of Chinese (and can't write them!) I'd still try and give you whatever support I could to add it.
In https://github.com/explosion/holmes-extractor/tree/master/holmes_extractor/lang you can see the two directories where the language-specific code for English and German is located. Ideally you should be able to add a directory https://github.com/explosion/holmes-extractor/tree/master/holmes_extractor/lang/zh modelled on https://github.com/explosion/holmes-extractor/tree/master/holmes_extractor/lang/en, add the Chinese code and everything should work. In practice, however, because English and German are typologically similar and genetically related, it may be that some features are currently in the language-independent code that should actually be in the English-specific and German-specific code as they don't apply to a language like Chinese. I believe you would find any such features in the process of adding and testing a Chinese directory, and I could help to move them from the language-independent code into the English and German directories.
At present Holmes expects Coreferee models to be available for all spaCy models it loads. If you wanted Holmes to work with Chinese including coreference resolution, the first step would actually be to add Chinese support to Coreferee (https://github.com/explosion/coreferee#adding-support-for-a-new-language). I realise that this may seem a very big step, though, and if you are not that interested in coreference resolution and don't want to do this, you could start off by commenting out the code that tries to load the Coreferee model (https://github.com/explosion/holmes-extractor/blob/master/holmes_extractor/manager.py#L127) and instantiating the Holmes manager with perform_coreference_resolution=False. I plan to make loading the Coreferee model optional in the next version of Holmes in any case, to enable people to use Holmes with custom models for the languages that are already supported.
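As a rough sketch of that second option, instantiating the manager without coreference resolution might look something like the following. The helper name `build_manager_without_coreference` and the model name `zh_core_web_sm` are assumptions for illustration only, and the call would only succeed once the Coreferee loading code has been commented out and a working `zh` language directory exists.

```python
def build_manager_without_coreference(model_name: str):
    """Return a Holmes manager with coreference resolution switched off.

    Hypothetical helper, not part of Holmes. The import is done lazily so
    that this sketch can be loaded even where Holmes is not installed.
    """
    import holmes_extractor as holmes

    # perform_coreference_resolution=False is the setting suggested above;
    # model_name is whichever spaCy model Holmes should load.
    return holmes.Manager(
        model=model_name,
        perform_coreference_resolution=False,
    )
```

It could then be called as `build_manager_without_coreference("zh_core_web_sm")`, assuming that spaCy model is installed.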
Thank you for your contribution to open source.
I took a brief look at the code and found it to be more than I can handle, in terms of both the code itself and my language skills.
Chinese is character-based: each character has its own meaning, and characters combine into words, which take on new meanings.
For example, 地 means ground. But if you add 球, which means ball, you get 地球, which means Earth.
If I want to support Chinese, what do I need to do?
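To make the 地 / 地球 example concrete, here is a toy sketch of dictionary-based word segmentation using forward maximum matching. It is only an illustration of why character sequences have to be grouped into words before any further analysis. The vocabulary and the function name `segment` are made up for this example, and this is not how spaCy or Holmes actually segments Chinese text.

```python
def segment(text, vocabulary):
    """Split text into words by forward maximum matching.

    At each position the longest vocabulary entry that matches is taken;
    a character with no matching entry becomes a one-character token.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary with glosses taken from the example above.
GLOSSES = {"地": "ground", "球": "ball", "地球": "Earth"}

for sample in ("地", "球", "地球"):
    words = segment(sample, GLOSSES)
    print(sample, "->", [(word, GLOSSES.get(word)) for word in words])
```

With this vocabulary, 地球 is kept as a single word meaning Earth rather than being split into the separate characters for ground and ball.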