nieldlr / hanzi

HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
http://hanzijs.com
MIT License
375 stars 56 forks

Further collaboration #20

Open tony opened 10 years ago

tony commented 10 years ago

This is just ideas I've been jotting: http://cihai.readthedocs.org/en/latest/spec.html.

I'm still bike-shedding this and also in the process of testing this spec on my own python library.

There are some areas where I'm going to plow forward and see what comes up (such as only using .get and returning keys based on 'hits' returned from middleware, then a .reverse to find reverse lookups by decomposition, radical, definitions, etc.; there's a rough sketch below). I think that will generate skepticism, but I want to put it to the test first. When I have more data / a working implementation of it, I can come back and describe what the experience is like.
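To make the .get / .reverse idea a little more concrete, here's a minimal sketch. None of these names or fields come from the cihai spec; they're placeholders to illustrate forward lookup by character vs. reverse lookup by field:

```python
# Hypothetical sketch of the .get / .reverse lookup idea; names are placeholders.
SAMPLE_ROWS = [
    {"char": "好", "definition": "good", "radical": "女", "decomposition": "女子"},
    {"char": "你", "definition": "you", "radical": "亻", "decomposition": "亻尔"},
]


class Lookup:
    def __init__(self, rows):
        self.rows = rows

    def get(self, char):
        """Forward lookup: return every row ('hit') whose character matches."""
        return [row for row in self.rows if row["char"] == char]

    def reverse(self, **criteria):
        """Reverse lookup by decomposition, radical, definition, etc."""
        return [
            row
            for row in self.rows
            if all(value in row.get(field, "") for field, value in criteria.items())
        ]


if __name__ == "__main__":
    table = Lookup(SAMPLE_ROWS)
    print(table.get("好"))                  # forward lookup by character
    print(table.reverse(radical="女"))      # reverse lookup by radical
    print(table.reverse(definition="you"))  # reverse lookup by definition
```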

Here are some sure things I think, what do you think of these?:

  1. Neither the name of the project nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
    • Hanzi should have an extension feature for adding libraries. I will elaborate on the reasoning for this more; it has some history to it.

Plugins can still be available under BSD or MIT licenses (IANAL).

Here are some things I could use help on:

Edit: rewording.

tony commented 10 years ago

I am still hashing over things on my end. I will update here next week.

nieldlr commented 10 years ago

Hey @tony,

sorry for the late reply here. Been traveling around and having holidays. I'll look at this this weekend when I'm back home! Looks exciting.

tony commented 10 years ago

@nieldlr Hey, there's no need to read over that. I've been iterating over this often so it will probably take me another week or two until I have more.

As an update to what I'm experimenting with now:

Trying to normalize the raw datasets and make them consistent / compatible with relational / table-based lookups.

Take decomposition (cjk-decomp.txt, formerly known as groovy.csv) for example:

https://github.com/cburgmer/cjklib/blob/master/scripts/convertdecomposition.py is the script used to convert the decomposition data (cjk-decomp.txt, which was formerly known as groovy.csv).

The result: https://github.com/cburgmer/cjklib/blob/master/cjklib/data/characterdecomposition.csv

Doing this conversion ahead of time saves the effort of expensive / time-consuming lookups later.

I want to rewrite cjklib's convertdecomposition.py to convert the new cjk-decomp.txt format. I may pick an easier example first; I'm trying to repeat this with datasets such as https://github.com/nieldlr/Hanzi/tree/master/lib/dicts.

I will ping back when I have more to show.

albertolovell commented 10 years ago

@tony Hey, I've been following this repository for a while hoping that it would get pushed along. Do you have any foundational stuff that you need a hand with? My programming skills are novice at best, but I have a lot of extra time on my hands. Let me know, I would love to help out.

tony commented 10 years ago

@beituo : That is awesome to hear!

I am trying to 1) create a list of all the CJK data sources and their licenses, and 2) create a Python script to convert them to an idiomatic, table-like format.

There is one coding task in particular that I am looking to get off my plate:

Character decomposition:

Importance level: 99. Difficulty: Medium-Hard. Time: Medium-Long.

Character decomposition is one of the coolest parts of the Chinese writing system.

https://github.com/cburgmer/cjklib/blob/master/scripts/groovyset.csv is a CSV for decomposing Chinese characters.

https://github.com/cburgmer/cjklib/blob/master/scripts/convertdecomposition.py is a script that turns this set into a relational-friendly CSV file.

You can see it at https://github.com/cburgmer/cjklib/blob/master/cjklib/data/characterdecomposition.csv.

This one is hard and time-consuming. The issue is that the new decomposition set has been changed from groovy.csv to https://github.com/nieldlr/Hanzi/blob/master/lib/dicts/cjk-decomp.txt. The instructions for the new format are at http://cjkdecomp.codeplex.com/wikipage?title=cjk-decomp&referringTitle=Home.

This would only require novice Python skills, but it takes time to wrap one's brain around how cjk-decomp.txt works. It's probably a superb learning exercise. I could help you with it if you want to take a crack at it.
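If it helps to get started, here is a very rough sketch of what such a converter could look like. The line format is my assumption based on the docs linked above (roughly char:type(component,component)), and the output columns are placeholders rather than cjklib's actual schema:

```python
"""Rough sketch: convert cjk-decomp.txt into a relational-friendly CSV.

Assumes each data line looks roughly like `好:a(女,子)`, i.e. a character (or a
numeric id for an intermediate glyph), a decomposition-type code, and a
parenthesised, comma-separated component list. Verify against the format docs.
"""
import csv
import re
import sys

LINE_RE = re.compile(r"^(?P<char>[^:]+):(?P<type>[\w/]+)\((?P<components>[^)]*)\)\s*$")


def convert(in_path, out_path):
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["char", "decomposition_type", "components"])
        for line in src:
            match = LINE_RE.match(line.strip())
            if not match:
                continue  # skip blank lines / comments / anything unexpected
            components = match.group("components").replace(",", " ")
            writer.writerow([match.group("char"), match.group("type"), components])


if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```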

Exploring / charting the internet for what Chinese datasets are available and their copyrights:

Importance level: 99. Difficulty: Easy. Time: Medium.

Right now it's a tad ambiguous. The README.rst of Hanzi, the tops of the files in cjklib/data, and https://github.com/cburgmer/cjklib/issues/3 touch on this. We need to create a gist and put all this information in one place.

Compiling a list of CJK datasets: where they are from, the author, the license (if any), and whether they are the raw dataset or a derivative produced by a script to make them friendlier to parse. The issue is that there are a few CJK projects pulling in datasets from various places, and it's not readily apparent what the copyright on them is.

https://github.com/nieldlr/Hanzi/tree/master/lib/dicts
https://github.com/nieldlr/Hanzi/tree/master/lib/data
https://github.com/cburgmer/cjklib/tree/master/cjklib/data
https://github.com/bytesource/chinese_vocab/tree/master/hsk_data

Ideally, we want Hanzi, Cihai, chinese_vocab, etc. to all have access to the same datasets under a permissive license.
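As a sketch, the inventory could be a single table / gist along these lines (the rows below are only illustrative; the author and license columns in particular need to be verified per dataset):

```
dataset          upstream source                             author                license   raw or derived?
Unihan           http://www.unicode.org/charts/unihan.html   Unicode Consortium    ?         raw
cjk-decomp.txt   http://cjkdecomp.codeplex.com               ?                     ?         raw
```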

Common data format / structure:

After the various datasets from across the internet have been charted, the authors of datasets that don't have a clear license need to be contacted to see if they will open them up.

Then I want to try to define a common data structure for how the data in these sets should be laid out, and create Python scripts to convert them to that spec.

I want to be able to, as much as possible, have a structure such as


```
char    columnName    columnName
```

So CJK scripts such as Hanzi (Node), Cihai (Python), and Chinese::Vocab (Ruby) can reliably access the datasets via a common format.
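For instance, a row could look like this (the column names are just illustrative Unihan-style fields, not a decided schema):

```
char    kDefinition        kMandarin
好      good; to like      hǎo
你      you                nǐ
```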

albertolovell commented 10 years ago

Ok, I'll look this over and see how I can help. Will definitely need a few pointers from your end. I'll get back to you within 48 hours.

albertolovell commented 10 years ago

@tony Ok, I am going to do what I can from my end. I have someone giving a few pointers, but chances are I will need your input. Is it better to reach you here or by email?

tony commented 10 years ago

@beituo tony at git-pull d com

Are you on freenode or gtalk? Let me know via email.

nsonnad commented 10 years ago

Hey all, I'd love to help out with this if we can figure out where we can best spend our time. I think the ideal course of action would be to:

  1. Track down datasets and determine which ones we can use, asking people to open them up if necessary.
  2. Create a dedicated repo that collects said source files (or fetch them with a script, etc)
  3. Decide on a consistent, tabular structure for the datasets
  4. Write scripts to convert source files to desired data structure
  5. Other projects (Hanzi etc) can use the data in this repo as they see fit

I'm on Australia time so not sure how that gels with y'all, but feel free to shoot me an email: nikhil at theoldbeggar dot com (or tweet nsonnad)

tony commented 10 years ago

@beituo , @nsonnad , @cburgmer, @nieldlr :

I've been studying and looking this over a bit more. To make this into a "we" effort, I created a new org at https://github.com/cihai and #cihai on freenode. I'm working on documentation + creating a clear spec.

After doing this nearly full time for the past month and a half, I realize this is too much work for one person to do alone and keep the quality consistent.

As @nsonnad said:

Track down datasets and determine which ones we can use, asking people to open them up if necessary.

[I am going to update this issue with a link to where I'm keeping a table of this data. If you're reading this by email, come back to the issue when I reply that it's been updated.]

Create a dedicated repo that collects said source files (or fetch them with a script, etc)

https://github.com/cihai

https://github.com/cihai/cihaidata-unihan is an example. I'm still cleaning it up and organizing it; it's a work in progress. I'm considering decoupling all cihai-python-specific code from it, so it may just end up as a plain data package.

It includes a script/process.py that downloads the file and processes it into the datapackage format.

Rinse and repeat for other datasets. This gives us a template / boilerplate for creating consistent, high-quality CJK datasets.

Dataprotocols' docs show a Python script doing the processing. I think Python is the best choice, using only the standard library. The Unihan example I picked will have a download progress meter and support for Python 2.7 + 3.3.

I'm also considering making a script to generate a boilerplate for this. Dataprotocols has one of their own, but I want to make sure script/process.py works on Python 2.7 (with unicode_literals) and 3.3 and only uses the standard library. That will make sure process.py is supported on most machines' system Python.
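To make the script/process.py idea a bit more concrete, here's a minimal sketch of the shape I have in mind (Python 3 only for brevity; the Unihan URL, file names, field subset, and datapackage.json contents are my assumptions, and the real script would add the progress meter and 2.7 support mentioned above):

```python
"""Minimal sketch of a script/process.py: download Unihan, emit CSV + datapackage.json.

Standard library only. URLs, file names, and the field subset are assumptions.
"""
import csv
import io
import json
import os
import urllib.request
import zipfile

UNIHAN_URL = "http://www.unicode.org/Public/UNIDATA/Unihan.zip"  # assumed location
FIELDS = ["kDefinition", "kMandarin"]  # illustrative subset of Unihan fields


def download(url, dest="Unihan.zip"):
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)  # a real script would report progress here
    return dest


def process(zip_path, csv_path="data/unihan.csv"):
    rows = {}
    with zipfile.ZipFile(zip_path) as zf:
        # Unihan_Readings.txt lines look like: U+4E00<tab>kMandarin<tab>yī
        with io.TextIOWrapper(zf.open("Unihan_Readings.txt"), encoding="utf-8") as fh:
            for line in fh:
                if line.startswith("#") or not line.strip():
                    continue
                codepoint, field, value = line.rstrip("\n").split("\t", 2)
                if field in FIELDS:
                    rows.setdefault(codepoint, {})[field] = value

    os.makedirs(os.path.dirname(csv_path), exist_ok=True)
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["char"] + FIELDS)
        for codepoint in sorted(rows):
            char = chr(int(codepoint[2:], 16))  # "U+4E00" -> 一
            writer.writerow([char] + [rows[codepoint].get(f, "") for f in FIELDS])

    # Minimal datapackage.json descriptor in the dataprotocols style
    with open("datapackage.json", "w", encoding="utf-8") as out:
        json.dump({"name": "cihaidata-unihan",
                   "resources": [{"path": csv_path, "format": "csv"}]}, out, indent=2)


if __name__ == "__main__":
    process(download(UNIHAN_URL))
```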

I'm still deciding whether data/datafile.csv (the output produced by script/process.py) should be included in the repo for stable releases.

Decide on a consistent, tabular structure for the datasets

https://github.com/cihai/cihai-handbook#standards - basically, using the https://github.com/dataprotocols/dataprotocols standards.

Write scripts to convert source files to desired data structure

I want to complete the documentation, make things as clear as possible, and leave that open to everyone.

Other projects (Hanzi etc) can use the data in this repo as they see fit

It is a big enough undertaking that the datasets should be as universal as possible. If they are consistent across datasets, relational / table-based lookups can work with the data regardless of programming language. I see a win-win here.

I'll reply here when I have the documentation updated.