tony opened this issue 10 years ago
I am still hashing over things on my end. I will update here next week.
Hey @tony,
Sorry for the late reply here. I've been traveling around and on holiday. I'll look at this over the weekend when I'm back home! Looks exciting.
@nieldlr Hey, there's no need to read over that. I've been iterating over this often so it will probably take me another week or two until I have more.
As an update to what I'm experimenting with now:
Trying to normalize the raw datasets and make them consistent / compatible with relational / table-based lookups.
Take decomposition (cjk-decomp.txt, formerly known as groovy.csv) for example:
https://github.com/cburgmer/cjklib/blob/master/scripts/convertdecomposition.py is the script used to convert the decomposition data.
The result: https://github.com/cburgmer/cjklib/blob/master/cjklib/data/characterdecomposition.csv
Doing the conversion ahead of time saves expensive, time-consuming lookups later.
I want to rewrite cjklib's `convertdecomposition.py` to handle the new `cjk-decomp.txt` format. I may pick an easier example first; I'm trying to repeat the process with datasets such as https://github.com/nieldlr/Hanzi/tree/master/lib/dicts.
I will ping back when I have more to show.
@tony Hey, I've been following this repository for a while hoping that it would get pushed along. Do you have any foundational stuff that you need a hand with? My programming skills are novice at best, but I have a lot of extra time on my hands. Let me know, I would love to help out.
@beituo : That is awesome to hear!
I am trying to:

1. Create a list of all the CJK data sources and their licenses.
2. Create a Python script to convert them to an idiomatic, table-like format.
There is one coding task in particular I am looking to get off my chest:
Importance level: 99 · Difficulty: Medium-Hard · Time: Medium-Long
Character decomposition is one of the coolest aspects of Chinese characters (hanzi).
https://github.com/cburgmer/cjklib/blob/master/scripts/groovyset.csv is a CSV for decomposing Chinese characters.
https://github.com/cburgmer/cjklib/blob/master/scripts/convertdecomposition.py is a script that turns this set into a relational-friendly CSV file.
You can see it at https://github.com/cburgmer/cjklib/blob/master/cjklib/data/characterdecomposition.csv.
This one is hard and time consuming. The issue is that the decomposition set has changed from groovy.csv to https://github.com/nieldlr/Hanzi/blob/master/lib/dicts/cjk-decomp.txt. The instructions for the new format are at http://cjkdecomp.codeplex.com/wikipage?title=cjk-decomp&referringTitle=Home.
This only requires novice Python skills, but it takes time to wrap one's brain around how `cjk-decomp.txt` works. It's probably a superb learning exercise. I could help you with it if you want to have a go at it. A rough parsing sketch is below.
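To give a feel for the format: per the cjkdecomp docs linked above, each line is an entry along the lines of `好:a(女,子)` — a constituent (a character or a numeric id for an intermediate component), a colon, an operator code, and a parenthesized component list. Here is a minimal parsing sketch assuming that general shape; the exact operator codes and edge cases should be checked against the wiki page:

```python
import io
import re

# One entry per line: "constituent:operator(component,component,...)".
# The constituent may be a character or a numeric id for an intermediate
# component; the operator codes are described on the cjkdecomp wiki.
LINE_RE = re.compile(
    r"^(?P<constituent>[^:]+):(?P<op>[a-z0-9/]+)\((?P<components>[^)]*)\)\s*$"
)

def parse_line(line):
    """Parse one cjk-decomp.txt line into (constituent, operator, components)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized line: %r" % line)
    components = [c for c in match.group("components").split(",") if c]
    return match.group("constituent"), match.group("op"), components

def parse_file(path):
    """Parse the whole file, skipping blank lines."""
    with io.open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]
```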
Importance level: 99 · Difficulty: Easy · Time: Medium
Right now it's a tad ambiguous. The README.rst of Hanzi, the tops of the files in cjklib/data, and https://github.com/cburgmer/cjklib/issues/3 touch on this. We need to create a gist and put all this information in one place.
Compiling a list of CJK datasets: where they are from, the author, the license (if any), and whether each is the raw dataset or a derivative produced by a script to make it friendlier to parse. The issue is that there are a few CJK projects pulling in datasets from various places, and it's not readily apparent what the copyright on them is. Some of the places the data lives:
- https://github.com/nieldlr/Hanzi/tree/master/lib/dicts
- https://github.com/nieldlr/Hanzi/tree/master/lib/data
- https://github.com/cburgmer/cjklib/tree/master/cjklib/data
- https://github.com/bytesource/chinese_vocab/tree/master/hsk_data
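As an illustration, the chart could start as a simple table like the one below. The rows here are placeholders, and every license entry would need to be verified against the source before we rely on it:

| dataset | origin | author | license | raw or derived |
|---------|--------|--------|---------|----------------|
| Unihan | Unicode Consortium | Unicode, Inc. | Unicode Terms of Use (to verify) | raw |
| cjk-decomp.txt | http://cjkdecomp.codeplex.com | (to confirm) | (to confirm) | raw |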
Ideally, we want Hanzi, Cihai, chinese_vocab, etc. to all have access to the same datasets under a permissive license.
After the various datasets from across the internet have been charted, the authors of datasets that don't have a clear license need to be contacted to see if they will open them up.
Then, I want to try to define a common data structure for how the data in these sets should look, and create Python scripts to convert them to that spec.
I want, as much as possible, to have a structure such as:
char columnName columnName
That way, CJK projects such as Hanzi (Node), Cihai (Python), and Chinese::Vocab (Ruby) can reliably access the datasets via a common format (see the sketch below).
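A minimal sketch of emitting that layout with only the standard library. The column names and the sample row are hypothetical, since the real schema is still undecided (Python 3 shown for brevity; Python 2's csv module needs extra encoding work):

```python
import csv

# Hypothetical column names and sample row; the actual schema is TBD.
FIELDS = ["char", "kDefinition", "decomposition"]
ROWS = [
    ("好", "good, excellent, fine; well", "a(女,子)"),
]

with open("dataset.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(FIELDS)   # header: char, then one column per attribute
    writer.writerows(ROWS)
```

Tab-delimited here is just one option; the point is that any consumer, in any language, can read the same header row and rely on the same columns.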
Ok, I'll look this over and see how I can help. Will definitely need a few pointers from your end. I'll get back to you within 48 hours.
@tony Ok, I am going to do what I can from my end. I have someone giving a few pointers, but chances are I will need your input. Is it better to reach you here or by email?
@beituo tony at git-pull d com
Are you on freenode or gtalk? Let me know via email.
Hey all, I'd love to help out with this if we can figure out where we can best spend our time. I think the ideal course of action would be to:

1. Track down datasets and determine which ones we can use, asking people to open them up if necessary.
2. Create a dedicated repo that collects said source files (or fetch them with a script, etc.).
3. Decide on a consistent, tabular structure for the datasets.
4. Write scripts to convert source files to the desired data structure.
5. Other projects (Hanzi etc.) can then use the data in this repo as they see fit.
I'm on Australia time so not sure how that gels with y'all, but feel free to shoot me an email: nikhil at theoldbeggar dot com (or tweet nsonnad)
@beituo , @nsonnad , @cburgmer, @nieldlr :
I've been studying and looking this over a bit more. To make this into a "we" effort, I created a new org at https://github.com/cihai and #cihai on freenode. I'm working on documentation and creating a clear spec.
After doing this nearly full time for the past month and a half, I realize this is too much work for one person to do alone while keeping quality consistent.
As @nsonnad said:
> Track down datasets and determine which ones we can use, asking people to open them up if necessary.
[I am going to update this issue with a link to where I'm keeping a table of this data. If you're reading this by email, come back to the issue when I reply that it's updated.]
> Create a dedicated repo that collects said source files (or fetch them with a script, etc.)
https://github.com/cihai/cihaidata-unihan is an example. I'm still cleaning it up and organizing it; it's a work in progress. I'm considering decoupling all cihai-python-specific code from it, so it may end up as a plain data package.
It includes a `script/process.py` that downloads the file and processes it into the datapackage format.
Rinse and repeat for other datasets. This gives us a template / boilerplate for creating consistent, high-quality CJK datasets.
Dataprotocols' docs show a Python script doing the processing. I think Python is the best choice, using only the standard library. The unihan example I picked will have a download meter and support for Python 2.7 + 3.3.
I'm also considering making a script to generate boilerplate for this. Dataprotocols has one of their own, but I want to make sure `script/process.py` works on Python 2.7 (with `unicode_literals`) and 3.3, and uses only the standard library. That will make sure `process.py` is supported on most machines. A rough sketch of what the download step could look like is below.
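For discussion, here is a minimal stdlib-only sketch of the download step with a simple progress meter, runnable on both 2.7 and 3.3. The URL and paths are assumptions for illustration; the real script lives in cihaidata-unihan:

```python
from __future__ import print_function, unicode_literals

import sys
import zipfile

try:  # Python 2.7
    from urllib import urlretrieve
except ImportError:  # Python 3.3+
    from urllib.request import urlretrieve

# Assumed source URL and local paths, for illustration only.
UNIHAN_URL = "http://www.unicode.org/Public/UNIDATA/Unihan.zip"
ZIP_PATH = "Unihan.zip"
DATA_DIR = "data"

def report(block_count, block_size, total_size):
    """Crude download meter, redrawn in place on one line."""
    done = min(block_count * block_size, total_size)
    sys.stderr.write("\rdownloaded %d / %d bytes" % (done, total_size))

def main():
    urlretrieve(UNIHAN_URL, ZIP_PATH, reporthook=report)
    sys.stderr.write("\n")
    with zipfile.ZipFile(ZIP_PATH) as archive:
        archive.extractall(DATA_DIR)  # next step: convert to the datapackage CSV

if __name__ == "__main__":
    main()
```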
I'm still deciding whether `data/datafile.csv` (the output produced by `script/process.py`) should be included in the repo for stable releases.
> Decide on a consistent, tabular structure for the datasets
https://github.com/cihai/cihai-handbook#standards - basically, using the https://github.com/dataprotocols/dataprotocols standards.
> Write scripts to convert source files to desired data structure
I want to complete the documentation, make things as clear as possible, and leave that open to everyone.
> Other projects (Hanzi etc) can use the data in this repo as they see fit
This is a big enough undertaking that the datasets should be as universal as possible. If they are consistent across datasets, relational / table lookups can analyze the data regardless of programming language. I see a win-win here.
I'll reply here when I have the documentation updated.
This is just ideas I've been jotting: http://cihai.readthedocs.org/en/latest/spec.html.
I'm still bike-shedding this and also in the process of testing this spec on my own python library.
There are some areas where I'm going to plow forward and see what comes up, such as only using `.get`, then returning keys based on 'hits' returned from middleware, and a `.reverse` to find reverse lookups by decomposition, radical, definitions, etc. (see the sketch below). I think that will generate skepticism, but I want to put it to the test first. When I have more data / a working implementation, I can come back and articulate what the experience is.
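To make the `.get` / `.reverse` idea concrete, here is a purely hypothetical sketch; none of these names are settled, and the middleware interface is an assumption for discussion:

```python
class Cihai(object):
    """Hypothetical front-end that fans lookups out to middleware datasets."""

    def __init__(self, middlewares):
        self.middlewares = middlewares

    def get(self, char):
        """Ask each middleware about a character and merge the 'hits'."""
        hits = {}
        for mw in self.middlewares:
            hits.update(mw.get(char) or {})
        return hits

    def reverse(self, value):
        """Reverse lookup: find characters by decomposition, radical,
        definition, etc., by asking each middleware."""
        chars = set()
        for mw in self.middlewares:
            chars.update(mw.reverse(value) or ())
        return sorted(chars)
```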
Here are some things I think are sure; what do you think of these?

- Plugins can still be available under BSD or MIT licenses (IANAL).
Here are some things I could use help on:
Do you know of a recommendation / spec / standard / good example of a schema for returning info on Chinese character(s)? For a single character? For a string?
I have some scribblings in there, but that's all open. A strawman example is below.
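As a strawman only, a per-character response could look something like this; every field name here is invented for discussion (`kDefinition` borrows Unihan's naming), not taken from any existing spec:

```python
# Purely hypothetical response schema for a single-character lookup.
lookup_response = {
    "char": "好",
    "ucn": "U+597D",                         # Unicode codepoint
    "kDefinition": "good, excellent; well",  # Unihan-style definition field
    "decomposition": {"type": "a", "components": ["女", "子"]},
    "sources": ["unihan", "cjk-decomp"],     # which datasets contributed
}
```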
Edit: rewording.