trevorld / Hanzi_Stats

Anki plugin which calculates number of Hanzi you have learned so far.
GNU General Public License v3.0

Cantonese #22

Open Kirchheim opened 7 months ago

Kirchheim commented 7 months ago

Hi,

Is there a way to add some statistics on Cantonese? I am always amused to see 咗 among the Unlisted characters, and when you check it out in Hanzi Stats, it says "very uncommon character." But 咗 is about as common in Cantonese as 了 is in Mandarin. Funnily enough, the character is still missing from the cc-canto dictionary, but I think they will add it, since I reported it.

If nothing else is available, this could be a source: https://www.reddit.com/r/Cantonese/comments/62i3ud/most_common_cantonese_words_frequency_list/

I guess the characters making up the most common Cantonese words are a fair proxy for the most common Cantonese characters.

Anyways, it's a really nice tool, and thanks for the recent update and maintenance.

Maik

trevorld commented 7 months ago
Kirchheim commented 7 months ago

Do they write Cantonese with simplified characters in mainland China and traditional characters in Hong Kong?

yes

so 6,000 most common Cantonese characters may be nice (but maybe not necessary).

this is most probably true

Fork is not worth it, written Cantonese follows much of Mandarin

Only spoken Cantonese differs a lot (maybe also limited interest)

Maybe what would be useful is a list of Cantonese-only characters, disabled by default, that complements the 6,000 most common simplified characters and 6,000 most common traditional characters already implemented, for those who are curious.

If someone could find a "good" spreadsheet/table of the most common Cantonese characters (not just words) I would consider it.

I will keep a lookout and forward it when I find it

trevorld commented 7 months ago

I am always amused to see 咗 among the Unlisted characters, and when you check it out in Hanzi Stats, it says "very uncommon character."

FYI: You can configure which URL is used to look up characters in Hanzi Stats e.g. you could configure it to use https://cccanto.org/search.php?q= instead of the default http://hanzicraft.com/character/ and it should then look up clicked-on characters in cc-canto...
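As a rough illustration of what switching the lookup URL amounts to, here is a minimal sketch. It assumes the clicked character is simply percent-encoded and appended to the configured prefix; `lookup_url` and the two constant names are hypothetical and are not taken from the Hanzi Stats source.

```python
from urllib.parse import quote

# The two prefixes mentioned above. How Hanzi Stats joins the prefix and the
# character internally is an assumption made for this sketch.
HANZICRAFT_PREFIX = "http://hanzicraft.com/character/"
CCCANTO_PREFIX = "https://cccanto.org/search.php?q="


def lookup_url(prefix: str, character: str) -> str:
    """Append the percent-encoded (UTF-8) character to the configured prefix."""
    return prefix + quote(character)


if __name__ == "__main__":
    print(lookup_url(HANZICRAFT_PREFIX, "咗"))
    print(lookup_url(CCCANTO_PREFIX, "咗"))
```

Because only the prefix changes, a Cantonese deck could point at cc-canto while a Mandarin deck keeps the default.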

Maybe what would be useful is a list of Cantonese only characters

That could maybe work, although it might be a bit awkward if it is a big list that mixes really common Cantonese with merely not-super-rare Cantonese. I sometimes try to study all the characters in a category, and I am not sure how much we'd want to encourage studying some of the less common Cantonese-only characters instead of the more useful Simplified/Traditional characters also used in Cantonese?

Kirchheim commented 7 months ago

FYI: You can configure which URL is used to look up characters in Hanzi Stats e.g. you could configure it to use https://cccanto.org/search.php?q= instead of the default http://hanzicraft.com/character/ and it should then look up clicked-on characters in cc-canto…

I think I will go for that in my Cantonese set, that’s a very flexible approach - much appreciated, thanks.


Kirchheim commented 7 months ago

Hi,

interesting development, got a reply from Words.hk http://words.hk/, so would this list work? https://words.hk/faiman/analysis/

Maik

PS

The full email from Words.hk http://words.hk/ is below for details.

Words.hk attached a file, but its size (6.5 MB) exceeds the GitHub limit, so my first email with that file attached bounced.

So here is a dropbox link to the file: https://www.dropbox.com/scl/fi/g8oos94vlcxpadzm2mjsr/lihkg-monthly-wordfreqs.csv.xz?rlkey=i6x224o4vkbqzhczockxoi5ay&dl=0

Hi Maik,

Thanks for your inquiry. Are these lists what you are looking for? https://words.hk/faiman/analysis/ The frequency list is based on a relatively small corpus and contains few authors, but hopefully it's better than using the Mandarin lists for Cantonese purposes.

Internally, we also have compiled word frequencies from "LIHKG" (a Hong Kong forum), which has order of magnitudes more data, but due to the audience there, the stuff tends to be on the vulgar side :D I'll attach the data in case that helps you. (I don't think we "own" this data at all, so as far as licensing goes, words.hk disclaims any ownership and responsibility for this one.)

trevorld commented 7 months ago

would this list work? https://words.hk/faiman/analysis/

The https://words.hk/faiman/analysis/charcount/ list looks like it may work after removing punctuation and non-Hanzi glyphs and re-sorting based on the second column.
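A minimal sketch of that clean-up, assuming the list can be exported as `character,count` lines (the `咗,8529` form quoted later in this thread). The function names and the choice of Unicode blocks are assumptions of this sketch, and the counts for the non-Hanzi rows in the sample are made up.

```python
import csv
import io


def is_hanzi(ch: str) -> bool:
    """True for a single code point in the main CJK ideograph blocks."""
    if len(ch) != 1:
        return False
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0x20000 <= cp <= 0x2A6DF   # Extension B
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
    )


def clean_charcount(text: str) -> list[tuple[str, int]]:
    """Drop punctuation and non-Hanzi rows, then sort by the count column."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue
        ch, count = row[0].strip(), row[1].strip()
        if is_hanzi(ch) and count.isascii() and count.isdigit():
            rows.append((ch, int(count)))
    rows.sort(key=lambda r: r[1], reverse=True)  # highest count first
    return rows


if __name__ == "__main__":
    # Only the 咗 and 左 counts come from this thread; the others are invented.
    sample = "左,7146\n。,9999\n咗,8529\na,5\n"
    for ch, count in clean_charcount(sample):
        print(ch, count)
```

The position of each character in the sorted result is then its frequency rank, which is what a "most common N characters" category needs.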

So here is a dropbox link to the file...

This appears to be a word frequency list. It is more convenient to have a character frequency list. I guess if it includes every "word" it may not be impossible to convert it to a character frequency list but that would take more work and we'd need to know that words that are contained in longer words aren't double counted e.g. if we count the "word" 一場嚟到 do we not also count the "word" 一場?
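To make the double-counting concern concrete, here is a small sketch of the naive conversion. `char_freq_from_words` is a hypothetical helper and the counts are made up; the result is only right if every occurrence in the corpus was assigned to exactly one "word".

```python
from collections import Counter


def char_freq_from_words(word_counts: dict[str, int]) -> Counter:
    """Credit every character in a word with that word's count."""
    totals: Counter = Counter()
    for word, count in word_counts.items():
        for ch in word:
            totals[ch] += count
    return totals


if __name__ == "__main__":
    # Made-up counts for illustration only.
    word_counts = {"一場": 10, "一場嚟到": 3}
    totals = char_freq_from_words(word_counts)
    # If the 10 occurrences of 一場 already include the 3 inside 一場嚟到,
    # the total for 一 overstates the true figure by 3.
    print(totals["一"], totals["嚟"])
```

Whether the sum is correct therefore depends on how the word list was segmented, which is the open question above.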

Kirchheim commented 7 months ago

I think the word frequency list is just something they tossed in since it was available, and they are not sure what I was looking for.

Well I think they have two collections of Cantonese material they have analyzed. I think https://words.hk/faiman/analysis/ is from a smaller corpus of Cantonese material and the one they e-mailed you was from a different, larger corpus of Cantonese material...

Kirchheim commented 7 months ago

Before putting any effort into the character frequency list, I suggest waiting for a reply from Words.hk http://words.hk/ on a quick check I raised with them today:

Hi,

yes, indeed, these are the lists I am looking for.

A quick check on the character frequency list for 咗 gives 咗,8529

Hmm, would you agree? I wouldn’t.

Is the corpus based on written only texts?

Thanks, Maik

The point is, if these are character frequencies for written only material, there is no need to build a separate Cantonese set of tables.

The spoken Cantonese, or more appropriately the colloquial Cantonese and its characters are what is of interest to Cantonese learners.

Let’s see, maybe I got it wrong.

Maik

On 07.12.2023 at 21:54, Maik Braun @.***> wrote:

I think the word frequency list is just something they tossed in since it was available, and they are not sure what I was looking for. Word frequency lists I guess are used by some students to prioritize vocabulary learning, but that is not really the focus of hanzi stats, right?

Myself I don’t use word frequency lists for orientation, instead I decide on a text, and then I learn all the characters in that text, and the words only with their meaning in that context

I just forwarded it for interest, you never know when it will come in handy or who might be interested….

So that leaves the https://words.hk/faiman/analysis/ to work with


trevorld commented 7 months ago

1) It seems https://words.hk/faiman/analysis/ may be from the Hong Kong Cantonese Corpus (HKCanCor) https://github.com/fcbond/hkcancor, which appears to be based on transcribed conversations. 2) The one you forwarded me seems to be from "LIHKG" (a Hong Kong forum), which would have been written.

The spoken Cantonese, or more appropriately the colloquial Cantonese and its characters are what is of interest to Cantonese learners.

I don't know that much about Cantonese so I can't really evaluate this. I know there is a bunch of material written in "Cantonese"...

Kirchheim commented 6 months ago

Hi,

the reply from Words.hk is

咗,8529 seems about right. Note that it is often also written as 「左」 because the two characters are homophones.

The corpus is based on written only texts.

Well, frankly, I have 600 Anki cards that show up when searching for 咗, and one of the most basic sentences in Cantonese is 食咗啦 ("I have eaten").

If I prioritize learning 8000 more frequent characters before turning to this one, it will be a long time before I can read my basic cards entirely.

To summarize - I don’t think this Cantonese frequency list adds value to learning characters needed for spoken Cantonese. As for written Cantonese, it may differ somewhat from Mandarin, but the difference may not be worth anything when it comes to studying characters.

So, based on this short experience of finding a frequency list for Cantonese, I am inclined to drop the idea entirely.

Regards, Maik

Kirchheim commented 6 months ago

Hi,

would something like this work instead? http://www.cantonese.sheik.co.uk/scripts/masterlist.htm?action=cantonese

I am afraid the frequency classification is only by level, and not scientific, as explained on the site. I posted a question in the forum asking whether anything with frequency is available.

Just out of curiosity: he gives the character 咗 an L2, which is more in line with my own observations.

Maik

trevorld commented 6 months ago

would something like this work instead? http://www.cantonese.sheik.co.uk/scripts/masterlist.htm?action=cantonese

I don't see a straightforward way to get separate lists of Cantonese by level. I don't fully understand what "level 0" is supposed to represent. Are the "level 0" entries metadata errors?

Kirchheim commented 6 months ago

ok, let’s see if there are any useful replies to my post on their forum

Kirchheim commented 6 months ago

https://words.hk/faiman/analysis/ as a basis for Cantonese character frequency list

Hi,

with the below clarifications, it is fair to say that the above list is suitable for a Cantonese character frequency list:

  1. Words.hk http://words.hk/ confirmed that it is from a written corpus.
  2. Hambaanglaang told me they have worked with Words.hk http://words.hk/ on a list:

"I see that you are in touch with words.hk. They have the updated list of what we started that has been sent to you. Our Readers were graded both by Cantonese (for the Cantonese story) and CEFR for the English text."

Additional info from Hambaanglaang: https://hambaanglaang.hk/software-tool/#LS%20Online Kindly Scroll down to the text editor software. Have a look please at its manual on how to use it. Our HBL graded level lists are built into this text editor.

Nouns are typically not included in the frequency calculations as they are hard to classify and 10% of the words outside the graded level are accepted as necessary for the storyline to be coherent at the lower levels.
Typically, if you paste the Box story's text into the editor and select the level, you should get an 85-plus percent score as within the level it's written for. That's how we did it. Keeping it cheap and cheerful helped us produce 2 stories a week.
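The kind of score described here can be approximated as the share of a text's words that appear in a graded list. This is only a sketch under that assumption: the actual calculation in the Hambaanglaang editor is not documented in this thread, `level_coverage` is a hypothetical name, the text is assumed to be segmented into words already, and the word list below is invented.

```python
def level_coverage(tokens: list[str], level_words: set[str]) -> float:
    """Percentage of tokens that appear in the graded word list."""
    if not tokens:
        return 0.0
    in_level = sum(1 for token in tokens if token in level_words)
    return 100.0 * in_level / len(tokens)


if __name__ == "__main__":
    # Hypothetical pre-segmented sentence and an invented level list.
    tokens = ["我", "食", "咗", "飯", "啦"]
    level_words = {"我", "食", "咗", "啦"}
    print(level_coverage(tokens, level_words))
```

A text would then count as fitting a level when this figure reaches the threshold mentioned above.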

I had asked Words.hk http://words.hk/ why the character 咗 has only a rank of 8500 but appears everywhere in spoken and written Cantonese. I got the following reply from Words.hk http://words.hk/:

The word frequency is the number of occurrences seen in the corpus, not the rank. What I'm seeing is that 咗 is 8529 and 左 is 7146, adding up to 15xxx, which is < 30 on the relative scale if you sort the list.
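The distinction between a count and a rank, and the effect of adding the 左 spellings to 咗, can be sketched as follows. The two real counts are the ones quoted in the reply (8529 + 7146 = 15675); the other entries are invented placeholders, and `rank_of` and `merge_variants` are hypothetical helpers. Note that the sketch folds in every 左, as the reply does, even though 左 also has uses of its own.

```python
def rank_of(counts: dict[str, int], char: str) -> int:
    """1-based rank of char when characters are sorted by descending count."""
    target = counts[char]
    return 1 + sum(1 for c in counts.values() if c > target)


def merge_variants(counts: dict[str, int], canonical: str, variant: str) -> dict[str, int]:
    """Return a copy in which the variant's count is added to the canonical form."""
    merged = dict(counts)
    merged[canonical] = merged.get(canonical, 0) + merged.pop(variant, 0)
    return merged


if __name__ == "__main__":
    # 咗 and 左 use the counts quoted above; 甲, 乙 and 丙 are placeholders
    # with made-up counts standing in for the rest of the list.
    counts = {"咗": 8529, "左": 7146, "甲": 20000, "乙": 9000, "丙": 100}
    merged = merge_variants(counts, "咗", "左")
    print(merged["咗"])            # combined count
    print(rank_of(counts, "咗"))   # rank before merging
    print(rank_of(merged, "咗"))   # rank after merging
```

With the full list in place of the placeholders, the same two steps would show where 咗 actually lands once it is sorted.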

This means that the list from words.hk http://words.hk/ is the most suitable.

Best, Maik

Kirchheim commented 6 months ago

https://words.hk/faiman/analysis/ best basis for Cantonese character frequency list

background info on this project Proceedings http://www.lrec-conf.org/proceedings/lrec2022/workshops/DCLRL/pdf/2022.dclrl-1.7.pdf

Hi,

  1. Words.hk confirmed that the frequency list is from written corpus.
  2. Hambaanglaang https://hambaanglaang.hk/ told me they have worked with Words.hk for the list:

"I see that you are in touch with words.hk. They have the updated list of what we started that has been sent to you. Our Readers were graded both by Cantonese (for the Cantonese story) and CEFR for the English text."

Additional info from Hambaanglaang: https://hambaanglaang.hk/software-tool/#LS%20Online Kindly Scroll down to the text editor software. Have a look please at its manual on how to use it. Our HBL graded level lists are built into this text editor.

Nouns are typically not included in the frequency calculations as they are hard to classify and 10% of the words outside the graded level are accepted as necessary for the storyline to be coherent at the lower levels.
Typically, if you paste the Box story's text into the editor and select the level, you should get an 85-plus percent score as within the level it's written for. That's how we did it. Keeping it cheap and cheerful helped us produce 2 stories a week.

I had asked Words.hk, why the character 咗 has only a rank of 8500 but appears everywhere in spoken and written Cantonese. I got the following reply from Words.hk : The word frequency is the number of occurrences seen in the corpus, not the rank. What I'm seeing is that 咗 is 8529 and 左 is 7146, adding up to 15xxx, which is < 30 on the relative scale if you sort the list.

Maik
