hexenq / kuroshiro

Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
https://kuroshiro.org
MIT License

TypeError: str is not iterable (Edgecase) #91

Open dlinx opened 2 years ago

dlinx commented 2 years ago

I'm getting the following error while using kuroshiro, but only in some cases. About 90% of the time it does not throw at all. I do not have the input that triggers it, so I cannot reproduce it on demand.

Stacktrace

TypeError: str is not iterable
    at toRawHiragana (/server/node_modules/kuroshiro/lib/util.js:177:14)
    at /server/node_modules/kuroshiro/lib/core.js:225:88
    at Generator.next (<anonymous>)
    at asyncGeneratorStep (/server/node_modules/kuroshiro/lib/core.js:10:103)
    at _next (/server/node_modules/kuroshiro/lib/core.js:12:194)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

matthieu-locussol commented 1 year ago

This issue occurs when converting a sentence that contains the ・ (U+30FB, katakana middle dot) character. I chose to replace it with the · (U+00B7, middle dot) character only for the duration of the conversion, and I no longer hit the problem.

Here is a minimal example that reproduces the problem (I hit it in furigana mode, but it might occur in other modes too):

const Kuroshiro = require("kuroshiro");
const KuromojiAnalyzer = require("kuroshiro-analyzer-kuromoji");

const sample = async () => {
  const sentence1 = "映画『ジュラシック·パーク』の恐竜は本物そっくりだ。";
  const sentence2 = "映画『ジュラシック・パーク』の恐竜は本物そっくりだ。";

  const kuroshiro = new Kuroshiro();
  await kuroshiro.init(new KuromojiAnalyzer());

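  // convert() is async, so await it to surface the error at the call site.
  await kuroshiro.convert(sentence1, { mode: "furigana", to: "hiragana" }); // Does not throw (uses U+00B7)
  await kuroshiro.convert(sentence2, { mode: "furigana", to: "hiragana" }); // Throws (uses U+30FB)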
};

sample();

You could use two helper functions to convert back and forth around the call:

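// Swap the katakana middle dot (U+30FB) for a plain middle dot (U+00B7) before converting.
const sanitizeJapaneseSentence = (sentence: string) => sentence.replace(/・/g, '·');
// Swap it back afterwards. Note: this also rewrites any U+00B7 that was already in the input.
const unsanitizeJapaneseSentence = (sentence: string) => sentence.replace(/·/g, '・');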
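
To make the round trip harder to forget, the two steps could be wrapped around the conversion call itself. The following is only a sketch of that idea: `convertWithDotWorkaround`, `sanitize`, `restore` and the `Convert` type are hypothetical names, not part of kuroshiro, and the converter is passed in as a parameter so the snippet does not assume anything about the library beyond an async function from string to string.

```typescript
// Sketch only: these names are illustrative and not part of kuroshiro.
const KATAKANA_MIDDLE_DOT = "\u30FB"; // the character that triggers the error
const MIDDLE_DOT = "\u00B7"; // the stand-in used during conversion

// Any async text converter, for example a closure around kuroshiro.convert.
type Convert = (text: string) => Promise<string>;

// Replace every U+30FB with U+00B7.
const sanitize = (text: string): string =>
  text.split(KATAKANA_MIDDLE_DOT).join(MIDDLE_DOT);

// Replace every U+00B7 with U+30FB (including any that were in the original input).
const restore = (text: string): string =>
  text.split(MIDDLE_DOT).join(KATAKANA_MIDDLE_DOT);

// Sanitize, run the supplied converter, then restore the original dot.
async function convertWithDotWorkaround(
  text: string,
  convert: Convert
): Promise<string> {
  const converted = await convert(sanitize(text));
  return restore(converted);
}
```

With the setup from the reproduction above, this would be called as `await convertWithDotWorkaround(sentence2, (s) => kuroshiro.convert(s, { mode: "furigana", to: "hiragana" }))`. The caveat is the same as for the two helpers: a U+00B7 that was genuinely present in the input comes back as U+30FB, so this is only safe if your text does not mix the two characters.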

Hope this can help!