hightman / scws

An open-source, free, simple Chinese word segmentation system; a top choice for word segmentation in PHP!
http://www.xunsearch.com/scws/
Other
1.66k stars 348 forks

Strange behavior of attr in the top_word struct #79

Closed 1371030 closed 1 year ago

1371030 commented 1 year ago

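struct scws_topword { char *word; float weight; short times; char attr[2]; scws_top_t next; }; In the get_words function, attr is copied by value with strncpy(top->attr, cur->attr, 2);. When calling it from another language, I found that the length of attr varies from 1 to 6, and garbage characters are displayed whenever the length exceeds 2. After changing the attr array in scws_topword to 3 elements, the reported length never exceeds 2 in any case. The get_tops function does not show this behavior.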
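The length pattern reported here is consistent with how strncpy is specified: when the source is at least as long as the count, no terminating '\0' is written, and when it is shorter, the remainder is padded with '\0'. The following is a minimal sketch of that behavior, not code from scws; ATTR_SIZE and attr_is_terminated are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ATTR_SIZE 2  /* assumption: mirrors char attr[2] in struct scws_topword */

/*
 * Copy src into a 2-byte field with strncpy(field, src, 2), as quoted
 * in the issue, then report whether a '\0' ended up inside the field.
 * Returns 1 if the field is NUL-terminated, 0 if it is not.
 */
static int attr_is_terminated(const char *src)
{
    char field[ATTR_SIZE];

    assert(src != NULL);

    /* strncpy adds no '\0' when strlen(src) >= ATTR_SIZE. */
    strncpy(field, src, ATTR_SIZE);

    /* memchr stays within the ATTR_SIZE bytes of the field, unlike strlen. */
    return memchr(field, '\0', ATTR_SIZE) != NULL;
}
```

Under this sketch, a one-letter attribute such as "n" ends up terminated inside the field, while a two-letter attribute such as "vn" or "nr" fills both bytes and leaves no terminator for strlen to find.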

1371030 commented 1 year ago

The number below each line is the length of attr. With get_tops the result matches the command-line call and the attr length is normal.

电影 n 25.0200004577637(6)
1
创作 vn 14.8199996948242(3)
2
版权 n 14.7600002288818(3)
1
专有 vn 13.9399995803833(2)
2
陈凯歌 nr 11.8699998855591(1)
2

[root@localhost ~]# scws -r /etc/scws/rules.utf8.ini -d /usr/share/scws/dict.utf8.xdb -c utf8 -I -A -t 5 -a~v -i 110.txt
No. WordString Attr Weight(times)

  1. 电影 n 25.02(6)
  2. 创作 vn 14.82(3)
  3. 版权 n 14.76(3)
  4. 专有 vn 13.94(2)
  5. 陈凯歌 nr 11.87(1)
1371030 commented 1 year ago

After modifying get_words in scws

a segmentation fault is reported.

1371030 commented 1 year ago

I modified scws_cmd in the cli to print the length of attr returned by get_words, and there is indeed a problem. Patch file:

//--- scws-1.2.3/cli/scws_cmd.c.orig 2013-01-06 13:39:51
//+++ scws-1.2.3/cli/scws_cmd.c 2022-12-22 20:42:38
//@@ -286,6 +286,25 @@
// fprintf(fout, "EMPTY records!\n");
// }
//
//+ fprintf(fout, "No. WordString               Attr  Weight(times)\n");
//+ fprintf(fout, "-------------------------------------------------\n");
//+ if ((top = xtop = scws_get_words(s, attr)) != NULL)
//+ {
//+     tlimit = 1;
//+     while (xtop != NULL)
//+     {
//+         fprintf(fout, "%02d. %-24.24s %-4.2s  %.2f(%d) %4d-\n",
//+             tlimit, xtop->word, xtop->attr, xtop->weight, xtop->times, strlen(xtop->attr));
//+         xtop = xtop->next;
//+         tlimit++;
//+     }
//+     scws_free_tops(top);
//+ }
//+ else
//+ {
//+     fprintf(fout, "EMPTY records!\n");
//+ }
//+
// if (xmode & XMODE_STAT_FILE)
//     free(str);
// }

The last column of the output is the length of attr:

  1. 研究 vn 4.45(1) 5-
  2. 生命科学 n 7.37(1) 1-
  3. 北京 ns 6.35(1) 5-
  4. 大学生 n 4.70(1) 1-
  5. 喝 vn 0.00(1) 5-
  6. 进口 vn 4.87(1) 5-
  7. 红酒 n 6.17(1) 1-
hightman commented 1 year ago

It is at most 2 bytes; you cannot use it directly as a string.
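For callers that do need a string, for example a binding written in another language, one option is to copy the two bytes into a slightly larger buffer and terminate it there. This is a minimal sketch, assuming the caller owns a 3-byte buffer; attr_to_cstr and ATTR_SIZE are hypothetical names and not part of the scws API.

```c
#include <assert.h>
#include <string.h>

#define ATTR_SIZE 2  /* assumption: mirrors char attr[2] in struct scws_topword */

/*
 * Hypothetical helper: copy the raw 2-byte attr field into out, which
 * must hold at least ATTR_SIZE + 1 bytes, and NUL-terminate it.
 * Returns out so the call can be used directly as a string argument.
 */
static const char *attr_to_cstr(const char attr[ATTR_SIZE],
                                char out[ATTR_SIZE + 1])
{
    assert(attr != NULL && out != NULL);

    memcpy(out, attr, ATTR_SIZE);  /* exactly two bytes, never more */
    out[ATTR_SIZE] = '\0';         /* terminator lives in the caller's buffer */
    return out;
}
```

A one-letter attribute already carries a '\0' in its second byte, so the copy yields a one-character string; a two-letter attribute yields a two-character string.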

1371030 commented 1 year ago

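get_tops also copies 2 bytes and does not show this behavior. I don't understand why get_words produces attr lengths that exceed 2 bytes and vary from entry to entry.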
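One possible reading of the varying lengths, offered as an assumption and not confirmed anywhere in this thread: on common 32-bit and 64-bit ABIs, attr[2] sits directly in front of the next pointer, so an unbounded read past attr continues into pointer bytes that differ from node to node. The sketch below only inspects struct layout with offsetof; topword2 and topword3 are local mirrors of the struct quoted in this issue, not the scws definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirror of the quoted struct, with attr[2] as in the original. */
struct topword2 {
    char *word;
    float weight;
    short times;
    char attr[2];
    struct topword2 *next;
};

/* Same layout but with attr[3], as tried by the reporter. */
struct topword3 {
    char *word;
    float weight;
    short times;
    char attr[3];
    struct topword3 *next;
};

/* Number of bytes between the end of attr[2] and the start of next. */
static size_t gap_after_attr2(void)
{
    size_t attr_end = offsetof(struct topword2, attr) + 2;

    assert(offsetof(struct topword2, next) >= attr_end);
    return offsetof(struct topword2, next) - attr_end;
}

/* Number of bytes between the first two attr bytes and next, with attr[3]. */
static size_t gap_after_attr3(void)
{
    size_t two_chars_end = offsetof(struct topword3, attr) + 2;

    assert(offsetof(struct topword3, next) >= two_chars_end);
    return offsetof(struct topword3, next) - two_chars_end;
}
```

With attr[3], at least one non-pointer byte follows the two copied characters. Whether that byte is zero depends on how the node is allocated and initialized, which this sketch does not examine.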

hightman commented 1 year ago

The structures are different.

l1t1 commented 1 year ago

strlen only works on strings terminated by '\0'; xtop->attr is just a two-element char array.
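Following that point, the length can be measured without leaving the array by searching for '\0' within its two bytes only. This is a minimal sketch; attr_len, format_attr, and ATTR_SIZE are hypothetical names introduced here, and the output format is just an illustration. It also prints the length with %zu, since strlen-style lengths are size_t and do not match the %d conversion used in the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define ATTR_SIZE 2  /* assumption: mirrors char attr[2] in struct scws_topword */

/* Length of attr, never reading past its ATTR_SIZE bytes. */
static size_t attr_len(const char attr[ATTR_SIZE])
{
    const char *nul;

    assert(attr != NULL);
    nul = memchr(attr, '\0', ATTR_SIZE);
    return nul != NULL ? (size_t)(nul - attr) : (size_t)ATTR_SIZE;
}

/*
 * Write "attr(len)" into buf. The %.*s precision limits how many bytes
 * printf reads from attr, so no terminator is required.
 * Returns the value returned by snprintf.
 */
static int format_attr(char *buf, size_t bufsize, const char attr[ATTR_SIZE])
{
    size_t len = attr_len(attr);

    return snprintf(buf, bufsize, "%.*s(%zu)", (int)len, attr, len);
}
```

The precision in the patch's own %-4.2s conversion bounds the read in the same way, which is why the attr column printed correctly while the strlen column did not.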