pkumod / gAnswer

A KBQA system based on DBpedia.
http://ganswer.gstore-pku.com
BSD 3-Clause "New" or "Revised" License
376 stars 98 forks

Questions about two pieces of code in the Chinese branch #35

Closed Traeyee closed 4 years ago

Traeyee commented 4 years ago

https://github.com/pkumod/gAnswer/blob/pkubase/src/qa/extract/EntityRecognitionCh.java:262 processedString

    public static Pair<String,List<Word>> processedString(String s)
    {
        List<Word> ret=new ArrayList<>();
        String sentence = "";
        int flag=0;
        String word="";
        for (int i=0;i<s.length();i++)
        {
            if (s.charAt(i)=='{')
            {
                flag=1;
                continue;
            }
            if (s.charAt(i)=='}')
            {
                if (word.length()<=2)
                {
                    sentence+=word;
                    word="";
                    flag=0;
                    continue;
                }
                int FLAG=-1;
                for (Word j:ret)
                    if (word.equals(j.word)) 
                        FLAG=j.pos;
                if (FLAG==-1)
                {
                    flag=0;
                    ret.add(new Word(word,1,ret.size()+1));
                    word="";
                    sentence+=intToCircle(ret.size());
                    continue;
                }
                else
                {
                    flag=0;
                    word="";
                    sentence+=intToCircle(FLAG);
                    continue;
                }
            }
            if (flag==0) sentence+=s.charAt(i);
            if (flag==1) word=word+s.charAt(i);
        }
        return new Pair<String,List<Word>>(sentence,ret);
    }

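My first question is about if (word.length()<=2) in processedString above. Why is a word of two characters or fewer not replaced with a slot? Is this an empirical heuristic? I would have expected the usual approach to filter out only single-character entities, since two-character entities are also fairly common. I also noticed that some of the entities that are not slotted here are handled later in the reprocess function.

My second question is about a possible problem in reprocess, namely if (tmp.length()>4) flag=0; on line 312. As written this checks chars > 4, but from reading the code I suspect the original intent was words > 4. In particular, it does not seem to line up with for(int len=4;len>=1;len--) and for (int j=i+1;j<i+len;j++)isValid[j]=-1;. Could this cause a bug when len>=2? I assume len here is meant as a stride/window size?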
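To make the first question concrete, here is a minimal sketch of the same slotting loop with the length cutoff exposed as a parameter, so the effect of 2 versus 1 can be compared directly. It is not the repository's code: SlotThresholdDemo, slotEntities and maxUnslottedLength are hypothetical names, entities are kept as plain strings instead of Word objects, and intToCircle is an assumed stand-in (the real one is not shown in this issue) that maps 1..20 to the circled digits starting at U+2460. The Chinese test strings are written as Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.ArrayList;
import java.util.List;

public class SlotThresholdDemo {

    // Assumption: stand-in for the project's intToCircle, which is not shown
    // in the issue. Maps 1..20 to the circled digits U+2460..U+2473.
    static String intToCircle(int n) {
        if (n < 1 || n > 20) {
            throw new IllegalArgumentException("supported range is 1..20");
        }
        return String.valueOf((char) (0x2460 + n - 1));
    }

    // Simplified variant of processedString. Mentions are marked as {mention}
    // in the input. A mention whose length is <= maxUnslottedLength is copied
    // back verbatim; a longer one is replaced by a circled slot number and
    // recorded in `entities` (repeated mentions reuse the same slot).
    static String slotEntities(String s, int maxUnslottedLength, List<String> entities) {
        StringBuilder sentence = new StringBuilder();
        StringBuilder word = new StringBuilder();
        boolean inBraces = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                inBraces = true;
                continue;
            }
            if (c == '}') {
                String w = word.toString();
                word.setLength(0);
                inBraces = false;
                if (w.length() <= maxUnslottedLength) {
                    sentence.append(w);          // too short: left unslotted
                    continue;
                }
                int pos = entities.indexOf(w) + 1;   // 0 means "not seen yet"
                if (pos == 0) {
                    entities.add(w);
                    pos = entities.size();
                }
                sentence.append(intToCircle(pos));
                continue;
            }
            if (inBraces) {
                word.append(c);
            } else {
                sentence.append(c);
            }
        }
        return sentence.toString();
    }

    public static void main(String[] args) {
        String beijing = "\u5317\u4EAC";            // "Beijing", 2 characters
        String pku = "\u5317\u4EAC\u5927\u5B66";    // "Peking University", 4 characters
        String input = "{" + pku + "}\u5728{" + beijing + "}";

        List<String> cutoffTwo = new ArrayList<>();
        System.out.println(slotEntities(input, 2, cutoffTwo) + " " + cutoffTwo);

        List<String> cutoffOne = new ArrayList<>();
        System.out.println(slotEntities(input, 1, cutoffOne) + " " + cutoffOne);
    }
}
```

With a cutoff of 2 (the behaviour in the quoted code), only 北京大学 becomes a slot and the two-character mention 北京 is written back into the sentence unchanged. With a cutoff of 1, both mentions are slotted. Whether the extra two-character slots help or hurt downstream matching is exactly the empirical question raised above, and this sketch does not settle it.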

Traeyee commented 4 years ago

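After looking through the Chinese branch, it still feels like it was written in a bit of a hurry. That said, many of the tricks do solve real problems.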