Using FastText

alangyun commented 4 years ago

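I followed the sample provided for FastText to test automatic classification, but training keeps failing with Exception in thread "main" java.lang.NoClassDefFoundError: kotlin/UInt. What causes this? The project is a Maven project, and I added the FastText dependency to it. Do I also need to add Kotlin compilation? The Maven dependency: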

<dependency>
    <groupId>com.mayabot.mynlp</groupId>
    <artifactId>fastText4j</artifactId>
    <version>3.1.0</version>
</dependency>

That is the only thing I added to the Maven configuration; nothing else was included. Do I need to add other packages, or configure a Kotlin dependency (compilation)? The Java test code:

public static void testClassify() throws Exception {
    // Prepare the data
    long t1=System.currentTimeMillis(), t2=0;
    String rootPath= "D:\\template\\libsvm-3.24\\datas\\jingjian";
    System.out.println("Preparing and formatting the training set, directory: "+rootPath);
    File root = new File(rootPath);
    StringBuilder sb= new StringBuilder();
    File[] menuPaths = root.listFiles();
    for (File path : menuPaths) {
        if (!path.isDirectory())
            continue;
        // Get the files under the category directory
        File[] trainFiles = path.listFiles();
        if (trainFiles.length <= 0)
            continue;
        sb.append("__label__"+path.getName());
        for(File file: trainFiles) {
            String splitStr=FeatureSplitter.splitToStr(file);
            if(splitStr!=null && splitStr.length()>0)
                sb.append(" ").append(splitStr);
        }
        sb.append("\n");// one line per directory
    }
    // Write to file
    String trainFilename=buildFileName(rootPath,"fast_train");
    t2= System.currentTimeMillis();
    System.out.println("Training set formatted in "+formatTime(t2-t1)+", writing to file "+trainFilename);
    t1=t2;
    writeToFile(trainFilename,sb.toString());
    t2= System.currentTimeMillis();
    System.out.println("Training set written in "+formatTime(t2-t1)+".");
    t1=t2;

    // Train
    System.out.println("Starting training");
    File trainFile = new File(trainFilename);
    InputArgs inputArgs = new InputArgs();
    inputArgs.setLoss(LossName.softmax);
    inputArgs.setLr(0.1);
    inputArgs.setDim(100);
    inputArgs.setEpoch(20);

    FastText model = FastText.trainSupervised(trainFile, inputArgs);

    // Save the model
//  String modelFileName=buildFileName(rootPath,"fastText_model");
    t2= System.currentTimeMillis();
    System.out.println("Training finished in "+formatTime(t2-t1)+", saving the model to directory "+rootPath);
    t1=t2;
    model.saveModel(rootPath);
    t2= System.currentTimeMillis();
    System.out.println("Model saved to file in "+formatTime(t2-t1)+".");
    t1=t2;

    // Load the model
    System.out.println("Save complete, reloading the model file");
    FastText readModel= FastText.Companion.loadModel(new File(rootPath), false);
    t2= System.currentTimeMillis();
    System.out.println("Model loaded in "+formatTime(t2-t1)+", testing the trained model");
    String text="对于美国海军驱逐舰近日在南海的活动情况，解放军南部战区在3月11日进行了回应。南部战区新闻发言人李华敏大校表示，3月10日，美军“麦克坎贝尔”号导弹驱逐舰擅自闯入中国西沙领海。中国人民解放军南部战区组织海空兵力全程对其跟踪监视、查证识别，并予以警告驱离。美方打着“航行自由”的幌子，一而再、再而三地在南海秀肌肉、挑衅滋事。这是违反国际法规则的霸权行径，是威胁南海地区和平稳定的祸乱之源。中国对南海诸岛及其附近海域拥有无可争辩的主权，中国军队时刻保持高度戒备，采取一切必要措施，坚决捍卫国家主权安全，坚决维护南海地区和平稳定。";
    String[] sText=FeatureSplitter.splitToArray(text);
    List<ScoreLabelPair> m= readModel.predict(Arrays.asList(sText), 5, 0);
    for(ScoreLabelPair slp: m) {
        System.out.println(slp.getLabel()+"\t"+slp.getScore());
    }
}
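For readers unfamiliar with the training-file layout the code above builds: fastText-style supervised training data puts one sample per line, with the category written as a token that starts with __label__, followed by the space-separated words of the sample. The following is a minimal sketch that assumes this default prefix. The class and method names (TrainingLineFormat, toTrainingLine) and the sample label and tokens are hypothetical and are not part of mynlp.

```java
import java.util.Arrays;
import java.util.List;

final class TrainingLineFormat {

    // Assumption: the default fastText label prefix is in use.
    static final String LABEL_PREFIX = "__label__";

    /** Builds one training line: the prefixed label followed by space-separated tokens. */
    static String toTrainingLine(String label, List<String> tokens) {
        return LABEL_PREFIX + label + " " + String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Hypothetical label and tokens, used only to show the shape of a line.
        String line = toTrainingLine("military", Arrays.asList("navy", "destroyer", "patrol"));
        System.out.println(line); // prints: __label__military navy destroyer patrol
    }
}
```

Writing one such line per document, rather than one line per category directory, gives the trainer more samples per label, although whether that matters depends on the data set.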

jimichan commented 4 years ago

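I need to look at your Maven or Gradle dependencies. The library uses Kotlin's UInt unsigned integer type, which is an experimental API, and I suspect it may not be enabled in the Java environment. I need to verify this.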
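One quick way to test this suspicion from the Java side is to ask the class loader whether kotlin.UInt can be found at all. The following is a minimal sketch using only JDK reflection; the class name KotlinUIntCheck and the helper isClassPresent are hypothetical names introduced for illustration, and the sketch assumes it runs with the same classpath as the failing program.

```java
final class KotlinUIntCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers
            Class.forName(className, false, KotlinUIntCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // kotlin.Unit exists in every Kotlin standard library; kotlin.UInt only in newer ones.
        System.out.println("kotlin.Unit present: " + isClassPresent("kotlin.Unit"));
        System.out.println("kotlin.UInt present: " + isClassPresent("kotlin.UInt"));
    }
}
```

If kotlin.Unit is reported present but kotlin.UInt is not, a Kotlin standard library is on the classpath but it is too old to contain UInt.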

jimichan commented 4 years ago

Also, which JDK version are you using?

alangyun commented 4 years ago

The JDK version is 1.8. The Maven configuration of my project:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>

<artifactId>alangyun-ai</artifactId>
<name>alangyun-ai</name>
<description>the task'message bean of alangyun</description>

<parent>
    <groupId>com.alangyun</groupId>
    <artifactId>alangyun-parent</artifactId>
    <version>0.1.0</version>
</parent>

<properties>
    <mynlp_version>3.1.0</mynlp_version>
</properties>

<dependencies>
    <!-- ansj analyzer -->
    <!-- <dependency> <groupId>org.ansj</groupId> <artifactId>ansj_seg</artifactId> 
        <version>5.1.6</version> </dependency> -->
    <dependency>
        <groupId>com.mayabot.mynlp</groupId>
        <artifactId>mynlp</artifactId>
        <version>3.1.0</version>
    </dependency>
    <!-- mynlp word segmentation -->
    <dependency>
        <groupId>com.mayabot.mynlp</groupId>
        <artifactId>mynlp-summary</artifactId>
        <version>${mynlp_version}</version>
    </dependency>
    <!-- mynlp automatic classification (fasttext) -->
    <dependency>
        <groupId>com.mayabot.mynlp</groupId>
        <artifactId>fastText4j</artifactId>
        <version>${mynlp_version}</version>
    </dependency>
    <!-- mynlp automatic summarization -->
    <dependency>
        <groupId>com.mayabot.mynlp</groupId>
        <artifactId>mynlp-classification</artifactId>
        <version>${mynlp_version}</version>
    </dependency>
    <dependency>
        <groupId>com.alangyun</groupId>
        <artifactId>alangyun-spring-base</artifactId>
        <version>${project.parent.version}</version>
    </dependency>
</dependencies>

The build section of the parent POM is configured as follows:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>${java_version}</source>
                <target>${java_version}</target>
            </configuration>
        </plugin>
    </plugins>
</build>

The complete Java file that uses fasttext is as follows:

public class MyNlpTest {

private static Lexer lexer;

static {

    lexer= Lexers.coreBuilder()      // core lexer builder
            .withPos()         // enable part-of-speech tagging
            .withPersonName()  // enable person-name recognition
            .build();
}

public static void testSplitWord() {
    String text="对于美国海军驱逐舰近日在南海的活动情况，解放军南部战区在3月11日进行了回应。南部战区新闻发言人李华敏大校表示，3月10日，美军“麦克坎贝尔”号导弹驱逐舰擅自闯入中国西沙领海。\n中国人民解放军南部战区组织海空兵力全程对其跟踪监视、查证识别，并予以警告驱离。美方打着“航行自由”的幌子，一而再、再而三地在南海秀肌肉、挑衅滋事。这是违反国际法规则的霸权行径，是威胁南海地区和平稳定的祸乱之源。中国对南海诸岛及其附近海域拥有无可争辩的主权，中国军队时刻保持高度戒备，采取一切必要措施，坚决捍卫国家主权安全，坚决维护南海地区和平稳定。";

    Sentence result= lexer.scan(text);
    System.out.println(result.toString());
}

public static void testSummary() {

    String text="对于美国海军驱逐舰近日在南海的活动情况，解放军南部战区在3月11日进行了回应。南部战区新闻发言人李华敏大校表示，3月10日，美军“麦克坎贝尔”号导弹驱逐舰擅自闯入中国西沙领海。\n中国人民解放军南部战区组织海空兵力全程对其跟踪监视、查证识别，并予以警告驱离。美方打着“航行自由”的幌子，一而再、再而三地在南海秀肌肉、挑衅滋事。这是违反国际法规则的霸权行径，是威胁南海地区和平稳定的祸乱之源。中国对南海诸岛及其附近海域拥有无可争辩的主权，中国军队时刻保持高度戒备，采取一切必要措施，坚决捍卫国家主权安全，坚决维护南海地区和平稳定。";

    KeywordSummary keywordSummary = new KeywordSummary();
    List<String> keywordList=keywordSummary.keyword(text,10);
    System.out.println("keyword:");
    for(String s: keywordList) {
        System.out.println("----"+s);
    }

    SentenceSummary sentenceSummary = new SentenceSummary();
    List<String> summaryList = sentenceSummary.summarySentences(text, 10);
    System.out.println("summary:");
    for(String s: summaryList) {
        System.out.println("----"+s);
    }
}

private static String buildFileName(String root, String name) {
    String ret=root;
    if(!ret.endsWith("/"))
        ret+="/";
    ret+=name+".txt";
    return ret;
}

private static void writeToFile(String fileName, String content) throws IOException  {

        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(new File(fileName)), "utf-8"));
        writer.write(content);
        writer.flush();
        writer.close();
}

private static String formatTime(long million) {
    long si = million % 1000;
    long x = million / 1000;
    long h = x / 3600;
    long m = (x % 3600) / 60;
    long s = (x % 3600) % 60;
    return String.format("%d:%d:%d.%d", h, m, s, si);
}

public static void testClassify() throws Exception {
    // Prepare the data
    long t1=System.currentTimeMillis(), t2=0;
    String rootPath= "D:\\template\\libsvm-3.24\\datas\\jingjian";
    System.out.println("Preparing and formatting the training set, directory: "+rootPath);
    File root = new File(rootPath);
    StringBuilder sb= new StringBuilder();
    File[] menuPaths = root.listFiles();
    for (File path : menuPaths) {
        if (!path.isDirectory())
            continue;
        // Get the files under the category directory
        File[] trainFiles = path.listFiles();
        if (trainFiles.length <= 0)
            continue;
        sb.append("__label__"+path.getName());
        for(File file: trainFiles) {
            String splitStr=FeatureSplitter.splitToStr(file);
            if(splitStr!=null && splitStr.length()>0) 
                sb.append(" ").append(splitStr);
        }
        sb.append("\n");// one line per directory
    }
    // Write to file
    String trainFilename=buildFileName(rootPath,"fast_train");
    t2= System.currentTimeMillis();
    System.out.println("Training set formatted in "+formatTime(t2-t1)+", writing to file "+trainFilename);
    t1=t2;
    writeToFile(trainFilename,sb.toString());
    t2= System.currentTimeMillis();
    System.out.println("Training set written in "+formatTime(t2-t1)+".");
    t1=t2;

    // Train
    System.out.println("Starting training");
    File trainFile = new File(trainFilename);
    InputArgs inputArgs = new InputArgs();
    inputArgs.setLoss(LossName.softmax);
    inputArgs.setLr(0.1);
    inputArgs.setDim(100);
    inputArgs.setEpoch(20);

    FastText model = FastText.trainSupervised(trainFile, inputArgs);

    // Save the model
//  String modelFileName=buildFileName(rootPath,"fastText_model");
    t2= System.currentTimeMillis();
    System.out.println("Training finished in "+formatTime(t2-t1)+", saving the model to directory "+rootPath);
    t1=t2;
    model.saveModel(rootPath);
    t2= System.currentTimeMillis();
    System.out.println("Model saved to file in "+formatTime(t2-t1)+".");
    t1=t2;

    // Load the model
    System.out.println("Save complete, reloading the model file");
    FastText readModel= FastText.Companion.loadModel(new File(rootPath), false);
    t2= System.currentTimeMillis();
    System.out.println("Model loaded in "+formatTime(t2-t1)+", testing the trained model");
    String text="对于美国海军驱逐舰近日在南海的活动情况，解放军南部战区在3月11日进行了回应。南部战区新闻发言人李华敏大校表示，3月10日，美军“麦克坎贝尔”号导弹驱逐舰擅自闯入中国西沙领海。中国人民解放军南部战区组织海空兵力全程对其跟踪监视、查证识别，并予以警告驱离。美方打着“航行自由”的幌子，一而再、再而三地在南海秀肌肉、挑衅滋事。这是违反国际法规则的霸权行径，是威胁南海地区和平稳定的祸乱之源。中国对南海诸岛及其附近海域拥有无可争辩的主权，中国军队时刻保持高度戒备，采取一切必要措施，坚决捍卫国家主权安全，坚决维护南海地区和平稳定。";
    String[] sText=FeatureSplitter.splitToArray(text);
    List<ScoreLabelPair> m= readModel.predict(Arrays.asList(sText), 5, 0);
    for(ScoreLabelPair slp: m) {
        System.out.println(slp.getLabel()+"\t"+slp.getScore());
    }
}

public static void main(String[] args) {

//  MyNlpTest.testSplitWord();
//  MyNlpTest.testSummary();

    try {
        MyNlpTest.testClassify();
    }catch(Exception ex) {
        ex.printStackTrace();
    }
}

}

jimichan commented 4 years ago

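OK. I'm out of the office right now; I'll test it when I get back. In the meantime, please check whether your IDEA Kotlin plugin version is 1.3.60 or higher.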

alangyun commented 4 years ago

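OK, thanks! I'm not using IDEA; I'm using Eclipse (Photon). I had never installed a Kotlin plugin, but because the fasttext test kept failing I installed one. The Eclipse Kotlin plugin version is 0.8.0, and the Kotlin version it supports is 1.2.51-release-125.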

jimichan commented 4 years ago

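I tested locally in an IDEA environment and it runs without problems. 3.1.0 depends on the Kotlin runtime 1.3.61. Could the Eclipse plugin be at fault, forcing the 1.2.51 runtime? UInt only exists from Kotlin 1.3 onward.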
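To confirm which Kotlin runtime actually ends up on the classpath, it can help to print the version the standard library reports and the jar it was loaded from. The following is a hedged sketch that relies only on JDK reflection. It assumes the Kotlin standard library exposes kotlin.KotlinVersion with a public static CURRENT field, which is how it appears from Java in the Kotlin versions I am aware of; the class and method names (KotlinRuntimeInfo, locationOf, kotlinRuntimeVersion) are hypothetical.

```java
import java.lang.reflect.Field;
import java.security.CodeSource;

final class KotlinRuntimeInfo {

    /** Location (jar or directory) the class was loaded from, or "unknown" if the JVM does not report one. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "unknown";
        }
        return source.getLocation().toString();
    }

    /** Text form of kotlin.KotlinVersion.CURRENT, or "not found" if it cannot be read. */
    static String kotlinRuntimeVersion() {
        try {
            // Assumption: the stdlib exposes KotlinVersion.CURRENT as a public static field.
            Class<?> versionClass = Class.forName("kotlin.KotlinVersion");
            Field current = versionClass.getField("CURRENT");
            return String.valueOf(current.get(null));
        } catch (ReflectiveOperationException | LinkageError e) {
            return "not found";
        }
    }

    public static void main(String[] args) {
        System.out.println("Kotlin runtime version: " + kotlinRuntimeVersion());
        try {
            System.out.println("kotlin.Unit loaded from: " + locationOf(Class.forName("kotlin.Unit")));
        } catch (ClassNotFoundException e) {
            System.out.println("kotlin.Unit is not on the classpath");
        }
    }
}
```

If the reported location points into the Eclipse plugin directory rather than the local Maven repository, the IDE is supplying its own runtime ahead of the one declared by the library.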

alangyun commented 4 years ago

OK, I'll look for a newer version of the plugin and see whether that works. It is most likely the problem you pointed out.

mayabot / mynlp

Using FastText #18