intel-analytics / ipex-llm


Text classification example using 20 newsgroup #1493

Open · jinyichao opened this issue 7 years ago

jinyichao commented 7 years ago

After looking more deeply into the example code, I realized that the dataset used for the text classification example does not seem appropriate.

In the dataset code (i.e., BigDL/pyspark/bigdl/dataset/news20.py), the data is downloaded from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz, where the first few lines of each document actually contain the category and some other metadata.

At the same time, the sample code (i.e., https://github.com/intel-analytics/BigDL/blob/master/pyspark/bigdl/models/textclassifier/textclassifier.py) sets `sequence_len = 50`, which means only the first 50 word vectors are used as the feature. It seems unfair to use such a hint to predict the category itself.

Would it be better to use the cleaned dataset (i.e., http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz) instead?

Please correct me if I am wrong; otherwise, I will try to create a PR to fix it. :)
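
For reference, here is a rough sketch of the kind of header stripping I have in mind if we keep the current archive (purely illustrative; it assumes the usual 20 Newsgroups layout where the metadata headers are separated from the body by the first blank line, and the helper name is my own):

    def strip_newsgroup_headers(text):
        """Drop the leading metadata headers (Newsgroups:, Subject:, ...).

        Rough sketch: assumes the headers end at the first blank line,
        as in the standard 20 Newsgroups files.
        """
        _header, sep, body = text.partition("\n\n")
        # If no blank line is found, keep the original text unchanged.
        return body if sep else text

The by-date archive would also help, since (as I understand it) it already removes duplicates and some of the headers.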

zhichao-li commented 7 years ago

A PR is welcome here. You can refer to https://github.com/intel-analytics/BigDL/tree/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/example/textclassification for more info.

jinyichao commented 7 years ago

Thanks for the clarification; the Scala example does seem to handle the task correctly (it uses a much larger sequence length of 1000). However, I ran into an error while modifying the Python example. Here is the error message:

    element number must match Reshape size. But In Reshape@8a1fd53c : element number is: 6144 , reshape size is: 2048

Do you have any clue on how to get past this? Thank you!

zhichao-li commented 7 years ago

You need to modify the parameters of `Reshape` to match the required size if you have changed the parameters of the previous layers.
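
To illustrate the rule (the numbers here are only an example, not necessarily your exact case): `Reshape` requires the product of its target dimensions to equal the number of elements per sample coming out of the previous layer. If the last pooling layer produces a 128 x 1 x 3 output, that is 384 elements, so `Reshape([128])` cannot apply; either the earlier layers have to shrink the width down to 1, or the `Reshape` (and the `Linear` layers after it) have to be resized to match.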

jinyichao commented 7 years ago

I just changed the `sequence_len` value at the beginning; it is later passed into the first `Reshape` layer as a parameter. As such, I am a bit puzzled about where and how to change the parameters of the following layers.

jinyichao commented 7 years ago

Here is the first layer definition in the code:

    model.add(Reshape([embedding_dim, 1, sequence_len]))

zhichao-li commented 7 years ago

You can take a look at this file for those parameters: https://github.com/intel-analytics/BigDL/blob/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/example/utils/TextClassifier.scala

By the way, how is your PR progressing?

jinyichao commented 7 years ago

Yeah, the Python example uses the same parameters as that Scala example; here is the code:

    model = Sequential()
    model.add(Reshape([embedding_dim, 1, sequence_len]))
    model.add(SpatialConvolution(embedding_dim, 128, 5, 1).set_name('conv1'))
    model.add(ReLU())
    model.add(SpatialMaxPooling(5, 1, 5, 1))
    model.add(SpatialConvolution(128, 128, 5, 1).set_name('conv2'))
    model.add(ReLU())
    model.add(SpatialMaxPooling(5, 1, 5, 1))
    model.add(Reshape([128]))
    model.add(Linear(128, 100).set_name('fc1'))
    model.add(Linear(100, class_num).set_name('fc2'))
    model.add(LogSoftMax())

But it hits the Reshape element-number mismatch issue. As such, I suspect the problem is in the pre-processing part.
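
Independently of the pre-processing, here is a rough sketch I used to sanity-check the model shapes (my own illustration, not taken from the BigDL source; it assumes stride-1 convolutions with a 5-wide kernel and default floor-mode pooling):

    def width_after_conv_pool_stack(sequence_len):
        """Trace the width dimension through the conv/pool layers above.

        Rough sketch only: assumes SpatialConvolution with stride 1 and
        SpatialMaxPooling in floor mode, matching my reading of the example.
        """
        w = sequence_len
        w = w - 5 + 1            # SpatialConvolution(embedding_dim, 128, 5, 1)
        w = (w - 5) // 5 + 1     # SpatialMaxPooling(5, 1, 5, 1)
        w = w - 5 + 1            # SpatialConvolution(128, 128, 5, 1)
        w = (w - 5) // 5 + 1     # SpatialMaxPooling(5, 1, 5, 1)
        return w

    for seq_len in (50, 500, 1000):
        w = width_after_conv_pool_stack(seq_len)
        # Reshape([128]) only fits when the remaining width is exactly 1.
        print(seq_len, w, 128 * w)

If this arithmetic is right, the width only collapses to 1 for sequence lengths around 50, which would explain why `Reshape([128])` works in the original example but breaks as soon as `sequence_len` is increased; a larger sequence length would need the pooling sizes (or the final `Reshape` and the `Linear` layers) adjusted to match.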

As for the PR, I was able to switch the data source to the clean version without the headers and metadata (i.e., http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz), but the direct results are not good (around 62% accuracy without changing any parameters).

I haven't managed to dig deeper into the source code because these days are quite busy. I would appreciate any further hints on this issue.