jhlau / doc2vec

Python scripts for training/testing paragraph vectors
Apache License 2.0

how is the text preprocessing done? #36

Open iTomxy opened 2 years ago

iTomxy commented 2 years ago

Hi, I want to extract doc2vec features for the sentences in MS COCO, but I'm not quite sure how the preprocessing should be performed.

The paper says the articles were tokenised and lowercased using Stanford CoreNLP. From the files under toy_data/ and the two .py files, I guess that each article is squashed into a single line in those *_docs.txt files, but those files are already preprocessed, so I can't tell exactly how that was done.
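To make my guess concrete, here is a minimal sketch of the layout I assume (one "article" per line; `captions` is a hypothetical dict mapping a COCO image id to its five caption strings, not something from this repo):

```python
# Hypothetical: captions maps image_id -> list of 5 caption strings from MS COCO.
with open("coco_docs.txt", "w", encoding="utf-8") as out:
    for image_id in sorted(captions):
        # squash one image's captions into a single line, i.e. one "article" per line
        out.write(" ".join(captions[image_id]) + "\n")
```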

Now I've installed Stanford CoreNLP and can call it from the command line. After concatenating the 5 sentences for a COCO image (separated by a space), treating it as one article, and writing it to input.txt, I run:

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -outputFormat conll -output.columns word -file input.txt

However, the output is not lowercased. How should I modify the command to enable lowercasing?
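As a workaround I'm currently lowercasing the tokens myself afterwards. A rough sketch (I'm assuming the CoNLL output ends up in input.txt.conll with one token per line and blank lines between sentences; adjust the filename to wherever CoreNLP actually writes):

```python
# Read the one-token-per-line CoNLL output, lowercase it, and rebuild a single line.
tokens = []
with open("input.txt.conll", encoding="utf-8") as f:
    for line in f:
        token = line.strip()
        if token:                      # skip blank lines between sentences
            tokens.append(token.lower())

with open("input_tokenised.txt", "w", encoding="utf-8") as f:
    f.write(" ".join(tokens) + "\n")   # one lowercased, tokenised document per line
```

But I'd prefer to do it inside CoreNLP itself if there is an option for that.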

By the way, there are other tokenization options shown here, like americanize. Did you use any of them when training the doc2vec model? If possible, could you provide the details of your preprocessing method?

Thanks

jhlau commented 2 years ago

We don't provide support for CoreNLP - you might want to ask them directly about the lowercasing question. As for tokenisation options, it's entirely up to you how you tokenise your documents (doc2vec will work with whatever you choose). I believe there are more details about the tokenisation we used in the paper, but unfortunately it's not something I can remember...
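If it helps, once each document is pre-tokenised and lowercased on its own line, a plain gensim pipeline is all doc2vec needs. This is just a generic sketch with made-up hyperparameters and gensim 4.x parameter names, not our exact training script:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# Each line of coco_docs.txt is one pre-tokenised, lowercased document.
corpus = TaggedLineDocument("coco_docs.txt")

# Illustrative dbow settings only; tune these for your own data.
model = Doc2Vec(documents=corpus, dm=0, dbow_words=1,
                vector_size=300, window=15, min_count=5,
                epochs=20, workers=4)

# Infer a vector for a new (already tokenised and lowercased) document.
vec = model.infer_vector("a man riding a wave on top of a surfboard".split())
```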