dotchen opened this issue 2 years ago

Hi,

Thank you for releasing the code. Can you also release the .npy and pickle files loaded in the training script? Also, is it possible to share the exact script to generate these files? Many thanks!
Hi @dotchen,
I am also trying to replicate this work. I read the code and used a small random example to simulate the data, and the code seems to work. You may need to do the data generation on your own.
I removed all the data-loading sections and added:

```python
import numpy as np
import torch

num_sample = 100
n_stock = 100          # number of stocks
n_day = 5              # backward-looking window T
n_tweet_per_day = 1    # max number of tweets per day; I suppose 1 tweet per stock per day
n_price_feat = 3       # price feature dimension
n_tweet_feat = 512     # text embedding dimension
adj = np.eye(n_stock)  # adjacency matrix with only self-node connections
adj = torch.tensor(adj, dtype=torch.int8)
```
In `train(epoch)`, I use random data for training:

```python
test_text = torch.tensor(np.random.normal(size=(n_stock, n_day, n_tweet_per_day, n_tweet_feat)), dtype=torch.float32).cuda()
test_price = torch.tensor(np.random.normal(size=(n_stock, n_day, n_price_feat)), dtype=torch.float32).cuda()
test_label = torch.tensor(np.random.choice([0, 1], size=(n_stock, 1)), dtype=torch.int8).cuda()
```
However, there are some typos in the code that cause errors. For example:

- In https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/train.py#L22 it should be `model` instead of `models`.
- In https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/train.py#L140 it should be `args.lr` instead of `l_r`.

I assume the author cleaned the code but did not run it before publishing.
Thanks @chenqinkai. Were you able to follow the provided links and generate the text/price/label embeddings from real data? I am interested in reproducing the numbers published in the paper.
Hi @dotchen,
The provided link is simply the link to Google's Universal Sentence Encoder, which is easy to use. I was not trying to reproduce the numbers in the paper; I was applying the method to my own data, but the results I am getting are not great.
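For what it is worth, using the encoder takes only a few lines with TensorFlow Hub (the sample tweet below is made up):

```python
import tensorflow_hub as hub

# Load Google's Universal Sentence Encoder (v4) from TF-Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Each sentence/tweet maps to a 512-d vector.
vectors = embed(["$AAPL beats earnings estimates"]).numpy()  # shape (1, 512)
```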
It is not difficult to get a result: you simply need to construct three matrices (the definition of each axis of each matrix is in my previous post). But it seems difficult to get the exact numbers.
Also, I don't see the validation set used anywhere in the code, so I am not sure how it is used in training.
> but it seems difficult to get the exact numbers. Also, I don't see the validation set used anywhere in the code
So did you not even get good training accuracies? Also, how did you get the graph? The link points to a paper PDF without further instructions.
@dotchen The training loss was at least converging, and the in-sample accuracy was OK. But it is a training process without validation, and the accuracy on my test set was not good.
I did not understand the graph building from WikiData either, so I did not bother using the same graph as in the paper. I tried using the correlation matrix of historical returns or GICS sector membership as the graph instead.
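For illustration, here is a rough sketch of both alternatives (an untested simplification, reusing `n_stock` from my earlier snippet; `sectors` and `returns` are made-up inputs, and the 0.5 correlation threshold is arbitrary):

```python
import numpy as np

# Made-up inputs: `sectors` holds one GICS sector label per stock and
# `returns` is an (n_hist_day, n_stock) array of historical returns.
sectors = np.random.choice(["Energy", "Tech", "Health"], size=n_stock)
returns = np.random.normal(size=(252, n_stock))

# Option 1: sector graph -- connect stocks in the same GICS sector.
adj_sector = (sectors[:, None] == sectors[None, :]).astype(np.float32)

# Option 2: correlation graph -- connect stocks whose historical return
# correlation exceeds an arbitrary threshold (0.5 here).
corr = np.corrcoef(returns, rowvar=False)   # (n_stock, n_stock)
adj_corr = (corr > 0.5).astype(np.float32)
```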
@chenqinkai would you mind sharing your code to construct the matrices?
> It is not difficult to get a result: you simply need to construct three matrices (the definition of each axis of each matrix is in my previous post). But it seems difficult to get the exact numbers.
The test label only takes two possible values:

```python
test_label = torch.tensor(np.random.choice([0, 1], size=(n_stock, 1)), dtype=torch.int8).cuda()
```

But the paper states that they label > +0.55% as the positive class and < -0.5% as the negative class. So what about the null class, i.e., the observations that fall between -0.5% and +0.55%?
@vinitrinh I am not working on the same data as the paper; I am applying the method to my own data, so my code will not work for you directly.
But it is really not difficult. For Twitter data, for example, you first use the Universal Sentence Encoder to transform each tweet into a 512x1 vector. You then group these vectors by stock and date, so for each stock and each date you have a matrix of shape (n_tweet, n_tweet_feat); if there are not enough tweets for that day, you pad it with zero vectors. You then add another two dimensions to form a tensor of size (n_stock, n_day, n_tweet, n_tweet_feat), as described in my random data generation:

```python
test_text = torch.tensor(np.random.normal(size=(n_stock, n_day, n_tweet_per_day, n_tweet_feat)), dtype=torch.float32).cuda()
```
The same goes for the price data and the label data.
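As a rough illustration of the grouping and padding step (an untested sketch; `embed_by_stock_day` is a hypothetical dict mapping (stock, day) to the list of 512-d tweet vectors, and the `n_*` constants come from my earlier snippet):

```python
import numpy as np
import torch

# Hypothetical input: embed_by_stock_day[(s, d)] is the (possibly empty)
# list of 512-d Universal Sentence Encoder vectors for stock s on day d.
text = np.zeros((n_stock, n_day, n_tweet_per_day, n_tweet_feat), dtype=np.float32)
for s in range(n_stock):
    for d in range(n_day):
        vecs = embed_by_stock_day.get((s, d), [])[:n_tweet_per_day]
        for t, vec in enumerate(vecs):
            text[s, d, t] = vec        # slots without tweets stay zero-padded
test_text = torch.tensor(text).cuda()  # (n_stock, n_day, n_tweet, n_tweet_feat)
```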
As for the neutral class, I think those samples are simply removed, as described in https://aclanthology.org/P18-1183.pdf, Section 3, paragraph 2.
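Something like this, I suppose (untested sketch; `next_ret` is a hypothetical array of next-day returns):

```python
import numpy as np

# Hypothetical input: next_ret is an (n_stock,) array of next-day returns.
pos = next_ret >= 0.0055            # >= +0.55% -> positive class (1)
neg = next_ret <= -0.005            # <= -0.50% -> negative class (0)
keep = pos | neg                    # neutral samples in between are dropped
labels = pos[keep].astype(np.int8)  # binary labels for the kept samples
# The same `keep` mask would also have to be applied to the text and price tensors.
```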
@vinitrinh This is just my understanding of the code and is not necessarily correct; it would be much better if the author could clarify or share his code.
> I did not understand the graph building from WikiData either, so I did not bother using the same graph as in the paper. I tried using the correlation matrix of historical returns or GICS sector membership as the graph instead.

@chenqinkai Could you please explain the construction of the graph more clearly? For example, how do you construct the graph based on GICS sectors? Does it mean that stocks in the same sector have value 1 in the corresponding matrix entries and 0 otherwise?
@rloner Yes, and then you normalize it with D^-1/2 on both sides (i.e., D^-1/2 A D^-1/2).
@chenqinkai Sorry, but is it necessary to normalize the graph matrix in the GAT model? It seems to be unnecessary in the paper?
@rloner It is normalized here: https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/utils.py#L23
But it is a small detail; you can try it either way.
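For reference, a small sketch of that symmetric normalization (my own illustration, not necessarily identical to the repo's utils.py):

```python
import numpy as np

def normalize_adj(adj):
    """Symmetric normalization: D^-1/2 (A + I) D^-1/2."""
    adj = adj + np.eye(adj.shape[0])        # ensure self-connections
    d_inv_sqrt = np.power(adj.sum(axis=1), -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0  # guard against isolated nodes
    d_mat = np.diag(d_inv_sqrt)
    return d_mat @ adj @ d_mat
```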
@chenqinkai Thank you very much! It would be really nice if you could upload your code; I still seem to have some difficulty constructing the graph.
Actually, I don't think you can reproduce the numbers in the paper using the author's code... If somebody only releases part of their code, with bugs, how could you expect to reproduce the results?