midas-research / man-sf-emnlp


npy and pickle files #1

Open dotchen opened 2 years ago

dotchen commented 2 years ago

Hi,

Thank you for releasing the code.

Can you also release the .npy and pickle files loaded in the training script?

dotchen commented 2 years ago

Also, is it possible to share the exact script to generate these files? Many thanks

chenqinkai commented 2 years ago

Hi @dotchen ,

I am also trying to replicate the work. I read the code and used a small random example to simulate the data, and the code seems to work. You may need to do the data generation on your own.

I removed all the data loading sections and added:

import numpy as np
import torch

num_sample = 100
n_stock = 100        # the number of stocks
n_day = 5            # the backward-looking window T
n_tweet = 1          # max num of tweets per day, I suppose 1 tweet per stock per day
n_price_feat = 3     # price feature dim
n_tweet_feat = 512   # text embedding dim

adj = np.eye(n_stock)  # an adjacency matrix with only self-node connections
adj = torch.tensor(adj, dtype=torch.int8)

In train(epoch), I use random data for training:

test_text = torch.tensor(np.random.normal(size=(n_stock, n_day, n_tweet, n_tweet_feat)), dtype=torch.float32).cuda()
test_price = torch.tensor(np.random.normal(size=(n_stock, n_day, n_price_feat)), dtype=torch.float32).cuda()
test_label = torch.tensor(np.random.choice([0, 1], size=(n_stock, 1)), dtype=torch.int8).cuda()

However, there are some typos in the code that cause errors. For example, in https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/train.py#L22 it should be model instead of models, and in https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/train.py#L140 it should be args.lr instead of l_r. I assume the author cleaned up the code but did not run it before publishing.

dotchen commented 2 years ago

Thanks @chenqinkai, were you able to follow the provided links and generate the text/price/label embeddings from real data? I am interested in reproducing the numbers published in the paper.

chenqinkai commented 2 years ago

Hi, @dotchen

The provided link simply points to Google's Universal Sentence Encoder, which is easy to use. I was not trying to reproduce the numbers in the paper; I was applying the method to my own data, but the results I am getting are not great.

It is not difficult to get a result: you simply need to construct three matrices (the definition of each axis of each matrix is in my previous post). Getting the exact numbers from the paper, however, seems difficult.

Also, I don't see the validation set being used anywhere in the code, so I'm not sure how it is used in training.

dotchen commented 2 years ago

Getting the exact numbers from the paper, however, seems difficult. Also, I don't see the validation set being used anywhere in the code

So did you not even get good training accuracies? Also, how did you get the graph? The link points to a paper PDF without further instructions.

chenqinkai commented 2 years ago

@dotchen The training loss was at least converging, and the in-sample accuracy was OK. But it is a training process without validation, and the accuracy on my test set was not good.

I did not understand the graph construction from WikiData either, so I did not bother using the same graph as in the paper. I tried using a correlation matrix of historical returns, or the GICS sector, as the graph instead.
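
For illustration, a minimal sketch of the correlation-based alternative; the function name, the 0.5 threshold, and the returns array of shape (n_day, n_stock) are assumptions for this example, not from the repository:

import numpy as np
import torch

def correlation_graph(returns: np.ndarray, threshold: float = 0.5) -> torch.Tensor:
    # `returns` is assumed to be a (n_day, n_stock) array of historical daily returns
    corr = np.corrcoef(returns.T)                           # (n_stock, n_stock) correlation matrix
    adj = (np.abs(corr) >= threshold).astype(np.float32)    # keep only strongly correlated pairs
    np.fill_diagonal(adj, 1.0)                               # keep self-connections
    return torch.tensor(adj, dtype=torch.float32)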

jeremytanjianle commented 2 years ago

@chenqinkai would you mind sharing your code to construct the matrices?

It is not difficult to get a result: you simply need to construct three matrices (the definition of each axis of each matrix is in my previous post). Getting the exact numbers from the paper, however, seems difficult.

jeremytanjianle commented 2 years ago

The test label only takes two possible values:

test_label = torch.tensor(np.random.choice([0, 1], size=(n_stock, 1)), dtype=torch.int8).cuda()

But the paper states that movements > +0.55% are labeled as the positive class and movements < -0.5% as the negative class. So what about the null class, i.e. the observations that fall between -0.5% and +0.55%?

chenqinkai commented 2 years ago

@vinitrinh I am not working on the same data as the paper; I am applying the method to my own data, so my code will not work for you directly.

But it is really not difficult. For Twitter data, for example, you first use the Universal Sentence Encoder to transform each tweet into a 512x1 vector. You then group these vectors by stock and date, so for each stock and each date you have a matrix of shape (n_tweet, n_tweet_feat); if there are not enough tweets for that day, you pad with zero vectors. You then add another two dimensions to form a tensor of size (n_stock, n_day, n_tweet, n_tweet_feat), as described in my random data generation:

test_text = torch.tensor(np.random.normal(size=(n_stock, n_day, n_tweet, n_tweet_feat)), dtype=torch.float32).cuda()

The same goes for the price data and the label data; a rough sketch of the tweet-tensor construction is below.
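
A minimal sketch of this grouping and zero-padding step; the nested `tweets[stock][day]` dict, the `embed` callable (e.g. the Universal Sentence Encoder loaded via tensorflow_hub, mapping a list of strings to (len, 512) embeddings), and the helper name are assumptions for illustration, not from the repository:

import numpy as np
import torch

def build_text_tensor(tweets, stocks, days, embed, n_tweet=1, n_tweet_feat=512):
    # tweets[stock][day] is assumed to be a list of raw tweet strings
    text = np.zeros((len(stocks), len(days), n_tweet, n_tweet_feat), dtype=np.float32)
    for i, stock in enumerate(stocks):
        for j, day in enumerate(days):
            day_tweets = tweets.get(stock, {}).get(day, [])[:n_tweet]
            if day_tweets:
                vecs = np.asarray(embed(day_tweets))       # (n_found, n_tweet_feat)
                text[i, j, :len(day_tweets)] = vecs        # remaining slots stay zero-padded
    return torch.tensor(text, dtype=torch.float32)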

For the neutral class, I think those samples are simply removed, as described in https://aclanthology.org/P18-1183.pdf, Section 3, paragraph 2.
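
A rough sketch of that labeling rule under the thresholds quoted above (movements > +0.55% positive, < -0.5% negative, the neutral band dropped); the function name and the `moves` array of per-stock percentage movements are assumptions, not from the repository:

import numpy as np
import torch

def make_labels(moves: np.ndarray):
    # `moves` is assumed to be a (n_stock,) array of percentage price movements
    keep = (moves > 0.55) | (moves < -0.5)            # drop the neutral band in between
    labels = (moves[keep] > 0.55).astype(np.int64)    # 1 = up, 0 = down
    return keep, torch.tensor(labels).unsqueeze(1)    # use `keep` to filter the other tensors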

chenqinkai commented 2 years ago

@vinitrinh This is based on my understanding of the code and is not necessarily correct; it would be much better if the author could clarify or share his code.

rloner commented 2 years ago

@chenqinkai Could you please explain the construction of the graph more clearly? For example, how do you construct the graph based on GICS sectors? Does it mean that stocks in the same sector have the value 1 in the corresponding matrix entry, and 0 otherwise?

I did not understand the graph construction from WikiData either, so I did not bother using the same graph as in the paper. I tried using a correlation matrix of historical returns, or the GICS sector, as the graph instead.

chenqinkai commented 2 years ago

@rloner Yes, and then you normalize it with D^-1/2.
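
A minimal sketch of such a sector graph with that D^-1/2 normalization; the function name and the `sectors` list (one GICS sector code per stock) are assumptions for illustration:

import numpy as np
import torch

def sector_graph(sectors):
    sectors = np.asarray(sectors)
    adj = (sectors[:, None] == sectors[None, :]).astype(np.float32)  # 1 if two stocks share a sector
    deg = adj.sum(axis=1)                                            # node degrees (>= 1, diagonal is 1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    norm_adj = d_inv_sqrt @ adj @ d_inv_sqrt                         # D^-1/2 A D^-1/2
    return torch.tensor(norm_adj, dtype=torch.float32)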

rloner commented 2 years ago

@chenqinkai Sorry, but is it necessary to normalize the graph matrix in the GAT model? It seems to be unnecessary in the paper.

chenqinkai commented 2 years ago

@rloner It is normalized here: https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/utils.py#L23

But it is a small detail; you can try it either way.

rloner commented 2 years ago

@chenqinkai Thank you very much! It would be really nice if you could upload your code. I still seem to have some difficulties constructing the graph.

TongLiu-github commented 2 years ago

Actually, I don't think you can reproduce the numbers in the paper using the author's code. If somebody only releases part of their code, and with bugs, how could you expect to reproduce the results?