@davidefiocco I have made the modifications to the transform_roc method as below for the entailment task, which is a classification problem:
```python
def transform_roc(X1, X2, X3):
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    delimiter = encoder['_delimiter_']
    for i, (x1, x2, x3), in enumerate(zip(X1, X2, X3)):
        x12 = [start] + x1[:max_len] + [delimiter] + x2[:max_len] + [clf_token]
        x13 = [start] + x1[:max_len] + [delimiter] + x3[:max_len] + [clf_token]
        l12 = len(x12)
        l13 = len(x13)
        xmb[i, 0, :l12, 0] = x12
        xmb[i, 1, :l13, 0] = x13
        mmb[i, 0, :l12] = 1
        mmb[i, 1, :l13] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb
```
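To make the shapes concrete, here is a toy call (a sketch; it assumes the usual globals `n_ctx`, `max_len`, `encoder`, `clf_token`, `n_vocab` and `n_special` are set as in train.py, and the token ids are made up):

```python
# toy batch: one premise (X1) with two candidate hypotheses (X2, X3)
xmb, mmb = transform_roc([[10, 11]], [[20]], [[30, 31]])
print(xmb.shape)  # (1, 2, n_ctx, 2): last axis holds [token id, position id]
print(mmb.shape)  # (1, 2, n_ctx): 1.0 over real tokens, 0.0 over padding
```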
You can refer to my question in #40. Hope this helps.
@lordzuko thanks!
I had seen https://github.com/huggingface/pytorch-openai-transformer-lm/issues/40 and that's excellent guidance on how the transform function should change depending on the task. However, given that the README shows an "architectural" difference between classification and entailment (see https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/assets/ftlm.png), I thought the edits should be somewhat more profound with respect to the original transform_roc, and that I should use a function of the form:
```python
def transform_snippet(X1):
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, (x1), in enumerate(X1):
        x12 = [start] + x1[:max_len] + [clf_token]
        l12 = len(x12)
        xmb[i, 0, :l12, 0] = x12
        mmb[i, 0, :l12] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb
```
but this is not working as intended as of now.
@davidefiocco Are you getting any error, or is the model just not training as intended? If you are getting an error, can you please share the error trace?
@lordzuko I have tried implementing the steps that I described in https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issue-362661674 (hoping they make sense...) and got:
```
Traceback (most recent call last):
  File "train.py", line 218, in <module>
    teX, teM = transform_snippet(teX1)
  File "train.py", line 29, in transform_snippet
    xmb[i, 0, :l12, 0] = x12
ValueError: setting an array element with a sequence.
```
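For what it's worth, numpy raises this exact ValueError when the right-hand side of the assignment is a ragged sequence, e.g. if X1 is nested one level deeper than expected so that x12 ends up mixing ints and sub-lists. A minimal repro (the shapes and token ids are made up):

```python
import numpy as np

xmb = np.zeros((1, 2, 77, 2), dtype=np.int32)
# if each x1 is accidentally a list of token lists, x12 mixes ints and lists:
x12 = [40478] + [[249, 1163], [576]] + [40480]
xmb[0, 0, :len(x12), 0] = x12  # ValueError: setting an array element with a sequence.
```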
To make this reproducible I can provide more code/publish a fork, as I modified the current code in several parts (see my comment above for the full list) when trying to implement the head change (https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issuecomment-424622723 is one of the changes). I am not very proficient in PyTorch yet, though (so these may be clumsy changes); that's why the questions in https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issue-362661674. Most likely, the transform_snippet function posted in https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issuecomment-424622723 is not OK.
Hi @davidefiocco,
Your transform_snippet function should be the way to go.
I think it's just a Python typo. It looks like your l12 is equal to one, which probably comes from this line: for i, (x1), in enumerate(X1). Try using for i, x1 in enumerate(X1) instead.
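For completeness, with that fix the single-input transform would look something like the sketch below (untested; sizing the second axis to 1, since there is only one sequence per example, is an assumption on my side, and the usual train.py globals `encoder`, `max_len`, `clf_token`, `n_ctx`, `n_vocab`, `n_special` are assumed):

```python
def transform_snippet(X1):
    n_batch = len(X1)
    # one sequence per example, so the "choice" axis has size 1
    xmb = np.zeros((n_batch, 1, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 1, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, x1 in enumerate(X1):
        x12 = [start] + x1[:max_len] + [clf_token]
        l12 = len(x12)
        xmb[i, 0, :l12, 0] = x12
        mmb[i, 0, :l12] = 1
    # position ids that are added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb
```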
Hi @thomwolf, thanks for your reply and tip!
As promised, I forked the code; you can find the result at https://github.com/huggingface/pytorch-openai-transformer-lm/compare/master...davidefiocco:master and that specific edit at https://github.com/davidefiocco/pytorch-openai-transformer-lm/blob/e9945725603544cdebaec91937d4a16f14db0ad8/train.py#L26
In the fork's naming, news stands for "newsgroup", as I tried to classify snippets of text coming from a (2-newsgroup) subset of the 20 newsgroups dataset (http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). I haven't been successful in using the algorithm yet (the code now runs without errors, but the iterations don't seem to converge).
I will update this issue if I manage to get it sorted, and if someone is keen on giving feedback on what needs to be changed in the code I'll be very happy to work on it.
I had another bug, which I think I fixed with https://github.com/davidefiocco/pytorch-openai-transformer-lm/commit/d546da7c7076fac73d8fc850b2d0066edc36680c, and I now seem to converge and reproduce the 91+% evaluation accuracy on SST-2.
I am still not sure that everything is really fine, but at least it seems to converge now!
Hi,
I am new to PyTorch (but still more at ease with it than with TF), so I thought I'd experiment with @thomwolf's implementation in this repo (thanks for sharing it!!).
I would like to try out the code to perform binary text classification of text snippets, similar to classification tasks such as the Corpus of Linguistic Acceptability (CoLA) and the Stanford Sentiment Treebank (SST-2) in the original reference.
These are the steps that I think are needed to get the code working (but I am not sure that these are correct and/or exhaustive):
- Create `snippets_train.csv`, `snippets_val.csv` and `snippets_test.csv`, containing two columns, `text` (a string) and `class` (an int equal to 0 or 1).
- In `datasets.py`, create two new functions: `_snippets`, returning two lists `st, y`, and `snippets`, defined with different values of `n_train` and `n_valid` and whose return statement looks like `return (trX, trY), (vaX, vaY), (teX,)` (see the first sketch at the end of this comment).
- In `train.py`, rewrite `transform_roc` into a `transform_snippet` that doesn't use `[delimiter]` and takes only one argument in input <- somewhat tricky to me, can anyone provide some guidance?
- In `train.py`, in the encoding bit and afterwards:
  - adapt `encode_dataset` to match the output of the `snippets` function redefined above;
  - remove `encoder['_delimiter_'] = len(encoder)`;
  - set `n_special = 2`, as we got rid of `['_delimiter_']`;
  - fix the variables having `2` and `3` in their name (?), e.g. in the definition of `n_ctx` <- somewhat tricky to me, can anyone provide some guidance?
- In `train.py`, change the model and the loss (see the second sketch at the end of this comment):
  - define `dh_model` to use `('classification', 2)` instead of `'multiple_choice'`;
  - use `ClassificationLossCompute` instead of `MultipleChoiceLossCompute`.
- In `analysis.py`, define a `snippets` function so as to invoke `_snippets` (from `datasets.py`) and read in `snippets_test.csv`, and adjust its call to `_snippets` so as to take into account that it outputs two lists (not 4).
- Modify the rest of `train.py` coherently with all of the above.

Does all of the above make sense as a plan, or can somebody fill in missing bits or provide an alternative list of "sub-steps"? Also, can someone provide some guidance on how to rewrite `transform_roc`? (Comments on the original code would be fantastic; I am glad to annotate the original function and contribute to the repo as a result of this!)

Thanks to anyone patiently reading this!
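To make a couple of these steps concrete, here is a rough sketch of the two `datasets.py` functions I have in mind. The csv layout, the file names and the split sizes are placeholders of mine; the structure mirrors `_rocstories`/`rocstories` in the repo:

```python
import csv
import os

def _snippets(path):
    # read a csv with `text` and `class` columns; return the two lists (st, y)
    st, y = [], []
    with open(path, encoding='utf_8') as f:
        for row in csv.DictReader(f):
            st.append(row['text'])
            y.append(int(row['class']))
    return st, y

def snippets(data_dir, n_train=1000, n_valid=200):
    # mirrors rocstories(): train/val come with labels, test labels are held out
    trX, trY = _snippets(os.path.join(data_dir, 'snippets_train.csv'))
    vaX, vaY = _snippets(os.path.join(data_dir, 'snippets_val.csv'))
    teX, _ = _snippets(os.path.join(data_dir, 'snippets_test.csv'))
    return (trX[:n_train], trY[:n_train]), (vaX[:n_valid], vaY[:n_valid]), (teX,)
```

And the model/loss swap in `train.py` should, if I read the code correctly, boil down to something like this (untested):

```python
dh_model = DoubleHeadModel(args, clf_token, ('classification', 2), vocab, n_ctx)
# ...
compute_loss_fct = ClassificationLossCompute(criterion, criterion,
                                             args.lm_coef, model_opt)
```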