Open canjiali opened 5 years ago
Thanks for noticing this and sorry for the late response.
It seems that there is a bug in the new version of the dataset: https://github.com/dfcf93/MSMARCO/issues/31
Let's wait until they fix it. In the meantime, you can download and use the preprocessed TF Records from here: https://drive.google.com/open?id=1IHFMLOMf2WqeQ0TuZx_j3_sf1Z0fc2-6
Thanks for sharing the TFRecord files. I've successfully run the model in your shared Colab. However, one problem occurs frequently: the notebook stops training with the following error log:
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
INFO:tensorflow:Error recorded from infeed: Socket closed
So I have to reload the checkpoint file and restart training each time. Do you have any ideas?
That happens to me as well, but it is not frequent, approximately once every 50 hours of training.
To fix it, I click "Reset all runtimes" and then "Run all". Training automatically reloads the last checkpoint, so it doesn't have to start from scratch every time the error occurs.
Aha, it happens much more frequently for me, approximately every 2 hours. I close the Colab page once training starts. Do you keep the page open?
I keep the Colab page open all the time.
@rodrigonogueira4 could you upload the MS MARCO dataset that was used to generate the provided TFRecords (or at least the top1000.eval.tsv file)? http://www.msmarco.org only provides the newest dataset, v2.1. Thank you in advance!
Thanks for your patience. I'm working on that: https://github.com/dfcf93/MSMARCO/issues/31#issuecomment-512864279
Hi, I've downloaded the dataset and tried to run the "convert_msmarco_to_tfrecord.py" script. The following error occurred when some lines were read:
Traceback (most recent call last):
  File "convert_msmarco_to_tfrecord.py", line 217, in <module>
    main()
  File "convert_msmarco_to_tfrecord.py", line 211, in main
    convert_train_dataset(tokenizer=tokenizer)
  File "convert_msmarco_to_tfrecord.py", line 191, in convert_train_dataset
    query, positive_doc, negative_doc = line.rstrip().split('\t')
ValueError: need more than 1 value to unpack
It seems that some lines have fewer than 3 fields after splitting on "\t". There are 2579 such lines. Although I can skip those lines, it would be good to confirm with others that this problem actually happens.
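For anyone hitting the same error, here is a minimal sketch of skipping the malformed lines while counting them, so the count can be compared against the 2579 reported above. The function names `split_triple` and `read_triples` are hypothetical and are not part of convert_msmarco_to_tfrecord.py; the sketch only assumes that each valid line holds exactly three tab-separated fields (query, positive document, negative document).

```python
def split_triple(line):
    """Return (query, positive_doc, negative_doc), or None if the line is malformed."""
    # Strip only the line terminator: a bare rstrip() would also remove
    # trailing tabs and could hide an empty last field.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        return None
    return tuple(fields)


def read_triples(lines):
    """Collect well-formed triples and count the lines that were skipped."""
    triples = []
    skipped = 0
    for line in lines:
        triple = split_triple(line)
        if triple is None:
            skipped += 1
            continue
        triples.append(triple)
    return triples, skipped


# Small in-memory example; with the real file, pass an open file handle instead.
sample = [
    "q1\tpos1\tneg1\n",
    "line without tabs\n",
    "q2\tpos2\tneg2\n",
]
triples, skipped = read_triples(sample)
print("kept %d lines, skipped %d" % (len(triples), skipped))
```

Printing the skipped count at the end of a full run makes it easy for others to check whether they see the same number of bad lines in their copy of the dataset.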