nyu-dl / dl4marco-bert

BSD 3-Clause "New" or "Revised" License

Error in preprocessing #15

Open canjiali opened 5 years ago

canjiali commented 5 years ago

Hi, I've downloaded the dataset and tried to run the "convert_msmarco_to_tfrecord.py" script. The following error occurred when some lines were read:

Traceback (most recent call last):
  File "convert_msmarco_to_tfrecord.py", line 217, in <module>
    main()
  File "convert_msmarco_to_tfrecord.py", line 211, in main
    convert_train_dataset(tokenizer=tokenizer)
  File "convert_msmarco_to_tfrecord.py", line 191, in convert_train_dataset
    query, positive_doc, negative_doc = line.rstrip().split('\t')
ValueError: need more than 1 value to unpack

It seems that some lines yield fewer than 3 fields when split on "\t". There are 2579 such lines. Although I can skip those lines, it would be good to confirm with others that this problem actually happens.
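
For anyone who wants to skip the malformed lines in the meantime, here is a minimal sketch of that workaround. It is not part of the repository's script: the helper name `read_triples` is hypothetical, and it assumes that every valid line holds exactly three tab-separated fields (query, positive document, negative document).

```python
def read_triples(lines):
    """Collect well-formed (query, positive_doc, negative_doc) triples.

    Hypothetical helper, not taken from convert_msmarco_to_tfrecord.py.
    Returns the triples together with the number of skipped lines.
    """
    triples = []
    skipped = 0
    for line in lines:
        # Strip only the line terminator. A bare rstrip() would also remove
        # trailing tabs and could hide an empty last field.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            skipped += 1  # malformed line: count it and move on
            continue
        triples.append(tuple(fields))
    return triples, skipped


if __name__ == "__main__":
    sample = ["q1\tpos doc\tneg doc\n", "broken line\n", "q2\tp\tn\n"]
    triples, skipped = read_triples(sample)
    print(len(triples), "triples kept,", skipped, "lines skipped")
```

Logging the skipped count makes it easy to compare against the number reported above.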

rodrigonogueira4 commented 5 years ago

Thanks for noticing this and sorry for the late response.

It seems that there is a bug in the new version of the dataset: https://github.com/dfcf93/MSMARCO/issues/31

Let's wait until they fix it. In the meantime, you can download and use the preprocessed TF Records from here: https://drive.google.com/open?id=1IHFMLOMf2WqeQ0TuZx_j3_sf1Z0fc2-6

canjiali commented 5 years ago

Thanks for sharing the TFRecord files. I've successfully run the model in your shared Colab. However, one problem occurs frequently: the notebook stops training with the following error log:

INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
INFO:tensorflow:Error recorded from infeed: Socket closed

So I have to reload the checkpoint file and restart training. Do you have any ideas?

rodrigonogueira4 commented 5 years ago

That happens to me as well, but it is not frequent, approximately once every 50 hours of training.

To fix it, I click on "Reset all runtimes" and then "Run all". Training automatically reloads the last checkpoint, so it doesn't have to start from scratch every time the error occurs.
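
The resume-from-checkpoint idea can be illustrated with a small, framework-free sketch. This is not the notebook's code: `train_with_resume`, `train_step`, and the dictionary used as a checkpoint are all hypothetical stand-ins, and a real TPU job would save and restore actual model state instead.

```python
def train_with_resume(train_step, total_steps, checkpoint, max_restarts=3):
    """Run train_step for each step, resuming from the checkpoint on failure.

    Hypothetical sketch: `checkpoint` is a dict whose "step" entry stands in
    for the last saved checkpoint. Returns the number of restarts needed.
    """
    restarts = 0
    while checkpoint["step"] < total_steps:
        try:
            # Resume from the last saved step instead of from step 0.
            for step in range(checkpoint["step"], total_steps):
                train_step(step)
                checkpoint["step"] = step + 1  # "save" after each step
        except ConnectionError:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up after too many failures
    return restarts


if __name__ == "__main__":
    failed_once = []

    def flaky_step(step):
        # Simulate a single "Socket closed" failure at step 3.
        if step == 3 and not failed_once:
            failed_once.append(True)
            raise ConnectionError("Socket closed")

    state = {"step": 0}
    print("restarts:", train_with_resume(flaky_step, 5, state))
```

Only the steps after the last saved checkpoint are repeated, which is why an occasional failure costs little training time.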

canjiali commented 5 years ago

Aha, it happens much more frequently for me, approximately every 2 hours. I close the Colab page once training starts. Do you keep the page open?

rodrigonogueira4 commented 5 years ago

I keep the Colab page open all the time.

MateRyze commented 5 years ago

@rodrigonogueira4 could you upload the MS MARCO dataset that was used to generate the provided TFRecords (or at least the top1000.eval.tsv file)? http://www.msmarco.org provides only the newest dataset, v2.1. Thank you in advance!

rodrigonogueira4 commented 5 years ago

Thanks for your patience. I'm working on that: https://github.com/dfcf93/MSMARCO/issues/31#issuecomment-512864279