uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Training stuck in Garbage Collector after first epoch Tensorflow #651

Open anisayari opened 3 years ago

anisayari commented 3 years ago

After the first epoch, my training gets stuck in what looks like an infinite GC loop... I kept it running for 18 hours and it is still going, while the whole training should finish in under 4 hours.

I don't understand it and cannot find any resource online. It started happening when I switched to a Petastorm distributed dataset for TensorFlow.

I really do not know what else to try. Any suggestions, please?

Thank you

10003/10003 [==============================] - ETA: 0s - factorized_top_k/top_1_categorical_accuracy: 0.0012 - factorized_top_k/top_5_categorical_accuracy: 0.0059 - factorized_top_k/top_10_categorical_accuracy: 0.0099 - factorized_top_k/top_50_categorical_accuracy: 0.0320 - factorized_top_k/top_100_categorical_accuracy: 0.0531 - loss: 4949.9486 - regularization_loss: 0.0000e+00 - total_loss: 4949.9486WARNING:tensorflow:Using a while_loop for converting BoostedTreesBucketize

2021-03-02T17:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 13969618K->27949K(28663296K)] 14255296K->313635K(86098944K), 0.0210020 secs] [Times: user=0.09 sys=0.00, real=0.02 secs] 
2021-03-02T17:47:29.138+0000: [Full GC (System.gc()) [PSYoungGen: 27949K->0K(28663296K)] [ParOldGen: 285685K->285689K(57435648K)] 313635K->285689K(86098944K), [Metaspace: 233785K->233785K(251904K)], 0.3277457 secs] [Times: user=1.18 sys=0.00, real=0.33 secs] 

2021-03-02T18:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14582618K->28842K(28669440K)] 14868307K->314539K(86105088K), 0.0199243 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 
2021-03-02T18:17:29.137+0000: [Full GC (System.gc()) [PSYoungGen: 28842K->0K(28669440K)] [ParOldGen: 285697K->285696K(57435648K)] 314539K->285696K(86105088K), [Metaspace: 233808K->233808K(251904K)], 0.3398052 secs] [Times: user=1.23 sys=0.00, real=0.34 secs] 

2021-03-02T18:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 14119819K->27684K(28667392K)] 14405515K->313380K(86103040K), 0.0248371 secs] [Times: user=0.06 sys=0.00, real=0.03 secs] 
2021-03-02T18:47:29.142+0000: [Full GC (System.gc()) [PSYoungGen: 27684K->0K(28667392K)] [ParOldGen: 285696K->285671K(57435648K)] 313380K->285671K(86103040K), [Metaspace: 233812K->233812K(251904K)], 0.3002188 secs] [Times: user=0.71 sys=0.00, real=0.30 secs] 

2021-03-02T19:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14235740K->28839K(28672512K)] 14521411K->314518K(86108160K), 0.0200179 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 
2021-03-02T19:17:29.137+0000: [Full GC (System.gc()) [PSYoungGen: 28839K->0K(28672512K)] [ParOldGen: 285679K->285716K(57435648K)] 314518K->285716K(86108160K), [Metaspace: 233840K->233840K(251904K)], 0.2681088 secs] [Times: user=0.70 sys=0.00, real=0.27 secs] 

2021-03-02T19:47:29.117+0000: [GC (System.gc()) [PSYoungGen: 14162318K->27884K(28670976K)] 14448035K->313608K(86106624K), 0.0222306 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] 
2021-03-02T19:47:29.139+0000: [Full GC (System.gc()) [PSYoungGen: 27884K->0K(28670976K)] [ParOldGen: 285724K->285709K(57435648K)] 313608K->285709K(86106624K), [Metaspace: 233849K->233849K(251904K)], 0.4094871 secs] [Times: user=1.43 sys=0.00, real=0.41 secs] 

2021-03-02T20:17:29.117+0000: [GC (System.gc()) [PSYoungGen: 14255118K->28741K(28675072K)] 14540828K->314459K(86110720K), 0.0215092 secs] [Times: user=0.10 sys=0.00, real=0.03 secs] 
2021-03-02T20:17:29.138+0000: [Full GC (System.gc()) [PSYoungGen: 28741K->0K(28675072K)] [ParOldGen: 285717K->277045K(57435648K)] 314459K->277045K(86110720K), [Metaspace: 233853K->233540K(251904K)], 0.5166519 secs] [Times: user=1.83 sys=0.00, real=0.51 secs] 
...
selitvin commented 3 years ago

More information is needed. Perhaps you can provide a small reproducible example with a dummy dataset? Which function call results in this infinite loop? The only component that relies on the Java GC is the HDFS driver (if you are using the Java-based HDFS driver); otherwise, I am not sure which GC is emitting these log messages.
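
If the GC messages do come from the Java-based HDFS driver, one thing worth checking is which driver the reader was created with. A minimal sketch, assuming the data lives on HDFS and that `make_reader` is used with its `hdfs_driver` argument (the dataset URL and usage in `model.fit` are placeholders, not taken from this issue):

```python
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

# 'libhdfs' is the JNI/Java-based driver (runs inside a JVM, so Java GC applies);
# 'libhdfs3' is the C++ driver and does not involve the Java GC at all.
with make_reader('hdfs://namenode:8020/path/to/dataset',  # placeholder URL
                 hdfs_driver='libhdfs3') as reader:
    dataset = make_petastorm_dataset(reader)
    # ... feed `dataset` into model.fit(...) as usual
```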

mirik123 commented 3 years ago

It can be caused by the tf.keras.layers.experimental.preprocessing.Discretization layer. Replace it with sklearn.preprocessing.KBinsDiscretizer, applied outside of the model, and the training will run much faster.
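
A minimal sketch of that idea, assuming a single numeric feature that was previously bucketized by a Discretization layer inside the model (the `train_df` DataFrame, the `age` column, and the bin count are placeholders, not taken from this issue):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Fit the discretizer once, outside the Keras model, instead of
# adapting a Discretization layer inside the model graph.
ages = np.asarray(train_df['age'].values).reshape(-1, 1)  # placeholder column
discretizer = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
discretizer.fit(ages)

# Transform before building the tf.data pipeline; the model then
# consumes the precomputed bucket indices directly.
train_df['age_bucket'] = discretizer.transform(ages).astype('int64').ravel()
```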