pySpark job freezes - Githubissues

For the third time in a row it hangs in the same place. Sometimes it just freezes, sometimes end up flooding org.apache.spark.network.server.TransportChannelHandler errors. It also bothers me because after each restart of the Job bucket increases in size and it is not clear to me, either he downloads what was unavailable the last time, or some action repeats and downloads the same files.

log1

total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3758 - count: 9976071
worker  - success: 0.845 - failed to download: 0.139 - failed to resize: 0.016 - images per sec: 24 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3755 - count: 9986071

log3

total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3728 - count: 9972142
worker  - success: 0.847 - failed to download: 0.134 - failed to resize: 0.019 - images per sec: 1 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 580 - count: 9982142

log3

total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3747 - count: 9892144
worker  - success: 0.853 - failed to download: 0.133 - failed to resize: 0.014 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3741 - count: 9902144
worker  - success: 0.852 - failed to download: 0.130 - failed to resize: 0.018 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3742 - count: 9912144
worker  - success: 0.852 - failed to download: 0.133 - failed to resize: 0.015 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3745 - count: 9922144
worker  - success: 0.847 - failed to download: 0.138 - failed to resize: 0.015 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3740 - count: 9932144
worker  - success: 0.855 - failed to download: 0.127 - failed to resize: 0.018 - images per sec: 25 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3740 - count: 9942144
worker  - success: 0.849 - failed to download: 0.135 - failed to resize: 0.016 - images per sec: 23 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3738 - count: 9952144
worker  - success: 0.848 - failed to download: 0.137 - failed to resize: 0.016 - images per sec: 27 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3731 - count: 9962144
worker  - success: 0.844 - failed to download: 0.141 - failed to resize: 0.015 - images per sec: 24 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3728 - count: 9972144
worker  - success: 0.852 - failed to download: 0.130 - failed to resize: 0.018 - images per sec: 21 - count: 10000
total   - success: 0.851 - failed to download: 0.133 - failed to resize: 0.016 - images per sec: 3624 - count: 9982144
22/10/04 18:52:28 WARN org.apache.spark.network.server.TransportChannelHandler: Exception in connection from /10.128.15.220:53598
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)

kakaobrain / coyo-dataset

pySpark job freezes #5