
Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0

KITE-1053: Fix int overflow bug in FS writer. #403

Closed rdblue closed 9 years ago

rdblue commented 9 years ago

Keeping the number of records written in an int caused a bug: writing more than Integer.MAX_VALUE records (~2B) overflows the counter, and the check for whether any records had been written then fails because the count is negative. The fix is to track the count in a long.
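The failure mode can be sketched in a few lines (a minimal, self-contained illustration, not the actual FileSystemWriter code):

```java
public class CounterOverflowSketch {
    public static void main(String[] args) {
        // Simulate incrementing an int counter past Integer.MAX_VALUE (~2.1B).
        int count = Integer.MAX_VALUE;
        count++; // wraps to Integer.MIN_VALUE
        System.out.println("int count after overflow: " + count); // -2147483648
        // A "did we write anything?" check like (count > 0) now fails,
        // so the data file would be discarded instead of committed.
        System.out.println("looks empty: " + (count <= 0)); // true

        // The fix: a long counter has room for ~9.2 quintillion records.
        long fixedCount = (long) Integer.MAX_VALUE + 1;
        System.out.println("long count: " + fixedCount); // 2147483648
        System.out.println("looks empty: " + (fixedCount <= 0)); // false
    }
}
```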

rdblue commented 9 years ago

@mkwhitacre, could you review this? It fixes an issue where too many records written from a single reducer will cause data files to be discarded instead of committed.

mkwhitacre commented 9 years ago

Change looks fine. For my own knowledge, can you provide a link to the code that checks whether any records were written?

rdblue commented 9 years ago

Yeah, the check is here: https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/FileSystemWriter.java#L196

When debug logging is on, we see that the else case is taken with the log message: DEBUG [main] org.kitesdk.data.spi.filesystem.FileSystemWriter: Discarded hdfs://.../.33c9684a-f4e9-432b-a003-2efaccfaa0d4.avro.tmp (-1969937604 entities)
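As a sanity check, that negative entity count is consistent with the counter having wrapped. Assuming exactly one 32-bit wrap (an assumption, since the log alone can't distinguish multiple wraps), the real count can be recovered by adding 2^32:

```java
public class RecoverWrappedCount {
    public static void main(String[] args) {
        int logged = -1969937604; // the "entities" value from the DEBUG message
        // Assuming a single 32-bit wrap, undo it by adding 2^32.
        long actual = (long) logged + (1L << 32);
        System.out.println(actual); // 2325029692: ~2.3B records, just past Integer.MAX_VALUE
    }
}
```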

mkwhitacre commented 9 years ago

Since the value of count doesn't really matter (it's mostly used for logging), would tracking whether data was written as a boolean be safer, avoiding any overflow errors? The boolean would be in addition to the count kept for debug purposes.
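A sketch of that alternative (hypothetical names, not code from this PR): a boolean drives the commit/discard decision, and the counter is kept only for logging:

```java
public class BooleanTrackerSketch {
    private boolean wroteData = false; // drives the commit/discard decision
    private long count = 0;            // kept only for debug logging

    void recordWritten() {
        wroteData = true;
        count++;
    }

    boolean shouldCommit() {
        return wroteData; // immune to any counter overflow
    }

    public static void main(String[] args) {
        BooleanTrackerSketch w = new BooleanTrackerSketch();
        System.out.println(w.shouldCommit()); // false: nothing written yet
        w.recordWritten();
        System.out.println(w.shouldCommit()); // true
    }
}
```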

rdblue commented 9 years ago

@mkwhitacre, yes and no. Yes, it would technically be safer. But I don't think Java can handle files larger than Long.MAX_VALUE bytes, which means that if you were overflowing a long record count, you'd also be unable to write anything more to the file.

In addition, check out #386 that adds size and time-based file rolling to the internal writers. That requires having the count so I'd rather not remove it to add it back later.

mkwhitacre commented 9 years ago

Ok, thanks. #386 and the ability to specify the number of writers per partition should hopefully let me eliminate some custom code that tries to distribute data more evenly across shuffles while still producing appropriately sized files.

rdblue commented 9 years ago

@mkwhitacre, great! Feel free to review that one, too.

So this one is good to go?

mkwhitacre commented 9 years ago

+1

rdblue commented 9 years ago

Thanks for reviewing this, @mkwhitacre!