qsbase / qs

Quick serialization of R objects

Large Data Objects Not Always Saving Correctly #45

Closed RSchwinn closed 3 years ago

RSchwinn commented 3 years ago

I'm not sure how to provide a MWE, but I can describe the situation. Before I do, I want to tell you how much I love this package. It's extraordinary! As you know, it outperforms everything* in terms of speed while maintaining ancillary attributes.

Here is the issue: I'm working with data frames containing between 3 and 150 million rows and around 40 variables on Linux. Everything works fine using RDS. Things usually go well when I use the much faster qs format. However, once in a while, out of the blue, I'll save a file and it will be smaller than expected. When I open one of these smaller-than-expected files, I get the following error:

```
Error in c_qread(file, use_alt_rep, strict, nthreads) : QS format not detected
```

Is this user error? If you need any other info to troubleshoot, please let me know.

traversc commented 3 years ago

There are 4 magic bytes written at the start of every qs file, 0xBEAC.

That error message means those bytes were not seen. It could be user error; it's hard to see how it could be a bug, but I guess it could be ;)

Can you upload one of the files with an error?

Some possibilities on your end:

- If you try to qsave to the same file at the same time from different threads, this could happen.

- If you forgot to change your function, e.g. you did something like `saveRDS(mydata, "myfile.qs")`
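Both of these failure modes would leave a file whose header doesn't match the qs magic bytes. A quick way to triage a suspect file without calling `qread` is to look at its first bytes directly. A minimal sketch; `inspect_header` is an illustrative name, not part of the qs package:

```r
# Read the first few raw bytes of a file so they can be compared against
# the qs magic bytes mentioned above. A file written by saveRDS() with the
# default settings will instead start with the gzip magic bytes 0x1F 0x8B.
inspect_header <- function(path, n = 8) {
  con <- file(path, "rb")   # open in binary mode
  on.exit(close(con))       # always close the connection
  readBin(con, what = "raw", n = n)
}

# Example: inspect_header("myfile.qs")
```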

RSchwinn commented 3 years ago

> If you try to qsave to the same file at the same time from different threads, this could happen.

I think it might be related to this comment. I will explore further and try to put together a short script, defining some random data, that reproduces this behavior.

Thanks for your speedy response!

traversc commented 3 years ago

Hey @RSchwinn, was there any further insight from your exploration? If it's possible to send me one of the broken files, I might be able to figure out what happened.

RSchwinn commented 3 years ago

Unfortunately, I cannot share the files. But the good news is that I've only been able to reproduce the issue once. I think it occurs when overwriting an existing file that is also being read by another process. So, user error.
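If overwriting a file in place while another process reads it is the culprit, a write-to-temp-then-rename pattern avoids the problem: a concurrent reader sees either the complete old file or the complete new one, never a partial write. A minimal sketch, assuming the qs package is installed; `atomic_save` and its arguments are illustrative names:

```r
# Write to a temporary file in the target directory, then rename it over
# the destination. file.rename() is atomic when source and destination are
# on the same filesystem, so readers never observe a half-written file.
atomic_save <- function(object, path, save_fun = qs::qsave) {
  tmp <- tempfile(tmpdir = dirname(path))
  save_fun(object, tmp)
  if (!file.rename(tmp, path)) {
    unlink(tmp)
    stop("could not rename ", tmp, " to ", path)
  }
  invisible(path)
}

# Example: atomic_save(mydata, "myfile.qs")
```

Passing a different `save_fun` (e.g. `saveRDS`) gives the same safety for other formats.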

Do you have any plans to extend compatibility to Python? qs far outperforms Parquet and all other comparable formats; it should be the ZIP for data.

traversc commented 3 years ago

Thanks. I do have plans to extend to Python, but it's a ways off. I'll have to think very carefully about how to do this, since data structures in Python and R are quite different.