scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License

Inconsistent Filesizes with .awkd Files #246

Open wctaylor opened 4 years ago

wctaylor commented 4 years ago

I am seeing behavior that I don't understand when saving collections of JaggedArrays with the awkward.save() function.

I have an initial set of arrays each with an outer dimension of 10,000. I save those arrays with

data_dict = {"field1": array1, "field2": array2, ... , "fieldN": arrayN}
awk.save("filename.awkd", data_dict, mode="w")

The resulting filesize is about 280 MB.

I then want to filter out events from those arrays. As an example, let's say I want the first 10 events.

events = numpy.arange(10)
data_dict = {"field1": array1[events], "field2": array2[events], ... , "fieldN": arrayN[events]}
awk.save("filename.awkd", data_dict, mode="w")

This produces a filesize of about 280 kB, which makes sense: I've selected 1/1000 of the events, so the filesize is about 1000x smaller.

However, now I instead select a more distributed set of 10 events.

events = numpy.arange(0, 1000, 100)
data_dict = {"field1": array1[events], "field2": array2[events], ... , "fieldN": arrayN[events]}
awk.save("filename.awkd", data_dict, mode="w")

The resulting filesize is now back to the original 280 MB.

Is this behavior expected, or am I doing something wrong? When I load the data back, I only seem to have access to the events I selected, but the increased filesize gives me memory issues (on larger files) when I try to concatenate only the filtered events.

How can I achieve saving a small subset of events for later concatenation?

jpivarski commented 4 years ago

I think the problem is that this serialization is a somewhat naive snapshot of what's in memory, so if there are unreachable elements, they're still written. There isn't an additional pass to look for what can be compacted before writing.
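
To make that concrete, here's a minimal sketch (awkward 0.x; the array contents are made up for illustration) showing that fancy indexing keeps the full content buffer, which is what gets written:

import numpy
import awkward as awk  # awkward 0.x

# build a jagged array of 10,000 events with 3 floats each
big = awk.fromiter([[float(i)] * 3 for i in range(10000)])

# select 10 scattered events: new starts/stops over the same content buffer
small = big[numpy.arange(0, 1000, 100)]

print(len(small))          # 10 events are reachable
print(len(small.content))  # but the content buffer still holds all 30,000 items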

Serialization is one of the things that's lagging in Awkward 1: there are a lot of serialization protocols for this sort of data, and it might be a mistake for me to introduce another one. This .awkd file format is the only protocol guaranteed to save all data about an Awkward Array, but as you've noted, you don't always want to save all data. Does the Parquet file format save everything that you need? If you have Lorentz vectors or something, it currently won't save those, but I'm figuring out how to use "application metadata" to include such things.

The Parquet format is considerably more compact. There's also Arrow, but that's a wire protocol, not a file format (though there's nothing stopping you from putting the serialized Arrow data in files).

wctaylor commented 4 years ago

I've never used Parquet, so I don't know much about it. Correct me if I'm wrong, but from the awkward documentation, it doesn't look like I could save a collection of different JaggedArrays all to the same Parquet file. It seems one Parquet file = one JaggedArray. Is that right?

jpivarski commented 4 years ago

No, you can zip them into a Table and save the Table of JaggedArrays. In fact, without being a Table, I think we'd have to invent a fake column name for the single JaggedArray; Parquet is usually used for sets of arrays. (The Awkward documentation shows single array ↔ Parquet file examples because it's assumed that you've combined the arrays into a Table.)

The arrays need to have the same length (len), but the number of items in each element does not need to be the same. (That is to say, the same number of events, but different numbers of particles.) Parquet is intended for columnar datasets with nested structure.
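
As a minimal sketch (awkward 0.x with pyarrow installed; the column data here is made up, and toparquet/fromparquet are the awkward0 Parquet helpers):

import awkward as awk  # awkward 0.x, with pyarrow installed

# two jagged columns: same number of events (3), different numbers of items
array1 = awk.fromiter([[1.1, 2.2], [], [3.3]])
array2 = awk.fromiter([[1], [2, 3], [4, 5, 6]])

table = awk.Table(field1=array1, field2=array2)
awk.toparquet("filename.parquet", table)    # one file, many columns
back = awk.fromparquet("filename.parquet")  # columns come back by name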

Another thing I thought of after sending yesterday's answer: ROOT's RNTuple would also be capable of storing this information, but the Python reader/writer for RNTuple is still under development, so that doesn't help you now.

Also in development: Awkward 1's from_arrow and to_arrow are complete, though the final step to Parquet is not. We actually go through Arrow to read and write Parquet, so this is also a nearby option ("nearby" in the sense that taking the last step is not too hard: https://arrow.apache.org/docs/python/parquet.html). See also ak.from_awkward0 and ak.to_awkward0 (https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_awkward0.html).
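
Taking that last step would look roughly like this sketch (assuming early Awkward 1, imported as awkward1, where to_arrow returns a pyarrow array; the column name "events" is made up):

import awkward as awk   # awkward 0.x
import awkward1 as ak   # early Awkward 1
import pyarrow as pa
import pyarrow.parquet as pq

old = awk.fromiter([[1.1, 2.2], [], [3.3]])  # an awkward0 JaggedArray
new = ak.from_awkward0(old)                  # convert to Awkward 1
arrow_array = ak.to_arrow(new)               # Arrow representation

# wrap the Arrow array in a one-column pyarrow Table and write Parquet
pq.write_table(pa.table({"events": arrow_array}), "filename.parquet")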

wctaylor commented 4 years ago

Thanks for the response! I may give that a try later. For now, it seems that just creating a new JaggedArray from the old one and saving the new one does indeed filter out what I'm looking for and reduces filesizes as expected. Something like

events = some_selection_cut
data_dict = {"field1": awkward.JaggedArray.fromiter(array1[events]), 
                    "field2": awkward.JaggedArray.fromiter(array2[events]), 
                    ... , 
                    "fieldN": awkward.JaggedArray.fromiter(arrayN[events])}
awk.save("filename.awkd", data_dict, mode="w")

So I think I at least have a solution to my original problem, but I am still slightly curious about the behavior when passing arrays of event indices: mainly, why contiguous event numbers do seem to drop the filesize. I checked whether it was related to the highest index selected, but that doesn't seem to be the case.

From before:

# 10 events out of 10,000, expect 1/1000 filesize, and that is what we see
events = numpy.arange(10)
data_dict = {"field1": array1[events], "field2": array2[events], ... , "fieldN": arrayN[events]}
awk.save("filename.awkd", data_dict, mode="w")

vs

# Still 1/1000 of events, but might expect 1/500 in case it needs to record everything up to index 20
# Instead see the full filesize again
events = numpy.arange(0, 20, 2)
data_dict = {"field1": array1[events], "field2": array2[events], ... , "fieldN": arrayN[events]}
awk.save("filename.awkd", data_dict, mode="w")

Not the biggest deal, but I don't know if it's expected.

jpivarski commented 4 years ago

It's expected; in some cases, it's a feature, not a bug. But we might want to document a specific "how to compact an array" recipe for this very common case of filtering to make the data smaller, rather than filtering for statistical significance. (Or for both reasons.)
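
A minimal sketch of such a compaction recipe in awkward 0.x (the helper name compact is made up; it uses only the documented counts, flatten, and fromcounts parts of the JaggedArray API):

import numpy
import awkward as awk  # awkward 0.x

def compact(jagged):
    # flatten() copies out exactly the reachable elements;
    # fromcounts() rebuilds contiguous offsets over that copy
    return awk.JaggedArray.fromcounts(jagged.counts, jagged.flatten())

big = awk.fromiter([[float(i)] * 3 for i in range(10000)])
small = compact(big[numpy.arange(0, 20, 2)])

print(len(small.content))  # 30: only the ten selected events' items remain

Unlike a fromiter round-trip, this stays in numpy, so it should scale to large arrays.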

nsmith- commented 4 years ago

Beware that awkward.JaggedArray.fromiter will be quite slow in awkward0. In general, there is a need for a "compactify" operation that would materialize all lazy take operations.