Hey @OscarBarreraGithub!
Could you elaborate a little on what you're suggesting here? The Parquet files are packed in shuffled chunks of events and are not designed to provide fast, random access on a per-row level, which is what the SQLite format is well suited for.
Hey @RasmusOrsoe,
Ah, I see. I was under the impression that the benefit of Parquet is that we can work with larger files (since they take up roughly 1/7th of the space of .db SQLite files). However, this then means that we can no longer batch by file, as doing so becomes too computationally expensive - especially for very high energy events.
My current workaround is to chunk my large Parquet files in a preprocessing step and then feed this directory to the trainer. Is this the optimal way, rather than batching by the number of events in each file?
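For reference, a minimal sketch of that chunking step, assuming pyarrow; the paths and chunk size are placeholders, and in practice you would also want to keep all rows of a given event in the same shard:

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

# Split one large Parquet file into smaller shards so each shard stays
# cheap to load and train on. Paths and batch_size are hypothetical.
Path("chunks").mkdir(exist_ok=True)
source = pq.ParquetFile("big_events.parquet")
for i, batch in enumerate(source.iter_batches(batch_size=100_000)):
    shard = pa.Table.from_batches([batch])
    # Note: a plain row-count split can break an event across shards;
    # a real preprocessing step should group rows by event id first.
    pq.write_table(shard, f"chunks/part_{i:04d}.parquet")
```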
The trade-off between parquet and sqlite in our context is basically this:
SQLite provides a single (sometimes a few) uncompressed file(s) with very fast, random access to rows. This means that the resulting Dataset class points to individual event ids and streams them individually as you train. So it provides you with the ability to dynamically change which part of the dataset you'd like to train on. I.e. you can bundle muons and neutrinos together in the database and choose as you go whether you want to train on the full dataset or a subsample of it. This format provides the fastest random access with the smallest memory footprint and is quite useful for downstream analytics/plotting.
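As an illustration of that random access, here is a minimal sketch using plain sqlite3; the table and column names (`pulses`, `event_no`) and the example event ids are placeholders, not the exact GraphNeT schema:

```python
import sqlite3

con = sqlite3.connect("events.db")  # hypothetical database path
# Fast per-event lookup relies on an index over the event id column.
con.execute("CREATE INDEX IF NOT EXISTS idx_pulses_event ON pulses (event_no)")

def read_event(event_no):
    """Fetch all pulse rows belonging to a single event."""
    cur = con.execute("SELECT * FROM pulses WHERE event_no = ?", (event_no,))
    return cur.fetchall()

# Dynamic sub-sampling: decide as you go which event ids to train on.
selection = [42, 1337, 90210]  # e.g. only the neutrino events
batch = [read_event(e) for e in selection]
```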
The Parquet converter gives you many compressed (~8x smaller than SQLite) parquet files, each of which is a random sample of the entire dataset you converted. Because you do not have fast random access to the individual events, this format does not allow you to dynamically change which events you train on, but on the other hand it allows you to train on datasets that would take up many TBs of space in SQLite format. Instead, you load one batch at a time and train on its contents sequentially, meaning that the memory footprint is higher than for the SQLite alternative. In the Parquet dataset, you can choose to train on all batches or only some of them.
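For comparison, a sketch of that sequential access pattern, assuming pyarrow and a directory of shard files; the directory name and the `event_no` column are placeholders:

```python
from pathlib import Path

import pyarrow.parquet as pq

shard_dir = Path("dataset_parquet")  # hypothetical directory of shards
for shard in sorted(shard_dir.glob("*.parquet")):
    # Each shard is a shuffled chunk of events; it is decompressed and
    # held in memory as a whole, then consumed sequentially.
    df = pq.read_table(shard).to_pandas()
    for event_no, event in df.groupby("event_no"):  # placeholder id column
        ...  # build graph / feed to model
```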
We would have strongly preferred to provide random access to events in the parquet files as well, but I was unable to provide that access at speeds sufficient for real-time streaming of the data. If you have a local version that appears to be able to do this, I would very much like to see the details :-)
I hope this was helpful. More information here: https://graphnet-team.github.io/graphnet/datasets/datasets.html#sqlitedataset-vs-parquetdataset
The `_calculate_sizes` function (which calculates the number of events in each batch) within the `ParquetDataset` class calculates the batch size by appending the length of each file inside the batch. It would be useful to batch by event rather than by file, so we can process high energy events (which have many rows per event) without manually chunking the .parquet file beforehand (.parquet is well suited to handle large files anyway).

I am working on a fix by updating the way `query_table` batches events.
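A rough sketch of what batching by event rather than by file could look like; this is not the actual fix, and the `event_no` column name and row budget are assumptions:

```python
import pyarrow.parquet as pq

def plan_batches(path, max_rows=50_000):
    """Pack events into batches capped by total row count."""
    # Rows per event, read from the id column only to keep I/O small.
    counts = (
        pq.read_table(path, columns=["event_no"])
        .to_pandas()["event_no"]
        .value_counts(sort=False)
    )
    batches, current, current_rows = [], [], 0
    for event_no, n_rows in counts.items():
        # Start a new batch once adding this event would exceed the budget.
        if current and current_rows + n_rows > max_rows:
            batches.append(current)
            current, current_rows = [], 0
        current.append(event_no)
        current_rows += n_rows
    if current:
        batches.append(current)
    return batches  # list of event-id lists, each within the row budget
```

This keeps very high energy events from blowing up a batch, since the budget is expressed in rows rather than in number of files or events.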