pibion commented 4 years ago

The data-format issue that Kaitai addresses is incredibly important for scientific data. But most of the Kaitai tools don’t handle GB-scale files efficiently, or use data structures that are efficient for typical queries on these datasets.

I’m interested in applying for grants to fund work to make Kaitai fit more scientific-data use cases. For example, right now a student is working on adding scikit-hep’s awkward1.0 as a Kaitai target language.

Does a potential influx of two to three scientists working on developing some aspects of Kaitai fit into how you’d like to see Kaitai developed?

Would the Kaitai project leaders be interested in meeting and having a discussion about possible collaboration on grants?

KOLANICH commented 4 years ago

7 (parsing a small struct in a 3 GiB file consumes more than 12 GiB of RAM; the struct is just metadata containing the boundaries of 7z files in a big file, and the actual data occupying the majority of the file is not read; this is clearly unacceptable) may be relevant to the task.

IDK exactly, since no one has implemented it and tested it in practice, but to me "laying out structures over memory-mapped files" sounds like a prereq for "storing data on the medium, not in RAM", which sounds like an absolute prereq for "handling large data files" (of course we can load the chunks ourselves using read, but that feels more complex, more fragile, and less performant, though it may give a slightly smaller memory footprint in some cases).
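As a rough illustration of "laying out structures over memory-mapped files", here is a minimal Python sketch using only the standard library. The header layout (`magic`, payload offset, payload length) and the helper names `write_sample` and `read_header` are invented for this example and are not part of Kaitai or any real format; it only shows that a small header can be decoded from a mapping without reading the payload into memory.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical header layout, invented for this sketch:
# magic (4 bytes), payload offset (uint64), payload length (uint64), little-endian.
HEADER = struct.Struct("<4sQQ")


def write_sample(path, payload_len):
    """Write a header followed by a payload of zero bytes."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(b"DEMO", HEADER.size, payload_len))
        f.write(b"\x00" * payload_len)


def read_header(path):
    """Map the file and decode only the header; the payload is not copied."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            magic, offset, length = HEADER.unpack_from(mm, 0)
            # Slicing copies only the requested bytes out of the mapping.
            first_bytes = mm[offset:offset + 4]
    return magic, offset, length, first_bytes


path = os.path.join(tempfile.mkdtemp(), "sample.bin")
write_sample(path, 1_000_000)
magic, offset, length, first_bytes = read_header(path)
```

With a mapping, the operating system pages data in on demand, so only the pages actually touched (here, the header and four payload bytes) need to be resident.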

pibion commented 4 years ago

@KOLANICH I suspect that you're exactly right that some of the features we need require memory-mapping files. In some cases we want to read only select portions of a file and memory mapping (or at least partial mapping) seems like a prerequisite.

Several physics libraries do manage this by providing a read function that allows loading chunks into memory. It would be nice to avoid but it is a somewhat-workable option for our community.
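A chunked read function of the kind described here might look roughly like the following Python sketch. The name `read_chunks` and its parameters are hypothetical and not taken from any particular physics library; an in-memory stream stands in for a large file.

```python
import io


def read_chunks(stream, start, length, chunk_size=64 * 1024):
    """Yield chunks of at most chunk_size bytes from [start, start + length)."""
    stream.seek(start)
    remaining = length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # reached the end of the stream early
            break
        remaining -= len(chunk)
        yield chunk


# An in-memory stream stands in for a large file on disk.
stream = io.BytesIO(bytes(range(256)) * 4)  # 1024 bytes
chunks = list(read_chunks(stream, start=100, length=300, chunk_size=128))
sizes = [len(c) for c in chunks]
```

Only one chunk is held in memory at a time when the generator is consumed lazily, at the cost of the caller managing offsets explicitly.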

My group is starting out by adding a columnar data store target (scikit-hep’s awkward1.0) to Kaitai. This iteration reads the entire file into memory, which is okay for some scientific data.

But there are lots of GB-scale files out there that we’d like to be able to read (or read selectively) in Python and scan through with the web and Ruby viewers, so memory mapping seems like something we’ll have to do.

GreyCat commented 4 years ago

@pibion Apologies for the late reply; unfortunately, I get very little time to spend on KS nowadays.

> But most of the Kaitai tools don’t handle GB-scale files efficiently,

That is true, and for some tools (like WebIDE) it is not likely to change due to how browsers' local storage works. Probably we can plan some of the relevant work for more desktop-based tools (e.g. ksv, kaitai_struct_gui, etc).

> or use data structures that are efficient for typical queries on these datasets.

Overall, the idea of KS is to describe the structure of the data, not to generate the nicest API that will be "easiest to use", "most performant", etc, as the majority of these asks are specific to a particular task, and not to the structure of the format itself. That said, I agree that there's massive room for improvement in terms of adding certain hints to the compiler to generate more optimal code.

> Does a potential influx of two to three scientists working on developing some aspects of Kaitai fit into how you’d like to see Kaitai developed?

Any kind of contribution would be great; the only problem that I see is that I personally won't be able to spend a lot of time reviewing / curating these contributions. We had some previous contribution attempts from unmotivated students, and these, unfortunately, didn't go so well.

> Would the Kaitai project leaders be interested in meeting and having a discussion about possible collaboration on grants?

We can plan a voice chat if you want. Please contact me at greycat.na.kor@gmail.com if you want to arrange that.

pibion commented 4 years ago

@GreyCat this is a very prompt reply for my community :)

> That is true, and for some tools (like WebIDE) it is not likely to change due to how browsers' local storage works. Probably we can plan some of the relevant work for more desktop-based tools (e.g. ksv, kaitai_struct_gui, etc).

Targeting desktop tools is exactly what we had in mind. A GB-enabled WebIDE would be amazing, but that's beyond our immediate scope.

> Overall, the idea of KS is to describe the structure of the data, not to generate the nicest API that will be "easiest to use", "most performant", etc, as the majority of these asks are specific to a particular task, and not to the structure of the format itself.

This is exactly what drew me to KS initially. The only other descriptive data format I've encountered is DFDL, which uses XML rather than YAML. I find that DFDL isn't as readable as Kaitai (although it's not so bad once you get used to it). But the real issue is that DFDL isn't designed to support multiple languages like Kaitai is, and we physicists love our C++ and Python.

Currently I have a student who's working on a new compiler target that generates C++ code that stores data in scikit-hep's Awkward Arrays rather than in C++ objects. As you mention, this is probably useful for only some people: those with data that represents many discrete "events" might find it useful.
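To make the "arrays rather than objects" idea concrete, here is a small Python sketch of a columnar layout for variable-length events, using only the standard library. It is not the Awkward Array API; the function names `to_columnar` and `event` are invented for this example. The assumption is that each event is a list of floating-point values, stored as one flat `content` array plus an `offsets` array marking event boundaries, which is similar in spirit to how Awkward Array represents jagged lists.

```python
from array import array


def to_columnar(events):
    """Flatten variable-length events into an offsets array and a content array."""
    offsets = array("q", [0])  # offsets[i]..offsets[i + 1] delimits event i
    content = array("d")       # all values from all events, back to back
    for hits in events:
        content.extend(hits)
        offsets.append(len(content))
    return offsets, content


def event(offsets, content, i):
    """Return the values of event i as a slice of the flat content array."""
    return content[offsets[i]:offsets[i + 1]]


events = [[1.5, 2.5], [], [3.0, 4.0, 5.0]]
offsets, content = to_columnar(events)
```

Here `offsets` holds `[0, 2, 2, 5]`, so the empty second event costs one extra integer rather than a separate object, and a scan over all values touches a single contiguous buffer.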

kaitai-io / kaitai_struct

Kaitai and scientific data #711