Consider parsing models with `binrw`

philpax commented 1 year ago

Something I've noticed in PRs is that we all keep reimplementing the same read/write logic for models (#83, #84, #85, #114). I'd like to minimise this duplication by using binrw to describe the file format in a declarative fashion, making it trivial to support the multiple formats and to read and write to each of them.

I believe that this should be possible and that there aren't any showstoppers, but I could be wrong about this. @KerfuffleV2 is probably the relevant domain expert on parsing models right now - would appreciate your input!

KerfuffleV2 commented 1 year ago

I wouldn't really agree about the expert part, but I can tell you what I think.

About binrw, I actually already checked it out and commented briefly: https://github.com/rustformers/llama-rs/issues/79#issuecomment-1493303627

It basically wants to be using a file that's seekable, it didn't seem like it had first class support for stuff like memory mapping.

I'd also say I don't really like the derive macro style for parsers. It's great when the format aligns nicely with your structures, but I think it's hard to work with when there's a mismatch.

And if you commit to that sort of thing, you don't necessarily know until later on that there's going to be a problem because (presumably) it worked okay for your first parser or you wouldn't have stuck with it.

In that same thread I linked to, you can see the start of an attempt to port the llama-rs module loading to using nom parsing. I don't really like nom, but it feels like it's the best thing that's available right now. It's pretty flexible, basically lets you do what you need to without many restrictions.

Having to manually thread the parsed input is irritating but there are ways to avoid it a lot of the time and it's not something that'll actually stop you from doing what you want.

I also looked at combine but it uses very complicated types/traits and it seems like it's quite hard to do something like have a parser that can handle multiple types of thing. I actually tried to use that first for my pickle parsing and gave up.

nom is what I ended up using for repugnant-pickle. (It also has pretty good support for dealing with complete input like a mmaped file or streaming.)

philpax commented 1 year ago

It basically wants to be using a file that's seekable, it didn't seem like it had first class support for stuff like memory mapping.

The documentation states that

binrw reads data from any object that implements io::Read + io::Seek, and writes data to any object that implements io::Write + io::Seek. (Unseekable streams are also supported, but require a wrapper.) This means that data can come from memory, network, disk, or any other streaming source. It also means that low-level data operations like buffering and compression are efficient and easy to implement.

which seems fine to me. We'd just memory-map the file, wrap a std::io::Cursor, and let binrw work with the cursor.

I'd also say I don't really like the derive macro style for parsers. It's great when the format aligns nicely with your structures, but I think it's hard to work with when there's a mismatch.

And if you commit to that sort of thing, you don't necessarily know until later on that there's going to be a problem because (presumably) it worked okay for your first parser or you wouldn't have stuck with it.

I've generally had a good experience with them; I find a declarative structure much easier to parse through than parsing code where it's applicable. I think it's a worthwhile experiment to see if we can fit GGML/GGMJ/GGJT into binrw and if that makes the parsing any easier to understand.

Regarding nom etc: yeah that's pretty reasonable. nom is my go-to for parsing binary formats when binrw fails me, and while I agree it has some ergonomic complications (...for want of a monad...) it's fine for this kind of thing.

If binrw falls through, I'd suggest using nom and the abstract metadata structure you suggested in https://github.com/rustformers/llama-rs/pull/114#issuecomment-1500349178 .

KerfuffleV2 commented 1 year ago

which seems fine to me. We'd just memory-map the file, wrap a std::io::Cursor, and let binrw work with the cursor.

Sorry, I wasn't clear. That's what I was talking about when I said it didn't have first class support (not working with it directly, requiring a wrapper). I wasn't saying it couldn't work at all.

(...for want of a monad...)

Tell me about it! https://github.com/KerfuffleV2/mdoexperiments

Also, I wasn't just being modest when I said I wasn't a parsing expert. I honestly haven't done much parsing stuff with binary files and Rust. So maybe my lack of enthusiasm for binrw is just fear of the unknown.

I usually like stuff I can control more directly though.

If binrw falls through, I'd suggest using nom and the abstract metadata structure you suggested

The two things aren't really mutually exclusive. Could use binrw, nom, whatever just to parse specific file format metadata and use that to build the more abstract structure I mentioned. If it contains stuff like the storage entity where a thing is and what range it exists at, then no parsing would be needed to actually suck out the tensors (or grab a memory address as the case may be).

iacore commented 1 year ago

The ggml file format is really simple to parse. Every format has <20 .read_* calls.

Adding a 3rdparty library would make the code harder to understand.

As for code reuse, using callback like LoadingProgress::* should work.

philpax commented 1 year ago

The reason I'd like to investigate a declarative format is that it makes it really simple to see the differences between the formats, and to write code that can read/write them. You can see the data as it's laid out in the file without having to read through all of the parsing code, at least in theory.

katopz commented 1 year ago

I use borsh before. Super clean, just struct, but very specific.
I try binrw but the last time i try i can't handle dynamic field size that was specific by other fields (need to cast it first with fancy fn). // maybe just me and maybe fine for our case.
so i try take a look at nom which can parse each field for but i can't figure out how to chain it to next step yet (like we can do in serde).

finally i ran out of time and end up parse it myself hard coded 😅 shame on me.

philpax commented 1 year ago

@iacore has submitted #114 which provides a non-binrw solution. I'm going to close this until we decide what to do with that PR to ensure that people don't work on this.

rustformers / llm

Consider parsing models with `binrw` #117