There's too much stuff in IBC files

edwinb commented 9 years ago

Currently, IBC files contain everything that is generated by elaboration - that's all the typechecked data types and functions, as well as recording the original definitions, metavariables, what's on each line for interactive editing, documentation, etc.

All of this stuff is useful to have available, but the problem is that when it's not used (which is most of the time) it really hurts both in memory usage and loading time. Currently, even when elaborating from scratch, rebuilding an IBC is the biggest bottleneck in elaboration (that will happen to some extent of course since importing will load the IBC).

I'd suggest splitting it into two files - one containing the things which are absolutely necessary when importing (that's types, definitions and other bits and pieces to help with elaboration) which is always loaded on 'import', and one containing the things that are useful at the REPL, which will only be loaded on demand.

My recent adventures in profiling suggest this will be a big win. We really are eating quite a lot of memory at the minute and I think a lot of it is unnecessary!

(While we're at it, it might be good to put generated files in a different directory rather than dumping them with the source. That's probably a separate issue though.)

edwinb commented 9 years ago

I should add that I also suspect there to be huge loss of sharing when writing out IBC files. Terms by their dependently typed nature have lots of shared subterms, and no attempt is made to deal with this when writing out, so when read back in, sharing has gone.

I haven't measured either of these effects precisely, by the way, so take with a pinch of salt. All I know is that 'get' in Binary/IBC is allocating all the memory. That is not at all surprising, of course, but it's still worth looking for places to reduce it.

edwinb commented 9 years ago

Also, prodding with different profiling options, I see most memory is eaten by Text and Bytestrings (most likely in variable names but I haven't poked that deeply). I suspect a massive loss of sharing here so we need to work harder to make sure strings are shared.

By the way, I'm just leaving notes here - I don't have time to poke further at the minute but I don't want to forget what I've found so far... if anyone finds anything from some profiling that's eating too much please also make a note here.

melted commented 9 years ago

It should be possible to improve the performance without discarding data from IBC files. The expanded IBC files in libs are 24,2 MB (I shut off the compression to get a sense of how much is in there), compiling a trivial Idris program churns through 1,9 GB, nearly all of it in loadModule, A heap profile shows heap objects from Core/Binary.hs and IBC.hs that are hundreds of kb, some over a meg.

idris-0.9.17.1:Idris.Core.TT.UN 1816512
idris-0.9.17.1:Idris.Core.TT.P  3603328

Puzzling. On the positive side, there's some room for improvement here.

melted commented 9 years ago

Hey, that gives me an idea! https://github.com/idris-lang/Idris-dev/pull/2159

edwinb commented 9 years ago

There's some other tricks that would be nice but I don't have time to explore further. For example, the vast majority of stuff in the prelude (and indeed most imported libraries) is never looked at, so it would be nice if it wasn't loaded until needed.

Last time I did a heap profile was a while ago, in an effort to solve some memory leaks - back then it was mostly TT that ate all the memory iirc.

On 19/04/2015 00:00, Niklas Larsson wrote:

Hey, that gives me an idea! #2159 https://github.com/idris-lang/Idris-dev/pull/2159

— Reply to this email directly or view it on GitHub https://github.com/idris-lang/Idris-dev/issues/1826#issuecomment-94200167.

melted commented 9 years ago

I messed about with this a bit more. I tried to change every name to a Symref and append a map with the names to the file. This failed because Uniplate couldn't walk the parse tree as PDirective contains Idris () and that doesn't derive Data and can't be easily made to as it contains IO. That should probably change anyway, it would be nice to be able to use Uniplate with the parse tree, but I didn't want to put effort into such a hack because it would solve none of the other problems.

I will make another try, this time I'm going to make data structures that represent the disk format for each thing that is saved, using Generic so I don't have to write the Data.Binary stuff. The big thing is obviously to change every string into a reference into a string table. But also make sure that every field has a word size, so that 32 and 64-bit computers use the same file format.

And then I thought about how to solve the being able to access the file randomly. I was halfway to thinking up a new file format when I realized that the best thing to do is to copy Java's .jar and Android's .apk, make it a zip-file and have the parts be files in it. It would also make it easy to have several modules sharing a file, just put folders in there. So I looked up zip handlers in Haskell and found zip-archive, it requires unix, but it doesn't actually seem to need it for anything important. It's just one Haskell file, so it could easily be imported until a fix can be sent upstream.

I will put some effort into this this week or next.

Also, I wonder if it wouldn't be a good idea to make a JSON or XML version of the format. Mainly to be able to manipulate it with existing tools.

melted commented 9 years ago

zip-archive didn't need unix, it just said so on hackage

MaxOw commented 9 years ago

How about using sqlite?

I've recently seen this talk and have been thinking about using it in this types of situations more.

ahmadsalim commented 8 years ago

@edwinb Is still an issue with the new module system?

idris-lang / Idris-dev

There's too much stuff in IBC files #1826