Memory usage/overhead? - Githubissues

ioquatix commented 8 years ago

I've been reviewing memory usage in my web framework. I'm sure you already know this but mime-types is a memory hog, accounting for 55% of my base web application memory usage.

> bundle exec derailed bundle:mem
TOP: 11.8633 MiB
  utopia: 11.0156 MiB
    mail: 9.0 MiB
      mime/types/columnar: 6.6094 MiB (Also required by: utopia)
        mime/types: 6.6055 MiB
          mime/types/registry: 6.3438 MiB
      mail/message: 0.6797 MiB
      mail/field: 0.4883 MiB
      mail/core_extensions/string/multibyte: 0.4766 MiB
        mail/multibyte: 0.4492 MiB (Also required by: mail)
    yaml: 0.5117 MiB (Also required by: mail/message)
      psych: 0.5117 MiB
  rake: 0.8203 MiB
    rake/rake_module: 0.4609 MiB
      rake/application: 0.4453 MiB (Also required by: rake)

To me, this seems a bit strange, as the mime-types.json file is only ~400Kbytes.

Is this because you are building a lot of in-memory data structures for the mime types?

If the data is 400Kbytes, and the problem is ruby overhead, what about compiling the data into a C extension and providing a query mechanism? If it's read-only constants in a C library the overhead would be reduced significantly and also shared between processes.

Perhaps you can give me some ideas of whether this memory usage is reasonable according to your own experiences and then perhaps we can brainstorm some ideas to reduce memory usage.

ioquatix commented 8 years ago

The reason why I mentioned "transactional API" in the title was because I was thinking about how to deal with memory usage. If it turns out the current overheads can't be reduced, what about implementing an API that loads and discards the data after use, e.g.

MIME::Types.database do |database|
    database.mime_types_for(".exe")
end
# After this block is done, the database is closed and all data related to it is freed.

halostatue commented 8 years ago

The 6Mb sounds a little high (it used to be ~17); there’s a bit of stuff I can do to reduce usage that I haven’t done yet (there’s an issue in this tracker about a value pool) mostly because I can’t figure out the best way to determine the memory use vs speed overhead of the pool (the pool is a self-referencing hash):

   pool = {}
   application = 'application'
   pool[application] = application

This is important because every single MIME type has a content type (application, image, text, etc.). I’ll look at a few other things.

I’ve been doing a bit of playing (I can use derailed bundle:mem with mime-types directly, which is useful), and a few various changes have reduced the memory use by about 1.5–2Mb. I need to play with the pieces to see which ones are actually effective vs ineffective, but…

ioquatix commented 8 years ago

I'm having a bit of a hack.. the results will be interesting. Generating a mime_types.c file with a static struct based on the mime-types-data gem. It's only taken about 10 minutes. If I can compile this.. it will be interesting to see what the size in memory is like. I could provide fast lookup by generating static hash tables too. We could use the same hash function as Ruby.. it just wouldn't be a Ruby data structure until you actually needed it to be so.

If you had to have preference as to the name of this gem, would you prefer mini-mime-types or mime-types-compiled (or something along those lines)

ioquatix commented 8 years ago

Just FYI, it should be possible to have a loadable .a file thats < 500Kbytes, has fast lookup, and includes all the data..

halostatue commented 8 years ago

Um…probably mime-types-mini; this is also something that I‘ve been thinking about trying to make as a pure Ruby gem. What we’d want to do is have it override the autoloading of the MIME types registry (which can currently be done with an environment variable). What I’m really looking for in mime-types 4 is to not load the registry by default, and only load it when you need it. I think that’s a year away, though.

ioquatix commented 8 years ago

Okay, this is a basic example of what I've generated..

https://github.com/ioquatix/mime-types-mini/blob/master/ext/mime_types.c

It just needs a few Ruby functions to hook it up to the existing infrastructure.

Is there any way to reuse parts of MIME::Type without pulling in the data?

ioquatix commented 8 years ago

It compiles to about 200k so I'm pretty sure this can be a significant improvement over what you've got.. What actually makes it balloon out to 6Mbytes? Is it Ruby overhead?

halostatue commented 8 years ago

It looks like it’s partially Ruby overhead. I’ve got memory use down to 2.5MiB by using the value pool and a couple of other enhancements, which is probably good enough for a point release soon if this is something that I want to pursue.

Derailed provides some useful information with bundle:objects. Before the value pool, there are 93,567 objects allocated (7.3 MiB) and 30,716 objects retained (2.7 MiB). After, there are 98,982 objects allocated (8.4 MiB) and 22,171 objects retained (2.6 MiB). Startup allocates more objects (the value pool does value.dup.freeze) and is ~5 times slower than without the value pool. I haven’t got any decent benchmarks on what the performance changes are for everything else.

One thing that looks troublesome is the use of Set, and where a more efficient data structure than Set may be desirable (about 1.2MiB of memory is retained in Set objects, compared to the 1.4MiB of memory retained in various mime-types objects: 755k for type, 360k for columnar data; 240k for container; 187k for the registry type). So after the value pool version is released, I’ll look at what I can do to reduce set. After using the value pool, Set is still about 1.2MiB, and the value pool is about 850k, but the types themselves are much lower use because of the shared resources.

ioquatix commented 8 years ago

For me, and perhaps for many others, what makes the most sense is having a static library with fast lookup and low overhead.

I've almost got it working, using perfect hashing (gperf). The problem is that it isn't easy to add things to this hash table at run time, so things like adding mime types isn't possible.

The compiled code is going to be < 500Kbytes and the run time performance for the queries I want to do should be significantly better than Ruby hash tables. It will be an interesting experiment.

As an aside it would be nice if you could make Mime::Types slightly more configurable. As it stands, it's not easy to replace Mime Types because so many libraries depend on it, but it would be nice if I could plug in Mime::Types::Mini and have it just work with all the existing code that uses the Mime::Types API.

halostatue commented 8 years ago

The changes for the value pool are here: https://github.com/mime-types/ruby-mime-types/tree/qa-tool-value-pool. It includes other stuff I was working on for a point release, so…

ioquatix commented 8 years ago

@halostatue Okay, so I've got the basics working.

The compiled database I'm generating is 200k, has almost zero load time, and does quick lookups. It's just a proof of concept.

https://github.com/ioquatix/mime-types-mini

To try it out

require 'mime/types/mini'

Mime::Types::Mini::Database.content_type_for_extension("svg")
 => "image/svg+xml"

There is a lot of flexibility here but I've just done simple K-V mapping.

I've used gperf to generate "perfect hashing" hash tables, so the hash table is customised to the data..

It's not slow.. perhaps it's not fast either, but:

Benchmark.realtime{Mime::Types::Mini::Database.content_type_for_extension("svg")}
 => 6.5820058807730675e-06

Ideas? Thoughts? Where to go from here?

ioquatix commented 8 years ago

BTW, I've released a really crappy 0.1.0 gem so you can install it as a proof of concept.

ioquatix commented 8 years ago

Okay, so did some more mucking about with this.

Firstly, I took a look at the mail gem. And a few others. I found that at least in the case of the mail gem, the use case was really limited. Given a filename, tell me the content type.

So, I think whatever I do, I'll focus on an API that works for this use case as fast and memory efficiently as possible.

I've built an API in mime-types-mini and found it's at least 4 times faster:

Ruby 2.3.0 at 2016-10-10 01:11:30 +1300
Warming up --------------------------------------
         MIME::Types    16.685k i/100ms
   Mime::Types::Mini    56.401k i/100ms
Calculating -------------------------------------
         MIME::Types    181.928k (± 5.4%) i/s -    917.675k in   5.059039s
   Mime::Types::Mini    714.074k (± 6.6%) i/s -      3.553M in   4.999965s

Comparison:
   Mime::Types::Mini:   714074.1 i/s
         MIME::Types:   181928.1 i/s - 3.93x  slower

This is with no real optimisation or anything else. It's also 20x faster to load which is pretty significant since it's going to be a large chunk of application startup time for a lot of users.

Finally, memory usage. As we already know, MIME::Types uses about 6Mbytes on average, and yo might have got that down to 2Mbytes?

mime-types-mini is using < 200Kbytes for the entire dataset.

I'm keen to push forward but I'd like to work with you to update mime-types gem.

I'm just not quite sure how to fit your existing APIs around what I see as being what people are actually doing with the gem. Here is a rough assessment:

The mail gem doesn't know how to load MIME::Types: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail.rb#L11-L16 - the odd require and fallback should never have been necessary.

The mail gem doesn't know how to handle MIME::Types: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail/attachments_list.rb#L97-L102 - the odd encoding issue and use of .first - i.e. I've got this extension.. let's basically just choose any related content-type.. is there something special about the first one?

The mail gem doesn't know how to match MIME::Type to a given record: https://github.com/mikel/mail/blob/a217776355befa3d8191c4bd3c1fad54e0e27471/lib/mail/attachments_list.rb#L61-L65 - again just using the first one that comes up. Is this the right one? It might be in some cases, and not in others. Finally, the only purpose of this is for guessing the encoding of the file.

Well, I think I'd like to look at how other projects are using MIME::Types before making a decision about the kind of API that's required. But, I think the current API is ambiguous given how it's being used and poorly aligning with requirements.

@halostatue Just so we are on the same page, I think you are an awesome developer and this critique is purely technical. Please don't take anything here personally.

halostatue commented 8 years ago

So, I have one hard requirement: nothing compiled goes into mime-types proper. Period. I don't build extension-based libraries for a lot of reasons, but mostly because doing so breaks compatibility for all Rubies and platforms.

As to your notes about how mail uses mime-types:

mime-types now uses a columnar store, not the JSON file. This capability was introduced in mime-types 2.6.1 and made default in 3, but mail support{s,ed} all versions of mime-types from 1 to 3. When going from 2 to 3, I didn’t want to break the lower-memory (columnar) use case, enabled by the requiring mime/types/columnar.
The encoding bit is news to me, because I only use UTF-8 for everything, and have since mime-types 2—so this may be compatibility with previous versions of mime-types. Also, the use of #first here is an algorithmic mistake, IMO, but it’s something that a lot of people who use mime-types do:
```
pry> MIME::Types.type_for('doc')
=> [#<MIME::Type: application/msword>,
#<MIME::Type: application/word>,
#<MIME::Type: application/x-msword>,
#<MIME::Type: application/x-word>,
#<MIME::Type: text/plain>]
```
All mime-types actually tells people is “here are some MIME content types associated with the extensions .doc”. They are sorted alphabetically, then by various other factors (and, frankly, I see a bug with the way that the priority sort is done, because it should be application/msword, text/plain, application/word, application/x-msword, and application/x-word—all registered types should be sorted before unregistered types, so it should be sorted on registered? then alphabetical). On VMS, .doc was the standard text file extension. So…using #first here is wrong but common. Using the version that is most appropriate to your use case is right. There are some thoughts I have about having additional parameters or data to try to add some real heuristics…but there’s no good usage data (yes, logically we can guess that .doc is probably application/msword, but what data can we have that allows a computer to do so?).‡
This is the same thing, but in reverse; it is a little more likely to be correct to use #first here, especially with mime-types 3 and later, since support for platform MIME types have been removed (it used to be possible to have two variants of text/plain that were differentiated by system (e.g., VMS vs not). Even now, it is possible to have two variants of text/plain:
```
MIME::Types['text/plain'].size #=> 1
t = MIME::Types['text/plain'].first
t.extensions = %w('xtx')
MIME::Types.add(t)
MIME::Types['text/plain'].size #=> 2
```
That should print a warning, but for some reason it isn’t and that appears to be a minor regression. Whether it’s a good idea or not is a totally different question.

The use of #first is problematic here, and you’ll also see it with t.extensions.first, which I have simplified to t.preferred_extension…except that there is also t.preferred_extension= allowing the mime data to explicitly include the preferred extension. The data hasn’t yet been updated to provide this, but the logic so far is “prefer the extension that is listed first”.

mime-types was created in 2003 because I needed it for a script that I was writing and I ported it from Perl. It has turned out to be very useful, and I want to keep most of the useful behaviours and make it a full consumer of the mime types data. However, over the last 13 years, how people use it does not quite match what it does. They do, however, use the mutability to add their own records (see various closed bugs in this repo before I moved data out).

I split out mime-types-data because I wanted to enable people to make their own interfaces that match what they expect using data that is in a fairly regularized format (even the columnar data is in a regularized format). I have plans on making mime query APIs in a lot of different languages that either consume the columnar data or the JSON file. There’s a lot more data provided in mime-types than people are using, and I think that the data is a good thing.

ioquatix commented 8 years ago

mime-types-data is awesome and really useful, so thanks for taking the initiative to do that.

Perhaps what would make the most sense is a light weight C (or C++ with C interface) library which could easily be invoked either via an extension or via FFI. You could use the FFI backed library where possible or a Ruby/Python implementation if that isn't possible.

I fundamentally have an issue with mime types being mutable and the mime type database being mutable at a global level. This is touched on briefly here https://github.com/mime-types/ruby-mime-types/issues/45

The way I think this should work:

module Mime
    class Type
        def initialize(content_type, encoding, extensions)
            @content_type = content_type
            @encoding = encoding
            @extensions = extensions
        end

        attr :content_type
        attr :encoding
        attr :extensions
    end

    class Container
        def initialize(next = nil)
            @next = next

            @extensions = {}
            @content_types = {}
        end

        attr :extensions
        attr :content_types

        # Add a mime type to this container
        def << mime_type
            mime_type.extensions.each do |extension|
                @extensions[extension] = mime_type
            end

            @content_types[mime_type.content_type] = mime_type
        end

        def content_type_for_extension(extension)
            @extensions.fetch(extension) { @next.content_type_for_extension(extension) }
        end

        def mime_type_for_content_type(content_type)
            @content_types.fetch(extension) { @next.content_type_for_extension(extension) }
        end

        alias [] mime_type_for_content_type
    end

    class RFCBlahData
        def initialize(next = nil)
            @next = next
        end

        # Implemented in C land or wherever.
        # def content_type_for_extension(extension)
        # def mime_type_for_content_type(content_type)
    end
end

my_mimes = Mime::Container.new(Mime::RFCBlahData)
my_mimes << Mime::Type.new("foo/bar", "binary", "bar")

foo_bar_mime_type = my_mimes["foo/bar"]
text_plain_mime_type = my_mimes["text/plain"]

Ideally the library (e.g. mail) provides a way to specify my_mimes to be used for a given mail connection, or defaults to some sane global mime container.

halostatue commented 8 years ago

There’s a lot more data than what you have specified, and given that I don’t have an automated process for updating the data (since the data is ultimately generated by humans, it requires some intervention periodically), I’m not comfortable making the registry or objects immutable as part of mime-types.

I might be able to be convinced that there can be something mutable that can be frozen after a configuration phase, but again—that would have to be a major version bump as it would be incompatible, or it would need to be a separate gem—which I would be happy to have in the mime-types organization if you want it here.

One of the things I’m stumbling toward (and have been for a while, pretty much since you submitted #45 a few years ago) is making it so that MIME::Types (the container class, not the default registry) is easier to use with a subset of the data, and maybe make it easier to provide a subset of the data as it applies to your application’s configuration without needing to load the entire 9,000 mime-type default registry. At the same time, I also have requests like #67 and (to a degree) #115 that look at adding features and details in the types and the registry. Most people use mime-types through the MIME::Types.type_for and MIME::Types.[] interfaces without even realizing that they are proxies around instance methods MIME::Types#type_for and MIME::Types#[] for the default registry.

ioquatix commented 8 years ago

There’s a lot more data than what you have specified

i totally agree on this point. But, keep in mind most usage (e.g. mail, which is a big one) don't use any of this additional data. It is a different use case which is not well served by this library. There are too many decisions which are not being made correctly (e.g. the usage of #first).

I'm still thinking about this, but a gem which focuses purely on content types and file extensions would be really useful.

Once you have the content type string, it should be easy to use the mime-types gem for further information if required.

ioquatix commented 8 years ago

So, here is my thoughts.

A new gem, content-types which focuses purely on mapping content type to file extension and file extension to content type.

By default, it doesn't include any data, but only data structures.

Then, pull in additional dependencies, e.g. content-types-iana. This gem has a native pure ruby implementation but tries to use a compiled extension if possible.

Content::Types::IANA would be it's own immutable registry, ideally using the C implementation.

The fields we have in the C registry are largely irrelevant to memory consumption, but there is a case to be made for the most common use case - e.g. don't include obsolete content types (could be a separate registry Content::Types::IANA::Obsolete for example.

I've deliberately preferred content-types over media-types because I've already implemented MediaTypes according to the relevant RFCs (including correct parsing) in http-accept gem. But, I guess this could be pulled out into a separate media-types gem. If you'd be willing to support this under the mime-types org, I'd be happy to do this with your support. We could also move the http-accept gem here too.

In my mind (and generally from RFCs), a media type is a concrete "text/plain". It's the same thing as a content type which is confused because it's the name of the header in http. A media range is "text/*" - i.e. it specifies a range of valid media types. The name "mime-type" is largely replaced with "media-type" in modern terminology. It's good that any gems we make help users understand the modern terminology. Hence it might make sense to use media-types as a gem name. Thoughts?

halostatue commented 8 years ago

Note that, for the most part, the IANA data does not include extensions. I think I’ve seen it in some of the data, and in a few RFCs, but all of the extension data that I have is pulled from other sources (including the Apache list and various PRs over the years) as well as a bit of judgement.

More people want some of the unregistered types than you think (it’s fairly easy to get an IANA type added in the */vnd.* namespace, but most people don’t bother), because it’s practical to the solutions that they want.

I’ll need to think about what you’ve written a bit more, but it sounds mostly reasonable, and content-types is a reasonable name. I’m not as sure about the separate gems for the registry; it feels unnecessary. Separate requires, sure. Separate gems?

ioquatix commented 8 years ago

I'll think a bit more about separate gems or not. I just felt like it would be nice to keep compiled code out of the main repo, but I guess it doesn't matter too much.

ioquatix commented 8 years ago

It would also mean, to a certain extent, that data and functionality could be updated independently.. which might not be a bad thing.

halostatue commented 8 years ago

Let me clarify: I am not suggesting content-types contains the compiled data. I am suggesting that having multiple data gems is the problem.

ioquatix commented 8 years ago

Okay, so you'd prefer to see content-types-data which contains, say Content::Types::Data::IANA and Content::Types::Data::Unofficial (and perhaps Content::Types::Data::All)

halostatue commented 8 years ago

There’s a minimum amount of metadata per gem, and each gem does slightly slow down startup because of extra paths, etc. Better to include all the data in one place, even if it is multiple .so/dylib files that get loaded for better space use.

Have you considered capnproto encoding data files, rather than compiling the data? You can then provide the data encoded once because it’s cross-platform (and I’m even happy to include those files as part of mime-types-data).

SamSaffron commented 7 years ago

A new gem, content-types which focuses purely on mapping content type to file extension and file extension to content type.

Note, for mini-mime I needed to do content-type -> extension and extension -> content-type lookups cause both usages were in the wild.

ioquatix commented 7 years ago

Yes both are supported

stereobooster commented 7 years ago

capnproto sounds fantastic. Any progress on that? There is some kind similar optimization done in this PR where JSON file was replaced with Protocol Buffers. Other option would be MsgPack (note ruby implementation not optimal).

ioquatix commented 7 years ago

I think a compiled solution for this data is probably the best option, it's several orders of magnitude more efficient both in space and time.

stereobooster commented 7 years ago

You need proof for such claims. Capnproto supports zero-copy and memory map, so it can be as fast as compiled code. Advantage of having data as separate file is obvious - it can be changed without compilation step

ioquatix commented 7 years ago

Yeah, and if you read earlier, you'll see this: https://github.com/ioquatix/mime-types-mini which is: 200Kbytes vs 6Mbytes of memory used, and faster too since it uses a perfect hash function.

stereobooster commented 7 years ago

did you take a time to read about capnproto? It is binary format which can be directly mapped runtime memory

ioquatix commented 7 years ago

Yes, I've mapped mesh data directly to Vulkan so it renders fast. I know my shit. Do you know what a perfect hash function is?

ioquatix commented 7 years ago

The perfect hash lookup is about 4x faster than a native Ruby hash table. So, I'm not sure what you are proposing, perhaps you can implement it and put your ideas on the table so we can compare it.

ioquatix commented 7 years ago

By the way, I'm not trying to be an asshole, I just got a bit frustrated with the direction of the conversation. I think that there is plenty here which backs up my (completely reasonable) assertions and I don't think you have a bad idea, I understand the value of not needing something to be compiled, but if you are serious about it then just implement the hell out of it, and come back here once you have some good benchmarks.

stereobooster commented 7 years ago

Yes, now I see what is the difference. I thought most of the performance/memory gain was from just the fact of using C instead of Ruby. So in this case using external binary file would make not much difference. But you went one step further and used perfect hash.

stereobooster commented 7 years ago

So again, to be clear. What I and @halostatue suggest is to use capnproto to save data in external binary file, and use your C extension to load this data (not Ruby implementation), and use general hash function to search this data. Main memory gain is from the fact that you use effective C in memory representation instead of Ruby. Hash function itself doesn't impact on memory consumption it affects on number cycles required to search data in table. Please correct me if you see flaw in what I'm saying. General hash function will perform a little bit worse than perfect hash, but I don't think this will be the problem, because problem that you address is memory no CPU cycles.

ioquatix commented 7 years ago

So again, to be clear. What I and @halostatue suggest is to use capnproto to save data in external binary file, and use your C extension to load this data (not Ruby implementation), and use general hash function to search this data.

You won't save that much memory this way, and their is a pretty big performance overhead loading the data into a Ruby hash.

Main memory gain is from the fact that you use effective C in memory representation instead of Ruby. Hash function itself doesn't impact on memory consumption it affects on number cycles required to search data in table.

The perfect hash function is a signifiant gain in memory because it essentially compresses the hash keys. The hash function itself doesn't affect memory consumption directly, but the representation in memory definitely does. Additionally, loading a dynamically shared library means it would be shared between processes as static immutable code/data. That's also a huge win.

General hash function will perform a little bit worse than perfect hash, but I don't think this will be the problem, because problem that you address is memory no CPU cycles.

The problem I address using my approach:

Memory usage significantly reduced, 200Kb vs 6+Mbytes.
Static immutable code/data is shared between processes so this is 200Kb over all ruby processes that load the same shared library.
4x improvement in performance for hash lookup using perfect hash function.
Near zero startup overhead.

stereobooster commented 7 years ago

You won't save that much memory this way, and their is a pretty big performance overhead loading the data into a Ruby hash.

Why Ruby hash? C implementation of hash table.

The perfect hash function is a signifiant gain in memory because it essentially compresses the hash keys.

There is all the same data stored in generated hash table, which would be stored in general hash table e.g. hash_key, original key, value. Plus general hash table will store pointer to next value in the bucket (or something similar depending on implementation).

#line 685 "Extensions.gperf"
      {"Z", "application/x-compressed"},

Main memory improvement comes from C itself, not perfect hash function.

Memory usage significantly reduced, 200Kb vs 6+Mbytes.

Yes and thats fantastic result. Thanks for doing it.

But if you would use all the same implementation + external data file the difference would be the same significant.

ioquatix commented 7 years ago

Why Ruby hash? C implementation of hash table.

Because Ruby hash is generic for Ruby objects. It's not that efficient for a static lookup.

There is all the same data stored in generated hash table, which would be stored in general hash table e.g. hash_key, original key, value. Plus general hash table will store pointer to next value in the bucket (or something similar depending on implementation).

Your right in the sense that all the entries are in the C code. Of course it has to be somewhere! But, your wrong if you think it's anything like a general hash table. There is no hash_key, no probing, no buckets, no next pointer or any other crap like that to slow down the lookup.

But if you would use all the same implementation + external data file the difference would be the same significant.

Nope, it wouldn't. Because there would be a large startup time, and it can't be shared easily using a DSO.

stereobooster commented 7 years ago

Because Ruby hash is generic for Ruby objects. It's not that efficient for a static lookup.

I mean do not use Ruby hash. Use C implementation of Hash table. Like you do now. Use the C implementation of Hash table. But instead of using ideal Hash function use general Hash function.

There is no hash_key, no probing, no buckets, no next pointer or any other crap like that to slow down the lookup.

And you think that this is the source of difference 200k vs 6mb?

It seems you have strong believe that I suggest to use Ruby code for this task. It is not like that.

perfect hash	general hash	Ruby implementation
simplified C hash table	general C hash table	Ruby hash table
200k	?k	6mb

Of course your solution would be the fastest and most optimal, I never argued with that. But suggested solution would be just a bit slower and you can swap data files easily.

ioquatix commented 7 years ago

I mean do not use Ruby hash. Use C implementation of Hash table. Like you do now. Use the C implementation of Hash table. But instead of using ideal Hash function use general Hash function.

Yes, and it would have a large start-up time to populate such a data structure, and it would have a large non-sharable memory allocation. Two VERY big issues people currently face when using large Ruby code bases, so big that people try to work around it using application servers that do "prefork" initialisation.

A perfect hash algorithm has got almost nothing to do with a general purpose hash table data structure. They don't even need to compute a general hash value at run-time. The hash computation is directly tied to the lookup, and it's a direct array lookup with a single key comparison if required.

And you think that this is the source of difference 200k vs 6mb?

Yes, my solution is the fastest, and most optimal, that I could come up with, and that's what I was aiming for. I think if you are interested, why don't you implement a general purpose C hash in a fork of my code, and then compare it. Startup time, lookup time, and memory usage. Keep in mind that in the above memory usage is actually:

200k per DSO context (e.g. OS). 6Mbytes per process. Whatever you are proposing with a general purpose hash table, is probably also per-process.

Of course your solution would be the fastest and most optimal, I never argued with that. But suggested solution would be just a bit slower and you can swap data files easily.

Swapping data files is not something people usually do, but I've also suggested a way in which this can be achieved more efficiently above, by using nested wrappers, which I think is actually the best solution, rather than having users modify global state.

I appreciate you taking the time to think about this issue. Let's work towards something that makes sense.

ioquatix commented 7 years ago

Here is what I think makes sense.

A data in some standard format: JSON, MsgPack, YAML, XML, CSV, TSV, whatever as long as it's standard and easily loadable. Don't care about performance much. Ideally, dataset is divided based on where it comes from, e.g. IANA, unofficial, etc. These are semi-immutable by gem version. This already exists as mime-types-data which is awesome, but I think there is quite a bit of junk in there that could be organised better?
A tool to generate native code for efficient, immutable content type lookups. Already done POC.
A pure ruby tool to do the same lookups. Already exists in several places.
A general purpose interface for doing this, as well as doing nested lookup as proposed. Doesn't really exist.

We would then expose MIME::Types via the same interface.

People who use MIME::Types would have a drop-in replacement which is more efficient in both space and time. Code that uses MIME::Types should expose configuration/arguments which allows for the injection of different providers. Given how hard it would be to get everyone to change from using MIME::Types, I'm not sure which strategy is best, but one of:

MIME::Types gem becomes a data-agnostic gem. It provides a light-weight interface as it currently does for constructing a mime type registry, and perhaps exposes it at a global level. The backend can be replaced by a user simply saying something like MIME::Types.provider = MIME::Types::Mini.new or something to that effect.
A new gem which at a minimum implements the abstract operations required, and then make it possible to use MIME::Types as a provider, as well as other gems.

SamSaffron commented 7 years ago

@ioquatix maybe it makes sense to release a non-specific gem here, example possible interface:

h = BigHash.load(filename)
h["test"]
"value"

Then this pattern and encoding structure can be reused in other places that need access to a static hash of sorts, for example, yaml translations have the exact same problem of consuming WAY too much memory these days. So many RVALUES are just sitting there being scanned every major GC for zero value.

Advantage of this kind of approach is that it can have a naive Ruby implementation and high performance MRI or JRuby implementation.

It would also be significantly simpler to integrate into this project cause it could become an explicit dependency.

ioquatix commented 7 years ago

I did think about this actually.

It would be nice.

But it would have to be compiled on the fly. It's not impossible, it's just a lot of work.

Honestly, there is another question to ask - is mime-types simply loading too much unnecessary data in the first place?

SamSaffron commented 7 years ago

@ioquatix 99.99% of the time the consumers don't need the vast army of information mime-types provides... it is why I created https://github.com/discourse/mini_mime

The 99.99% case is that people include the mail gem in rails and it in turn includes mime-types just to do a couple of very specific lookups, under pretty specific conditions. The mail gem now uses mini_mime which means that future versions of rails will not depend on full mime types gem.

ioquatix commented 7 years ago

Thanks for your effort making mini_mime. However, it appears the design is still fundamentally broken due to the dependence on global state. The mail gem should have an argument or configuration parameter which takes in a class implementing a standard API (e.g. perhaps the one you provide). It should be up to the user to provide a mime provider, but there could be a sane default, e.g. on a global provider. But it should be possible to provide a custom provider. Was this ever done?

SamSaffron commented 7 years ago

However, it appears the design is still fundamentally broken due to the dependence on global state.

Can you explain this? I am not following, mini_mime loads nothing into memory on boot and uses an LRU cache to cycle stuff in as needed topping up at about 400 RVALUEs in memory.

For context:

User.first in Rails is around a 400 RVALUE allocated operation.

ioquatix commented 7 years ago

Okay, well, you encourage people to use the api: https://github.com/discourse/mini_mime/blob/356b01907d11e30a39ab16f4bace2aa9bd80d699/lib/mini_mime.rb#L4-L11

How does one add a new mime type?

How does one remove an existing mime type?

How does one provide a subset of mime types to the mail gem, and a different subset to some other gem?

SamSaffron commented 7 years ago

sure we could expand the interface to support this or add a general interface to the mail gem, its not a problem that was solved cause nobody had it :)

ioquatix commented 7 years ago

As you said yourself, most people don't need 99% of what mime-types provides. So, your not really solving the problem at all, your just making the data set smaller. The problem of the DESIGN still exists, and it's affecting everyone.

mime-types / ruby-mime-types

Memory usage/overhead? #123