Slow serialization with large Array data (> 60 Mbytes)

hernanmd commented 1 year ago

It seems that Fuel takes a long time serializing an Array of Arrays containing multiple Strings each one. When serialization passes around 60 Mbytes, it starts writing data in chunks of 100Kbytes and very slowly:

To reproduce please download this Fuel serialization: https://filesender.renater.fr/?s=download&token=b3809275-6258-4ee4-8d93-3b8871b7a0bd

Materialize it in a fresh Pharo 11 image:

FLMaterializer 
    materializeFromFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized.fuel'.

and then from the Inspector window evaluate:

FLSerializer 
    serialize: self 
    toFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized_test.fuel'.

We tested in Pharo 11 under macOS.

theseion commented 1 year ago

Interesting. I don't see any reason for Fuel to do that. There must be another mechanism at play. I currently don't have to look into this tough, in a week or two maybe. @tinchodias, do you have some time to spare maybe?

theseion commented 1 year ago

Can you give me a rough timing estimate of what you mean by "slowly"?

hernanmd commented 1 year ago

Can you give me a rough timing estimate of what you mean by "slowly"?

It was around 100K/sec

theseion commented 1 year ago

This is very interesting. I found the issue to be time spent in FLLargeIdentityDictionary. Replacing one particular use of FLLargeIdentityDictionary with IdentityDictionary cut time down to ~40 seconds. FLLargeIdentityDictionary was specifically designed to be faster for large collections, so it's very curious that this happens.

theseion commented 1 year ago

Still investigating but interesting find: in Pharo 11 the string hashes appear to be badly distributed.

hernanmd commented 1 year ago

Probably @guillep will be interested

theseion commented 1 year ago

I'm not sure the hash distribution is bad per se, it's just that the hashes have only uneven numbers when using modulo 4096.

It looks like FLLargeIdentityDictionary has been slower than we wanted it to be since at least Pharo 6.

guillep commented 1 year ago

Interesting finding :)

guillep commented 1 year ago

I checked the usage of identity dictionaries. I wrote a silly (and unfinished) identity dictionary implementation that keeps a hash table with hash tables inside. The outer hash table uses the modulo of the identity hash and benefits from them being evenly distributed. This silly implementation outperformed the other identity dictionaries for the larger charges (yellow in the plot below). It's 4x faster than FLLargeIdentityDictionary and 50x faster than the standard identity dictionary for the largest set of objects.

This kind of tells me that a better identity dictionary will drastically improve the performance of fuel. I'll add this to the GSOC proposal here: https://gsoc.pharo.org/serialization

theseion commented 1 year ago

Brilliant, thanks!

jecisc commented 1 year ago

Oh nice, not just Fuel. The system relies a lot on IdentityDictionaries :)

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

seandenigris commented 2 months ago

I recently had a 40MB serialized graph balloon to almost 400MB on disk, which translated into even more memory in the image. It turned out to be caused by a single OrderedCollection that had 99 million strings as items. Sounds potentially related...

By the way, are there any tricks to investigate the cause of such a problem? It took me many hours to track down by selectively nil-ing out parts of the model until I found the collection. I never would've guessed it because it was not part of the model that seemed likely to grow that large. I tried the debugger, but memory use grew to about 16GB before the image started to fail e.g. "file does not exist" errors

theseion commented 2 months ago

Just the standard "tricks":

count number of instances per class
check collection sizes

With v5 I had introduced "graph culling", so you could have configured the depth of the graph, and maybe tried to increase the depth continuously, until you started seeing the issue. Then you could have set a break point at that depth level to inspect candidates.

tinchodias commented 2 months ago

@seandenigris Did you use SpaceTally in your investigation? Anyway, I don't see a clear relation between the too-big .fuel file in your case, and the badly distributed keys in the in the large identity dictionary discovered by Max in this issue.

ELePors commented 2 months ago

Hi i have the same problem with a 10mb fuel file... which takes less than 500ms to read, but almost 59seconds to produce... STON produces a serialization of the same objects in less than 5 seconds... (10 times less).

I tried a workaround published here by Mariano : http://forum.world.st/Fuel-serialize-of-70MB-takes-forever-on-Linux-vs-Mac-td5061002.html

| set dict |
set := FLLargeIdentitySet.
dict := FLLargeIdentityDictionary.
Smalltalk at: #FLLargeIdentitySet put: IdentitySet.
Smalltalk at: #FLLargeIdentityDictionary put: IdentityDictionary.

my serialization with fuel of the same data is now done in 4,6 seconds... ok ... ?

ELePors commented 2 months ago

I don't know the impact on my Pharo image with this kind of workaround... what should we do ?

Eric.

theseion commented 2 months ago

Well, it looks like it's time to switch to the standard collections :)

I quickly checked and all tests pass using the standard collections instead of the Fuel custom ones. You should be good using the workaround you posted.

ELePors commented 2 months ago

i can propose a pull request on Pharo 12 branch

seandenigris commented 2 months ago

Well, it looks like it's time to switch to the standard collections :)

What was the purpose of the custom collections vs. standard ones?

theseion commented 1 month ago

i can propose a pull request on Pharo 12 branch

That would be nice.

What was the purpose of the custom collections vs. standard ones?

Unfortunately, the original research paper doesn't say anything about those. I presume that Fuel exhibited performance issues with large collections due to hash collisions. Meanwhile, the VMs use 64 bit words and longer hashes, which probably has mitigated the issue. Maybe @marianopeck or @tinchodias can share some more details.

theseion / Fuel

Slow serialization with large Array data (> 60 Mbytes) #269