hernanmd opened this issue 1 year ago
Interesting. I don't see any reason for Fuel to do that. There must be another mechanism at play. I currently don't have time to look into this though; in a week or two maybe. @tinchodias, do you have some time to spare maybe?
Can you give me a rough timing estimate of what you mean by "slowly"?
> Can you give me a rough timing estimate of what you mean by "slowly"?
It was around 100K/sec
This is very interesting. I found the issue to be time spent in FLLargeIdentityDictionary. Replacing one particular use of FLLargeIdentityDictionary with IdentityDictionary cut the time down to ~40 seconds. FLLargeIdentityDictionary was specifically designed to be faster for large collections, so it's very curious that this happens.
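For anyone who wants to reproduce such a measurement, a minimal playground sketch (the graph here is synthetic and the file name hypothetical; TimeProfiler and the FLSerializer convenience API as found in recent Pharo/Fuel versions):

| graph |
"a synthetic graph of many strings, similar in shape to the reported one"
graph := (1 to: 100000) collect: [ :i | i printString ].
"profile the serialization to see where the time goes"
TimeProfiler spyOn: [
	FLSerializer serialize: graph toFileNamed: 'profiling-test.fuel' ].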
Still investigating but interesting find: in Pharo 11 the string hashes appear to be badly distributed.
@guillep will probably be interested.
I'm not sure the hash distribution is bad per se; it's just that the hashes produce only odd values when taken modulo 4096.
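To eyeball this in a playground, a sketch that tallies string hashes into 4096 buckets (4096 matches the modulo mentioned above; the sample strings are arbitrary):

| tally |
tally := Array new: 4096 withAll: 0.
1 to: 100000 do: [ :i |
	| slot |
	"1-based bucket index for the string's hash modulo 4096"
	slot := (i printString hash \\ 4096) + 1.
	tally at: slot put: (tally at: slot) + 1 ].
"if the hashes hit only odd values, about half of the buckets stay empty"
(tally count: [ :each | each = 0 ]) inspect.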
It looks like FLLargeIdentityDictionary has been slower than we wanted it to be since at least Pharo 6.
Interesting finding :)
I checked the usage of identity dictionaries.
I wrote a silly (and unfinished) identity dictionary implementation that keeps a hash table with hash tables inside. The outer hash table uses the modulo of the identity hash and benefits from the identity hashes being evenly distributed. This silly implementation outperformed the other identity dictionaries for the larger loads (yellow in the plot below). It's 4x faster than FLLargeIdentityDictionary and 50x faster than the standard identity dictionary for the largest set of objects.
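For illustration, a minimal playground sketch of that two-level idea (the names and the outer size here are made up): an outer Array indexed by the identity hash modulo its size, where each slot lazily holds a small IdentityDictionary bucket.

| outerSize buckets atPut lookup key |
outerSize := 4096.
buckets := Array new: outerSize.
"store: pick the outer slot from the identity hash, then add the
entry to the small IdentityDictionary living in that slot"
atPut := [ :k :v |
	| index |
	index := (k identityHash \\ outerSize) + 1.
	((buckets at: index)
		ifNil: [ buckets at: index put: IdentityDictionary new ])
			at: k put: v ].
"lookup: the same two steps"
lookup := [ :k | (buckets at: (k identityHash \\ outerSize) + 1) at: k ].
key := Object new.
atPut value: key value: 42.
(lookup value: key) inspect. "answers 42"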
This kind of tells me that a better identity dictionary will drastically improve the performance of Fuel. I'll add this to the GSoC proposal here: https://gsoc.pharo.org/serialization
Brilliant, thanks!
Oh nice, not just Fuel. The system relies a lot on IdentityDictionaries :)
This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.
I recently had a 40MB serialized graph balloon to almost 400MB on disk, which translated into even more memory in the image. It turned out to be caused by a single OrderedCollection that had 99 million strings as items. Sounds potentially related...
By the way, are there any tricks for investigating the cause of such a problem? It took me many hours to track down, by selectively nil-ing out parts of the model until I found the collection. I never would've guessed it because it was not a part of the model that seemed likely to grow that large. I tried the debugger, but memory use grew to about 16GB before the image started to fail with e.g. "file does not exist" errors.
Just the standard "tricks":
With v5 I had introduced "graph culling", so you could have configured the depth of the graph and maybe increased the depth incrementally until you started seeing the issue. Then you could have set a breakpoint at that depth level to inspect candidates.
@seandenigris Did you use SpaceTally in your investigation? Anyway, I don't see a clear relation between the too-big .fuel file in your case and the badly distributed keys in the large identity dictionary discovered by Max in this issue.
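For reference, SpaceTally can report the per-class footprint of the whole image, which helps spot an unexpectedly large class (a sketch, assuming the API available in recent Pharo images):

"prints instance counts and total bytes per class to a report"
SpaceTally new printSpaceAnalysis.

In a case like the one above, an outsized String or OrderedCollection row would stand out in the report.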
Hi, I have the same problem with a 10 MB Fuel file: it takes less than 500 ms to read, but almost 59 seconds to produce. STON serializes the same objects in less than 5 seconds (more than 10x faster).
I tried a workaround published by Mariano here: http://forum.world.st/Fuel-serialize-of-70MB-takes-forever-on-Linux-vs-Mac-td5061002.html
"keep references to the original classes so they can be restored later"
| set dict |
set := FLLargeIdentitySet.
dict := FLLargeIdentityDictionary.
"rebind the global names to the standard collection classes"
Smalltalk at: #FLLargeIdentitySet put: IdentitySet.
Smalltalk at: #FLLargeIdentityDictionary put: IdentityDictionary.
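To undo the rebinding later, the temporaries above keep the original classes around (assuming this runs in the same playground session, where bindings persist):

"restore the original Fuel collection classes"
Smalltalk at: #FLLargeIdentitySet put: set.
Smalltalk at: #FLLargeIdentityDictionary put: dict.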
My Fuel serialization of the same data is now done in 4.6 seconds... OK?
I don't know the impact of this kind of workaround on my Pharo image. What should we do?
Eric.
Well, it looks like it's time to switch to the standard collections :)
I quickly checked and all tests pass using the standard collections instead of the Fuel custom ones. You should be good using the workaround you posted.
I can propose a pull request on the Pharo 12 branch.
> Well, it looks like it's time to switch to the standard collections :)
What was the purpose of the custom collections vs. standard ones?
> I can propose a pull request on the Pharo 12 branch.
That would be nice.
> What was the purpose of the custom collections vs. standard ones?
Unfortunately, the original research paper doesn't say anything about them. I presume that Fuel exhibited performance issues with large collections due to hash collisions. Meanwhile, the VMs have moved to 64-bit words and longer hashes, which has probably mitigated the issue. Maybe @marianopeck or @tinchodias can share some more details.
It seems that Fuel takes a long time serializing an Array of Arrays, each containing multiple Strings. Once the serialization passes around 60 MB, it starts writing data in chunks of 100 KB, very slowly.
To reproduce, please download this Fuel serialization: https://filesender.renater.fr/?s=download&token=b3809275-6258-4ee4-8d93-3b8871b7a0bd
Materialize it in a fresh Pharo 11 image and then, from the Inspector window, evaluate an expression that re-serializes the graph (a sketch of both steps is below). We tested in Pharo 11 under macOS.
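The steps could look like this (a sketch using Fuel's class-side convenience API; the file names are hypothetical):

| graph |
"step 1: materialize the downloaded graph"
graph := FLMaterializer materializeFromFileNamed: 'graph.fuel'.
"step 2: re-serialize it and measure how long the write takes"
[ FLSerializer serialize: graph toFileNamed: 'graph-copy.fuel' ] timeToRun.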