Wow! That's a lot of data!
How many messages are there in your export (beyond the 16M)? Sorry, I just noticed the log.
I think we could abstract the maps with some interface that creates other maps on demand.
It might be better to use a custom map implementation altogether; that way we could use a custom hashing function to make it faster (I'm guessing Node.js uses the full object hash code by default) and perhaps more memory efficient.
I think adding a second or third map will be enough for us; 32M or 48M messages is a lot of messages, and I don't expect to have more than that (I never tried it with more than 12M). Also, it would be hard to do a better job than V8, which is really good as it is, in both speed and memory.
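For illustration, here is a rough sketch (not part of chat-analytics, and ignoring updates of keys that already live in an older shard) of the "interface that creates other maps on demand" idea: a Map-like wrapper that spills into a new Map whenever the current one approaches V8's ~16.7M-entry limit.

// Sketch only: shard entries across several Maps behind one interface.
class BigMap<K, V> {
    private static readonly MAX_PER_MAP = 16_000_000; // stay below V8's ~16.7M cap per Map
    private maps: Map<K, V>[] = [new Map()];

    set(key: K, value: V): void {
        let last = this.maps[this.maps.length - 1];
        if (last.size >= BigMap.MAX_PER_MAP) {
            // current shard is full, create a new one on demand
            last = new Map();
            this.maps.push(last);
        }
        last.set(key, value);
    }

    get(key: K): V | undefined {
        // look the key up in every shard
        for (const map of this.maps) {
            const value = map.get(key);
            if (value !== undefined) return value;
        }
        return undefined;
    }
}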
Sorry, I just wanted to note, this isn't all the messages I have. Some were still exporting. In total, it would be ~42M messages.
@AmazTING can you try uploading the messages to the PR? 😄 (https://pr-84.chat-analytics.pages.dev)
In total, it would be ~42M messages
😳
That will require a lot of memory, and even if it is exported, I can't promise that the report will load 😟
When using the webpage, it crashes with "Aw Snap! Something went wrong displaying this webpage". I am also trying to use the CLI with the new commits, I will update soon.
EDIT: Nope, still an error, but I now get (again my username is redacted):
Compacting messages data 23.335.577/23.335.577 Compressing
node:internal/errors:484
    ErrorCaptureStackTrace(err);
    ^

TypeError [ERR_ENCODING_INVALID_ENCODED_DATA]: The encoded data was not valid for encoding iso-8859-10
    at new NodeError (node:internal/errors:393:5)
    at TextDecoder.decode (node:internal/encoding:433:15)
    at base91encode (C:\Users\Downloads\chat-analytics-main\dist\pipeline\compression\Base91.js:106:43)
    at compressDatabase (C:\Users\Downloads\chat-analytics-main\dist\pipeline\compression\Compression.js:37:45)
    at generateReport (C:\Users\Downloads\chat-analytics-main\dist\pipeline\index.js:22:60)
    at C:\Users\Downloads\chat-analytics-main\dist\lib\CLI.js:94:53 {
  errno: 1,
  code: 'ERR_ENCODING_INVALID_ENCODED_DATA'
}
EDIT 2: Actually, this may be a problem with the dataset, not the program. I'll look into it.
So, my previous comment was correct: it is a problem with the data. However, after fixing this issue, there were still a few problems after generating the HTML file.
TypeError: Cannot read properties of undefined (reading 'v')
at authors (data:application/jav...zEpfSkoKTs=:1:16285)
at Object.fn (data:application/jav...zEpfSkoKTs=:1:16271)
at data:application/jav...zEpfSkoKTs=:1:35535
at new Promise (
The total messages panel displays impossible counts:
Total messages sent: 23,202,628
✏️ with text: 16,082,245
🔗 with links: 387,643
📷 with images: 5,430,907,242,766
👾 with GIFs: 2,786,796,325,385
📹 with videos: 2,220,680,240,802
🎉 with stickers: 2,185,162,930,983
🎵 with audio files: 2,052,764,210,281
📄 with documents: 1,884,002,424,484
📁 with other files: 1,938,157,507,785
The 'edited message' section displayed the highest edit time difference as 49613d, 5h (135 years), which also shouldn't be possible
The 'activity by hour' graph only displayed message counts over 740,000 (anything other than that was simply blank)
(unsure) The activity by day displayed an overwhelming amount for Thursday, despite, to my knowledge, there not being any kind of influx of messages on Thursday specifically (it was 3x any other day at 7 million, compared to the average 2.5 million for other days)
The timeline of 'active authors over time, by month', also had an error:
TypeError: Cannot read properties of undefined (reading 'add')
at channels (data:application/jav...zEpfSkoKTs=:1:20047)
at t.filterMessages (data:application/jav...NzEpfSkoKTs=:1:7827)
at data:application/jav...zEpfSkoKTs=:1:19961
at Array.map (
'Sentiment' also has a very similar error to the 'messages' graph and the 'active authors' graph, probably caused by the same issue
The 'interaction' section is quite broken. The 'authors that got the most reactions' graph has an error message, the 'top reacted messages' graph has an error message as well, and the 'most mentioned' graph displays amounts that shouldn't be possible for seemingly random users (over 500 billion)
The 'links' section has a similar problem, displaying numbers that shouldn't be possible (1 trillion for youtube.com alone)
All the 'emoji' section's graphs have errors
The 'language' section crashes the page
Also attached is the file itself as a zip (the data contains no private information, so nothing is redacted), as a Google Drive link because it is too big for GitHub:
https://drive.google.com/file/d/1EqnnZT_IYjaIJVEBDpEBpWdLaA2DCsab/view?usp=sharing
Ok, I've been investigating. It seems message serialization breaks after message 18084914 (that would explain why we see 2.5M each day, since 2.5M * 7 days = 17.5M, and why everything after that is broken). I found it using this:
let i = 0, done = false;
// walk all messages and log the position of the first one with an out-of-range dayIndex
r.filterMessages((e) => {
    i++;
    if (done) return;
    if (e.dayIndex > 2100) {
        console.log(i); // first corrupted message found at position i
        done = true;
    }
}, e, t);
So I started checking the serialization...
I checked if the default bits were too low, but they seem in range for this export: https://github.com/mlomb/chat-analytics/blob/452037f2ada3cff2fc656dc75a2938d4e3324094/pipeline/serialization/MessageSerialization.ts#L30-L44
I also looked very closely at the IndexCount serialization. I ran a grid search with all combinations of [index, count] from 0 to 65k (num bits = 16) and it works as expected for count >= 1 (and all indexes). For count = 0 it fails, but that should never happen, so I added some error checking:
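(The error checking itself isn't shown here.) For reference, the grid-search round trip described above is shaped roughly like this, with stand-in pack/unpack functions rather than the real IndexCount serialization:

// Stand-in 16-bit packing, NOT the project's IndexCount code.
const BITS = 16;
const pack = (index: number, count: number): number => index * 2 ** BITS + count;
const unpack = (packed: number): [number, number] => [Math.floor(packed / 2 ** BITS), packed % 2 ** BITS];

// Exhaustive round-trip check over every [index, count] combination in range.
for (let index = 0; index < 2 ** BITS; index++) {
    for (let count = 0; count < 2 ** BITS; count++) {
        const [i, c] = unpack(pack(index, count));
        if (i !== index || c !== count) {
            throw new Error(`round-trip failed at index=${index} count=${count}`);
        }
    }
}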
@AmazTING could you try to export the chat again? (I know it's a pain, but I have no other way to know if this is the problem.) I made the changes in the same branch as before (https://pr-84.chat-analytics.pages.dev). If this is the problem, it should crash with one of those exceptions. If it does, send the stack trace so I can track down which IndexCount it is.
I can't find anything else that could break 😞 Again, thanks for helping 😄
Nope, I still get the same issues: the report loads, but many things are broken. Sorry for not mentioning this earlier, but the export is split across multiple JSON files; is it possible that might be the issue?
Ok, so it's not that. Can you tell me:
From the report I thought the only channel was "offtopic"
Yes, the only channel is off-topic, but it was so large I had to break it into 14 JSON files of varying sizes from ~600MB to ~4GB.
Do you know how many messages you exported from DCE? Does chat-analytics get at least that right?
~23M, according to the report.
A good thing to try next may be to generate reports for 1, 2, 3, etc. files incrementally, to know exactly at which point it breaks and whether it's a specific file or a volume problem. This may take a while; could you tell me which server it is so I can export it myself? (Or, if you have the time, upload the raw files for me.) I'm assuming a server this big is public and privacy is not a concern.
14 JSON files of varying sizes from ~600MB to ~4GB
~50GB is a lot for chat-analytics, I never thought it would analyze something this big O_O
Also... how much RAM did you need for this? 👀
Yep! The server is discord.gg/minecraft and the channel is #off-topic, but since it may take a while to export (it took me >100 hours), I'll also give a Google Drive link to the zipped JSONs here:
https://drive.google.com/file/d/1f6pheEYGw1G0mBxvxgjQP46-EUAheWtP/view?usp=sharing
After a few failed attempts at generating the report in the browser (out of memory), I managed to replicate the issue using the CLI. I'll update when I discover something.
Update!
I managed to exclude some files and still reproduce the problem. After this I started logging and adding asserts everywhere (and also caught a few unrelated bugs).
I think I found the issue! We use BitAddress: number, which can store integers up to 2^53, but bitwise operations on it are performed using only 32 bits, breaking serialization.
The thing is that we address by bit and not by byte, so we reach 2^32 a lot faster. When the buffer reaches something like 280MB we are already around 2^31 bits, which is what happens in this big export.
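To make the failure mode concrete, here is a standalone illustration (not project code) of what happens once a bit address passes 2^31: JavaScript's bitwise operators coerce their operands to 32-bit integers first.

const addr = 2 ** 31 + 3;   // a valid BitAddress; Number can hold integers up to 2^53
console.log(addr >> 5);     // -67108864: addr was coerced to a negative 32-bit integer
console.log(addr << 1);     // 6: the high bits are silently dropped
console.log(addr / 32);     // 67108864.09375: plain arithmetic still sees the full value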
Look at the following function:
If we try to serialize a variable-size number like BitAddress, it breaks completely:
After fixing the function using BigInts (if maxBits > 30), the errors persist elsewhere; I will keep looking for places where BitAddress is being used in bitwise operations.
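A minimal sketch of the BigInt approach (an assumption about its shape, not the actual patch): when a field may be wider than 30 bits, split the value with BigInt math and write it as two narrow chunks, so no single step relies on 32-bit bitwise coercion.

// writeNarrow stands for whatever existing routine writes a <=30-bit unsigned value.
function writeWide(writeNarrow: (value: number, bits: number) => void, value: number, bits: number): void {
    if (bits <= 30) {
        writeNarrow(value, bits);                 // narrow path: 32-bit bitwise math is safe
        return;
    }
    const big = BigInt(value);
    const low = Number(big & ((1n << 30n) - 1n)); // lowest 30 bits
    const high = Number(big >> 30n);              // remaining high bits
    writeNarrow(low, 30);
    writeNarrow(high, bits - 30);
    // the reader must reassemble the same way: value = high * 2 ** 30 + low
}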
Good news! We can close the issue!
The other problem was calculating the 32-byte aligned address from a BitAddress using bitwise operations:
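Assuming this refers to finding the 32-bit word (and the bit offset within it) that contains a given bit, a division-based replacement looks roughly like this (a sketch, not necessarily the exact change):

// Sketch: derive the word index and in-word bit offset from a BitAddress
// without 32-bit bitwise coercion (i.e. without `address >>> 5` / `address & 31`).
function wordAligned(address: number): { wordIndex: number; bitOffset: number } {
    return {
        wordIndex: Math.floor(address / 32), // exact for any integer address up to 2^53
        bitOffset: address % 32,             // position of the bit inside that word
    };
}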
Here is the report generated from all the files you shared: report.zip (300MB)
Done: 36:00.760 (m:ss.mmm)
Report data size: 373 MB
Report HTML size: 374 MB
The report contains:
[*] 23335577 messages
[*] 343852 authors
[*] 1 channels
[*] 1 guilds
Thanks for taking the time to report this and upload the files for me :)
Edit: fix available in v1.0.2 and in the live app!
When compacting message data, if there are more than 16,776,540 messages, the program will crash due to the Node.js maximum Map size being exceeded. Below is the error (with some information, like my Windows username, redacted for privacy):
Compacting messages data 16.776.540/18.172.487
C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\process\DatabaseBuilder.js:306
            messageAddresses.set(id, finalMessages.byteLength);
            ^

RangeError: Map maximum size exceeded
    at Map.set ()
    at DatabaseBuilder.compactMessagesData (C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\process\DatabaseBuilder.js:306:34)
    at DatabaseBuilder.build (C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\process\DatabaseBuilder.js:331:40)
    at generateDatabase (C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\index.js:15:20)
    at async C:\Users(redacted)\node_modules\chat-analytics\dist\lib\CLI.js:93:16
Node.js v18.10.0
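For reference, the Map cap behind this crash is easy to reproduce in isolation: a single V8 Map holds at most about 2^24 (~16.7M) entries.

// Standalone reproduction, not project code.
const m = new Map<number, number>();
for (let i = 0; i < 17_000_000; i++) {
    m.set(i, i); // throws "RangeError: Map maximum size exceeded" around 16.7M entries
}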