mlomb / chat-analytics

Generate interactive, beautiful and insightful chat analysis reports
https://chatanalytics.app
GNU Affero General Public License v3.0

Big export (>20GB) has broken data #83

Closed AmazTING closed 1 year ago

AmazTING commented 1 year ago

When compacting message data, if the export contains more than 16,776,540 messages, the program crashes because Node.js's maximum Map size is exceeded. Below is the error (with some information, like my Windows username, redacted for privacy reasons):

Compacting messages data 16.776.540/18.172.487
C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\process\DatabaseBuilder.js:306
        messageAddresses.set(id, finalMessages.byteLength);
                         ^

RangeError: Map maximum size exceeded
    at Map.set (<anonymous>)
    at DatabaseBuilder.compactMessagesData (C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\process\DatabaseBuilder.js:306:34)
    at DatabaseBuilder.build (C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\process\DatabaseBuilder.js:331:40)
    at generateDatabase (C:\Users(redacted)\node_modules\chat-analytics\dist\pipeline\index.js:15:20)
    at async C:\Users(redacted)\node_modules\chat-analytics\dist\lib\CLI.js:93:16

Node.js v18.10.0
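
For reference, this limit comes from V8 itself: a single Map caps out at roughly 2^24 (about 16.7 million) entries, no matter how much heap is available. A minimal sketch (not from the project, and assuming a few GB of heap) that reproduces the same RangeError:

// Illustrative only: keep inserting numeric keys until V8 refuses.
// The exact threshold may vary slightly between V8/Node versions.
const m = new Map<number, number>();
try {
    for (let i = 0; ; i++) {
        m.set(i, i); // throws near 16,777,216 entries (2^24)
    }
} catch (err) {
    console.log(`Map rejected entry #${m.size + 1}:`, (err as Error).message);
    // -> "Map maximum size exceeded"
}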

mlomb commented 1 year ago

Wow! That's a lot of data! How many messages are there in your export (beyond the 16M)? Sorry, I just noticed it's in the log.

I think we could abstract the maps behind some interface that creates other maps on demand.
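
A minimal sketch of that idea (names and limits are illustrative, not the project's API): a Map-like wrapper that starts a new shard whenever the current one approaches V8's per-Map limit.

// Stay safely below V8's ~2^24 per-Map entry limit.
const MAX_PER_SHARD = 16_000_000;

class ShardedMap<K, V> {
    private shards: Map<K, V>[] = [new Map()];

    set(key: K, value: V): this {
        // If the key already lives in some shard, update it in place.
        for (const shard of this.shards) {
            if (shard.has(key)) {
                shard.set(key, value);
                return this;
            }
        }
        // Otherwise append to the last shard, creating a new one on demand.
        let last = this.shards[this.shards.length - 1];
        if (last.size >= MAX_PER_SHARD) {
            last = new Map();
            this.shards.push(last);
        }
        last.set(key, value);
        return this;
    }

    get(key: K): V | undefined {
        for (const shard of this.shards) {
            if (shard.has(key)) return shard.get(key);
        }
        return undefined;
    }

    has(key: K): boolean {
        return this.shards.some((shard) => shard.has(key));
    }

    get size(): number {
        return this.shards.reduce((n, shard) => n + shard.size, 0);
    }
}

With only two or three shards in play (as discussed below), the extra scan per lookup is negligible.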

hopperelec commented 1 year ago

It could be better to use a custom map implementation altogether; that way we could use a custom hashing function to make it faster (I'm guessing Node.js uses the full object hashcode by default) and perhaps more memory efficient.

mlomb commented 1 year ago

I think adding a second or third map will be enough for us; 32M or 48M messages is a lot, and I don't expect more than that (I've never tried it with more than 12M). Also, it would be hard to do a better job than V8; it is really good as it is, both in speed and memory.

AmazTING commented 1 year ago

Sorry, I just wanted to note that this isn't all the messages I have; some were still exporting. In total, it would be ~42M messages.

mlomb commented 1 year ago

@AmazTING can you try uploading the messages to the PR? 😄 (https://pr-84.chat-analytics.pages.dev)

> In total, it would be ~42M messages

😳

That will require a lot of memory, and even if it is exported, I can't promise that the report will load 😟

AmazTING commented 1 year ago

When using the webpage, it crashes with "Aw Snap! Something went wrong displaying this webpage". I am also trying to use the CLI with the new commits; I will update soon.

EDIT: Nope, still an error, but I now get (again my username is redacted):

Compacting messages data 23.335.577/23.335.577
Compressing
node:internal/errors:484
            ErrorCaptureStackTrace(err);
            ^

TypeError [ERR_ENCODING_INVALID_ENCODED_DATA]: The encoded data was not valid for encoding iso-8859-10
    at new NodeError (node:internal/errors:393:5)
    at TextDecoder.decode (node:internal/encoding:433:15)
    at base91encode (C:\Users\Downloads\chat-analytics-main\dist\pipeline\compression\Base91.js:106:43)
    at compressDatabase (C:\Users\Downloads\chat-analytics-main\dist\pipeline\compression\Compression.js:37:45)
    at generateReport (C:\Users\Downloads\chat-analytics-main\dist\pipeline\index.js:22:60)
    at C:\Users\Downloads\chat-analytics-main\dist\lib\CLI.js:94:53 {
  errno: 1,
  code: 'ERR_ENCODING_INVALID_ENCODED_DATA'
}

EDIT 2: Actually, this may be a problem with the dataset, not the program. I'll look into it.

AmazTING commented 1 year ago

So, my previous comment was correct: it is a problem with the data. However, after fixing that, there were still a few problems with the generated HTML file:

1. The graph didn't function; it simply said "Check the console for details". When I checked the console, I found this error:

TypeError: Cannot read properties of undefined (reading 'v')
    at authors (data:application/jav...zEpfSkoKTs=:1:16285)
    at Object.fn (data:application/jav...zEpfSkoKTs=:1:16271)
    at data:application/jav...zEpfSkoKTs=:1:35535
    at new Promise (<anonymous>)
    at s (data:application/jav...zEpfSkoKTs=:1:35280)
    at self.onmessage (data:application/jav...zEpfSkoKTs=:1:36472)

2. The breakdown of 'Total messages sent' displayed numbers that shouldn't be possible for certain categories:

Total messages sent: 23,202,628
✏️ with text - 16,082,245
🔗 with links - 387,643
📷 with images - 5,430,907,242,766
👾 with GIFs - 2,786,796,325,385
📹 with videos - 2,220,680,240,802
🎉 with stickers - 2,185,162,930,983
🎵 with audio files - 2,052,764,210,281
📄 with documents - 1,884,002,424,484
📁 with other files - 1,938,157,507,785

3. The 'edited messages' section displayed the highest edit time difference as 49613d, 5h (135 years), which also shouldn't be possible.

4. The activity by hour only displayed hours with over 740,000 messages sent (anything below that was simply blank).

5. (unsure) The activity by day displayed an overwhelming amount for Thursday, despite there not being, to my knowledge, any particular influx of messages on Thursdays (it showed ~7 million, about 3x the ~2.5 million average of the other days).

6. The timeline of 'active authors over time, by month' also had an error:

TypeError: Cannot read properties of undefined (reading 'add')
    at channels (data:application/jav...zEpfSkoKTs=:1:20047)
    at t.filterMessages (data:application/jav...NzEpfSkoKTs=:1:7827)
    at data:application/jav...zEpfSkoKTs=:1:19961
    at Array.map (<anonymous>)
    at Object.fn (data:application/jav...zEpfSkoKTs=:1:19879)
    at data:application/jav...zEpfSkoKTs=:1:37112
    at Generator.next (<anonymous>)
    at data:application/jav...zEpfSkoKTs=:1:35535
    at new Promise (<anonymous>)
    at s (data:application/jav...zEpfSkoKTs=:1:35280)

7. 'Sentiment' also has a very similar error to the 'messages' graph and the 'active authors' graph, probably caused by the same issue.

8. The 'interaction' section is quite broken. The 'authors that got the most reactions' graph shows an error message, the 'top reacted messages' graph shows an error message as well, and the 'most mentioned' graph displays amounts that shouldn't be possible for seemingly random users (over 500 billion).

9. The 'links' section has a similar problem, displaying numbers that shouldn't be possible (over 1 trillion for youtube.com alone).

10. All of the 'emoji' section's graphs have errors.

11. The 'language' section crashes the page.

Attached is also the file itself as a zip (the data contains no private information, so nothing is redacted), shared as a Google Drive file because it is too big for GitHub:

https://drive.google.com/file/d/1EqnnZT_IYjaIJVEBDpEBpWdLaA2DCsab/view?usp=sharing

mlomb commented 1 year ago

Ok, I've been investigating. It seems message serialization breaks after message 18,084,914 (that would explain why we see ~2.5M each day: 2.5M * 7 days = 17.5M, and everything after that is broken). I found it using this:

let i = 0, done = false;
r.filterMessages((e) => {
    i++;
    if (done) return;
    if (e.dayIndex > 2100) {
        console.log(i);
        done = true;
    }
}, e, t);

So I started checking the serialization...

I checked if the default bits were too low, but they seem in range for this export: https://github.com/mlomb/chat-analytics/blob/452037f2ada3cff2fc656dc75a2938d4e3324094/pipeline/serialization/MessageSerialization.ts#L30-L44

I also looked very closely at the IndexCount serialization. I ran a grid search over combinations of [index, count] from 0 to 65k (num bits = 16) and it works as expected for count >= 1 (and all indexes). For count = 0 it fails, but that case should never happen, so I added some error checking:

https://github.com/mlomb/chat-analytics/pull/84/commits/d814387ec923c9842a4570166c389d2503329883#diff-d89d5e04d27b5a3d41e8b31ee25202b6a70ea6adb3be4ceba35425e02b909a39R57-R61
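
For illustration, the grid search amounts to a round-trip check like the following. This is a self-contained sketch with a simplified 16-bit packing, not the project's actual serialization code:

// Pack an [index, count] pair into two 16-bit fields and verify the round trip.
const BITS = 16;

function pack(index: number, count: number): number {
    return index * 2 ** BITS + count; // 32 bits total, still exact as a JS number
}

function unpack(packed: number): [index: number, count: number] {
    return [Math.floor(packed / 2 ** BITS), packed % 2 ** BITS];
}

// A full 65k x 65k grid is ~4.3 billion combinations, so sweep one axis
// exhaustively and sample the other.
for (let index = 0; index < 2 ** BITS; index++) {
    for (const count of [1, 2, 1000, 65535]) {
        const [i, c] = unpack(pack(index, count));
        if (i !== index || c !== count) {
            throw new Error(`round-trip failed for [${index}, ${count}]`);
        }
    }
}
console.log("all sampled [index, count] pairs round-trip correctly");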

@AmazTING could you try to export the chat again? (I know it's a pain, but I have no other way to know if this is the problem.) I made the changes in the same branch as before (https://pr-84.chat-analytics.pages.dev). If this is the problem, it should crash with one of those exceptions. If it does, send the stack so I can track down which IndexCount it is.

I can't find anything else that could break 😞 Again, thanks for helping 😄

AmazTING commented 1 year ago

Nope, I still get the same issues: the report loads, but many things are broken. Sorry for not mentioning this earlier, but the export is split into multiple JSON files; is it possible that might be the issue?

mlomb commented 1 year ago

Ok, so it's not that. Can you tell me:

From the report I thought the only channel was "offtopic". Is that right?

AmazTING commented 1 year ago

Yes, the only channel is off-topic, but it was so large I had to break it into 14 JSON files of varying sizes from ~600MB to ~4GB.

mlomb commented 1 year ago

Do you know how many messages you exported from DCE? Does chat-analytics at least get that right? It's ~23M according to the report.

A good thing to try next may be to generate reports for 1, 2, 3, etc. files incrementally, to know exactly at which point it breaks and whether it's a specific file or a volume problem. This may take a while; could you tell me which server it is so I can export it myself? (Or, if you have the time, upload the raw files for me.) I'm assuming a server this big is public and privacy is not a concern.

> 14 JSON files of varying sizes from ~600MB to ~4GB

~50GB is a lot for chat-analytics; I never thought it would analyze something this big O_O

Also... how much RAM did you need for this? 👀

AmazTING commented 1 year ago

Yep! The server is discord.gg/minecraft and the channel is #off-topic, but since it may take a while to export (it took me >100 hours), I'll also give a Google Drive link to the zipped JSONs here:

https://drive.google.com/file/d/1f6pheEYGw1G0mBxvxgjQP46-EUAheWtP/view?usp=sharing

mlomb commented 1 year ago

After a few failed attempts at generating the report in the browser (out of memory), I managed to replicate the issue using the CLI. I'll update when I discover something.

mlomb commented 1 year ago

Update! I managed to exclude some files and still reproduce the problem. After this I started logging and adding asserts everywhere (and also caught a few unrelated bugs). I think I found the issue! We use `BitAddress: number`, which can store integers up to 2^53, but bitwise operations on it are performed using only 32 bits, breaking serialization. The thing is that we address by bit and not by byte, so we reach 2^32 a lot faster: when the buffer reaches something like 280MB we are already around 2^31 bits, which is what happens in this big export.
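
The failure mode is easy to demonstrate in isolation (illustrative snippet, not the project's code): a bit address just past 2^31 is still a perfectly valid JavaScript number, but bitwise operators silently wrap it to a 32-bit integer.

// A bit offset roughly 268 MB into the stream, i.e. just past 2^31 bits.
const bitAddress = 2 ** 31 + 40;

// Signed 32-bit operators wrap it to a negative value:
console.log(bitAddress | 0);  // -2147483608
console.log(bitAddress >> 5); // -67108863  (a garbage "word index")

// Plain arithmetic stays exact up to 2^53:
console.log(Math.floor(bitAddress / 32)); // 67108865  (the correct word index)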

Look at the following function:

https://github.com/mlomb/chat-analytics/blob/452037f2ada3cff2fc656dc75a2938d4e3324094/pipeline/serialization/BitStream.ts#L121-L136

If we try to serialize a variable-size number like BitAddress it breaks completely:

https://github.com/mlomb/chat-analytics/blob/452037f2ada3cff2fc656dc75a2938d4e3324094/pipeline/serialization/MessageSerialization.ts#L64

After fixing the function using BigInts (if maxBits > 30), the errors persist elsewhere; I will keep looking for places where BitAddress is used in bitwise operations.
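
As a rough sketch of what "using BigInts if maxBits > 30" means (illustrative only, not the actual patch): the narrow path keeps using fast 32-bit operators, while values that may exceed ~30 bits are shifted and masked as BigInt and converted back at the end.

// Combine a high part and a low part (lowBits wide) into one value.
function composeBits(high: number, low: number, lowBits: number, maxBits: number): number {
    if (maxBits <= 30) {
        return (high << lowBits) | low; // the result fits in an int32, stay in number-land
    }
    // Past ~30 bits, `<<` and `|` on numbers would truncate to 32 bits,
    // so do the bit manipulation with BigInt (exact) and convert back
    // (still exact for results below 2^53).
    return Number((BigInt(high) << BigInt(lowBits)) | BigInt(low));
}

console.log(composeBits(3, 5, 4, 8));          // 53
console.log(composeBits(2 ** 22, 7, 20, 42));  // 4398046511111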

mlomb commented 1 year ago

Good news! We can close the issue!

The other problem was calculating the 32-byte aligned address from a BitAddress using bitwise operations:

https://github.com/mlomb/chat-analytics/blob/452037f2ada3cff2fc656dc75a2938d4e3324094/pipeline/serialization/BitStream.ts#L85-L86
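
The same 32-bit wrap-around shows up in an alignment computation like this one (illustrative numbers, not the project's exact code): rounding a bit address down to a word boundary with shifts produces garbage once the address passes 2^31, while the arithmetic form keeps working.

// A buffer of ~280 MB expressed in bits (past the int32 range).
const addr = 280 * 1024 * 1024 * 8; // 2,348,810,240 bits

const alignedBitwise = (addr >> 5) << 5;            // -1946157056  (wrapped)
const alignedArith   = Math.floor(addr / 32) * 32;  // 2348810240   (correct)

console.log(alignedBitwise, alignedArith);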

Here is the generated report for all the files you shared: report.zip (300MB)

Done: 36:00.760 (m:ss.mmm)
Report data size: 373 MB
Report HTML size: 374 MB
The report contains:
 [*] 23335577 messages
 [*] 343852 authors
 [*] 1 channels
 [*] 1 guilds

Thanks for taking the time to report this and upload the files for me :)

Edit: fix available in v1.0.2 and in the live app!