jlfwong / speedscope

🔬 A fast, interactive web-based viewer for performance profiles.
https://www.speedscope.app
MIT License

Importing a large linux perf file will crash the application #433

Closed Goose97 closed 1 year ago

Goose97 commented 1 year ago

Hi, thank you for your work on this awesome project.

Back to the story: I was importing a large file (linux perf tool format), around 1.3 GB, and it crashed the application. The root cause is this line. When we're splitting the file into blocks, a block's size can exceed V8's maximum string length. The situation is similar to #385.

My plan is:

  1. change splitBlocks signature to
    splitBlocks(contents: TextFileContent): TextFileContent[]
  2. pending will now have the type StringBackedTextFileContent | BufferBackedTextFileContent. It will initially be a StringBackedTextFileContent for performance, then fall back to a BufferBackedTextFileContent if the block size exceeds the limit. Something like this:
    try {
      pending += line
    } catch (error) {
      if (check(error)) {
        // convert to BufferBackedTextFileContent and continue
      }
    }
  3. To do this, we might need to add some methods to the TextFileContent interface (see the rough sketch of the fallback after this list).
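
For illustration, the fallback in step 2 could look roughly like this. The accumulator class and its method names below are placeholders I made up, not the real TextFileContent types:

    const encoder = new TextEncoder()

    // Accumulates a block as a plain string until V8 refuses to grow it any
    // further, then switches to an array of UTF-8 chunks (buffer-backed).
    class BlockAccumulator {
      private str: string | null = ''
      private chunks: Uint8Array[] = []

      append(line: string): void {
        if (this.str !== null) {
          try {
            this.str += line
            return
          } catch (error) {
            // V8 throws a RangeError ("Invalid string length") when the
            // concatenated string would exceed its maximum string length.
            if (!(error instanceof RangeError)) throw error
            this.chunks.push(encoder.encode(this.str))
            this.str = null
          }
        }
        this.chunks.push(encoder.encode(line))
      }

      // Total size in bytes, regardless of which representation is active.
      byteLength(): number {
        return this.str !== null
          ? encoder.encode(this.str).byteLength
          : this.chunks.reduce((sum, c) => sum + c.byteLength, 0)
      }
    }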

Another approach I'm considering is to implement a generic splitBy(delimiter: string) for TextFileContent. I haven't thought it through yet, but it seems non-trivial.
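
If that route ends up making more sense, I'm imagining something along these lines: a splitBy that works over the raw bytes and yields subarray views, so no piece ever has to be materialized as one giant string. This is just a sketch (ASCII-only delimiters assumed, so the byte matching can't land inside a multi-byte UTF-8 sequence), not a worked-out design:

    // Yields views into `bytes` between occurrences of an ASCII delimiter.
    function* splitBy(bytes: Uint8Array, delimiter: string): Generator<Uint8Array> {
      const delim = Array.from(delimiter, c => c.charCodeAt(0))
      let start = 0
      for (let i = 0; i + delim.length <= bytes.length; i++) {
        if (delim.every((b, j) => bytes[i + j] === b)) {
          yield bytes.subarray(start, i)
          start = i + delim.length
          i = start - 1 // skip past the delimiter; the loop will increment i
        }
      }
      yield bytes.subarray(start)
    }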

The file is fairly large, so I won't attach it here. I still have it on my computer in case you want to have a look.

jlfwong commented 1 year ago

Hey @Goose97!

Thanks for the very clear description of the problem you're trying to solve, and the links to the specific parts of the code!

If I'm understanding the problem correctly, I think there's a simpler solution that's probably more performant.

The splitBlocks function doesn't really need to exist at all. It was just a simple way to reason about the problem.

Instead, I think we can change the importer to just operate on the output of splitLines() directly without using \n\n delimited blocks as an intermediate. This also has the benefit of not doubling the memory requirements.

If we do your proposed change, I think we end up with the 1.3GB in memory three times: once as a buffer, once as the return value of .splitLines() and once as the TextFileContent[]. If we skip the splitBlocks, then we just have the first two.
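
Concretely, I'm picturing control flow along these lines (the names here are just for illustration, not the actual importer code): the importer consumes splitLines() output directly and treats a blank line as the end of the current event, so the lines never get joined back into one big block string.

    // Consume lines one at a time; a blank line ends the current event.
    function parseEvents(lines: Iterable<string>): void {
      let eventLines: string[] = []
      for (const line of lines) {
        if (line === '') {
          if (eventLines.length > 0) {
            parseEvent(eventLines) // hypothetical per-event parser
            eventLines = []
          }
        } else {
          eventLines.push(line)
        }
      }
      if (eventLines.length > 0) parseEvent(eventLines)
    }

    function parseEvent(lines: string[]): void {
      // first line: the event header; remaining lines: the stack frames
      // (parsing details elided -- this is only meant to show the shape)
    }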

Really, ideally, we'd have only the one copy in the buffer: splitLines() would return an iterator instead, so we'd only have the 1.3GB in memory once. That's a separate issue from the one you're trying to solve, though.
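
If we did eventually go the iterator route, a generator over the underlying bytes is roughly what I have in mind (again, just a sketch, not the current TextFileContent implementation):

    // Yields one line at a time without ever building a single huge string.
    // Safe to slice on 0x0a: UTF-8 continuation bytes are never 0x0a.
    function* splitLinesFromBuffer(bytes: Uint8Array): Generator<string> {
      const decoder = new TextDecoder()
      const NEWLINE = 0x0a
      let start = 0
      for (let i = 0; i < bytes.length; i++) {
        if (bytes[i] === NEWLINE) {
          yield decoder.decode(bytes.subarray(start, i))
          start = i + 1
        }
      }
      if (start < bytes.length) {
        yield decoder.decode(bytes.subarray(start))
      }
    }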

Goose97 commented 1 year ago

Hi @jlfwong,

That makes a lot of sense. I'll try a PR going the way you suggested.

I guess I'll go ahead and convert the splitLines() to return an iterator first (neat idea btw), then come back to this issue. Is that cool?

jlfwong commented 1 year ago

> I guess I'll go ahead and convert the splitLines() to return an iterator first (neat idea btw), then come back to this issue. Is that cool?

Yeah, if you'd like to tackle that, then please go ahead!

Please make it a separate PR from the one addressing the issue described here (so one PR for the iterator, one for the perf file fix).

If, in the process of doing that, you realize it's kind of a pain to do, it's also fine to address the linux perf-file-specific issue without tackling the iterator change.