metaeducation / ren-c

Library for embedding a Rebol interpreter into C codebases
GNU Lesser General Public License v3.0

READ/LINES causes data stack overflow with more than 400,000 lines #679

Open kealist opened 6 years ago

kealist commented 6 years ago

Running on Windows

Sample file that will throw the assert: http://downloads.majestic.com/majestic_million.csv

>> input: read/lines %majestic_million.csv
Assertion failed!

Program: C:\Users\kealist\Documents\git\kealist\LewisAutomation\r3-681ff1f-debug-cpp.exe
File: ../src/include/sys-stack.h, Line 110

Expression: v->header.bits & NODE_FLAG_CELL && NOT(v->header.bits & NODE_FLAG_END)

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
hostilefork commented 6 years ago

(cc: @zsx in case he has opinions)

This is a problem with delivering data stack overflow errors. The stack expansion process itself is failing, and it fails halfway through the expansion. So the invariants aren't right when it tries to clean up after the error.

That wasn't hard to fix--done. But now you just get a "data stack overflow" error, which presumably isn't what you wanted.

What's going on here is that many operations in Rebol (REDUCE, etc.) build up arrays on the data stack before actually creating a series for them. This way the size can be exactly right. e.g. if you say reduce [1 + 2 3 4] it knows the original array has 5 elements, but has no way of knowing the final array will have 3 elements. So as it goes, it pushes results to a "hot" common memory area...the data stack, which is repeatedly used for such operations. When it's done it creates an array of exactly length 3, and pops the values into that. The data stack is reused in subsequent operations.
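
To sketch the shape of that pattern, here's a toy model in C (not the actual Ren-C data stack code, so all the names are made up): values get pushed one at a time onto a growable shared stack, and only when the operation finishes does a right-sized array get allocated and filled from those pushes.

```c
/* Illustrative model of "accumulate on a reused stack, then pop into an
 * exactly-sized array".  All names are hypothetical, not Ren-C's. */
#include <stdlib.h>
#include <string.h>

typedef struct { int kind; long payload; } Cell;  /* stand-in for a value cell */

typedef struct { Cell *data; size_t top; size_t capacity; } DataStack;

static void ds_push(DataStack *ds, Cell v) {
    if (ds->top == ds->capacity) {                 /* grow on demand */
        ds->capacity = ds->capacity ? ds->capacity * 2 : 64;
        ds->data = realloc(ds->data, ds->capacity * sizeof(Cell));
    }
    ds->data[ds->top++] = v;
}

/* Pop everything pushed since `base` into a freshly allocated, right-sized
 * array, then drop the stack back to `base` so it can be reused. */
static Cell *pop_to_array(DataStack *ds, size_t base, size_t *len_out) {
    size_t len = ds->top - base;
    Cell *arr = malloc(len * sizeof(Cell));
    memcpy(arr, ds->data + base, len * sizeof(Cell));
    ds->top = base;                                /* balance the stack */
    *len_out = len;
    return arr;
}
```

Something like REDUCE would then call ds_push once per evaluated value, and pop_to_array once at the end to build the final block.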

The READ/LINES process similarly doesn't know how many lines the file will have in advance. So each of your lines in the file gets a REBVAL cell being pushed to the stack, prior to a big copy pulling them off.

Regardless of how much memory is actually available, the data stack has had a hardcoded "sanity check" for how big it can get... 400,000 cells. This originated from R3-Alpha, and hasn't been changed:

https://github.com/rebol/rebol/blob/25033f897b2bd466068d7663563cd3ff64740b94/src/include/sys-core.h#L41

Memory-wise, each cell is 4 platform pointers...so this is 6.4MB on 32-bit architectures and 12.8MB on 64-bit architectures.
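
Just as a sanity check on that arithmetic (the real constant lives in sys-core.h, linked above; this is only the math):

```c
#include <stdio.h>

int main(void) {
    size_t cells = 400000;                 /* hardcoded data stack ceiling */
    size_t cell_size = 4 * sizeof(void*);  /* each cell is 4 platform pointers */
    printf("%zu bytes\n", cells * cell_size);
    /* 32-bit pointers: 400000 * 16 = 6,400,000 bytes (~6.4MB)   */
    /* 64-bit pointers: 400000 * 32 = 12,800,000 bytes (~12.8MB) */
    return 0;
}
```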

In this case, all the cells you're dealing with represent STRING! elements that are lines. But you're limited to 400,000 lines, when you've got 1,000,000 of them.

I've been meaning to put in some heuristic where the data stack will shrink itself if it finds it's a certain amount bigger than it needs to be. That's definitely going to be needed if one allows the data stack to grow really large.
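
Such a heuristic might look roughly like this, reusing the toy DataStack from the sketch above (the 4x trigger, the 2x retained slack, and the names are all invented for illustration):

```c
/* Hypothetical shrink heuristic: if the capacity is far beyond what recent
 * use actually needed, hand memory back.  Illustration only. */
static void ds_maybe_shrink(DataStack *ds, size_t recent_high_water) {
    const size_t min_capacity = 1024;
    if (ds->capacity > min_capacity && ds->capacity > recent_high_water * 4) {
        size_t new_capacity = recent_high_water * 2;  /* keep some slack */
        if (new_capacity < min_capacity)
            new_capacity = min_capacity;
        ds->data = realloc(ds->data, new_capacity * sizeof(Cell));
        ds->capacity = new_capacity;
    }
}
```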

We might make this something configurable with a command-line parameter or similar. I don't know what the prevailing mindset is about whether programs should limit memory like this, or just wait until an allocation fails. Perhaps--if you get a minute--you could look around at other languages to see what they do in terms of putting limits on themselves...do they default to a limit that you have to override, or default to no limit unless you ask for one?

kealist commented 6 years ago

I ended up having to use Rebol 2 to complete this task because it worked, and I didn't want to implement the scanning through the file myself. Seems that's one of the "rest of the owl" tasks in all the Rebol documentation. I also tried reading the file in as a string and breaking it on newlines; SPLIT seemed unreasonably slow compared to the parse xxx none approach in Rebol 2, although I didn't spend much time looking at it. Maybe there is a way to work around this for the lazy like me in R3, but it might be good to have inbuilt support for scanning files if there is going to be a stack overflow issue with large file sizes.

hostilefork commented 6 years ago

Seems that's one of the "rest of the owl" tasks

That pretty much sums up R3-Alpha.

it might be good to have inbuilt support for scanning files if there is going to be a stack overflow issue with large file sizes.

READ/LINES didn't work on URL!s in R3-Alpha. That's partly a consequence of how the "extensible" actions could take an arbitrary number of refinements that were never processed, yet didn't error. By turning up the warnings and noticing unused refinements, this became more explicit:

https://github.com/metaeducation/ren-c/blob/d6acd3619c02d6beabb7ddf2a9738cd20fe3b1ad/src/core/p-clipboard.c#L100

After making that explicit and noticing exactly which refinements things were ignoring, the /LINES behavior was moved into the central READ. This meant all a port's READ had to do was worry about giving back a complete UTF-8-compatible BINARY!.

Of course this is going to be inefficient. It reads the whole BINARY!, converts it to a STRING! (unnecessary under UTF-8 Everywhere), pushes all the line elements to the data stack and makes it grow huge, then copies that data out into a new array. It would be more efficient if the data could be processed as it went and put directly into a target array, though (again) it's not easy to get the size right without an intermediate stack to use for the measurement.
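
Roughly, the current path could be modeled like this (a toy sketch reusing Cell, DataStack, ds_push, and pop_to_array from the earlier comment; the line-cell payload is invented, where the real cells would hold STRING! values): one push per line, followed by one big copy off the stack.

```c
/* Toy model of the current READ/LINES path.  Illustration only. */
static Cell make_line_cell(const char *start, size_t len) {
    Cell c;
    c.kind = 1;             /* pretend "1" tags a line/string cell */
    c.payload = (long)len;  /* stand-in payload: just the line length */
    (void)start;
    return c;
}

static Cell *split_lines(DataStack *ds, const char *buf, size_t size,
                         size_t *count) {
    size_t base = ds->top;
    size_t line_start = 0;
    for (size_t i = 0; i < size; ++i) {
        if (buf[i] == '\n') {   /* a million-line file means a million pushes */
            ds_push(ds, make_line_cell(buf + line_start, i - line_start));
            line_start = i + 1;
        }
    }
    if (line_start < size)      /* final line with no trailing newline */
        ds_push(ds, make_line_cell(buf + line_start, size - line_start));
    return pop_to_array(ds, base, count);  /* one big copy off the stack */
}
```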

And we should probably think about this pervasive concept of "push to data stack memory, then create precisely sized series from that." Once you cross a certain data stack size, you probably want to go ahead and start building memory that your array can just use directly...then if it winds up being too big, let the GC shrink it later.

We wouldn't have to be afraid of making arrays a little too big if there was some background process that could compact them.
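
One shape such a strategy switch could take, again as a toy sketch on top of the earlier DataStack model (the threshold and the names are invented, and the "let the GC shrink it later" part is only hinted at by leaving slack in the buffer):

```c
/* Hypothetical strategy switch: results accumulate on the shared data stack
 * while small, then migrate into a dedicated buffer the final array could
 * own directly.  The buffer may end up bigger than needed; compaction would
 * be deferred to something like the GC.  Illustration only. */
enum { DIRECT_BUILD_THRESHOLD = 4096 };  /* arbitrary cutoff */

typedef struct {
    DataStack *ds;     /* shared stack, used while the result is small */
    size_t base;       /* stack position where this result began */
    Cell *direct;      /* dedicated buffer once the threshold is crossed */
    size_t len, cap;
} ResultBuilder;

static void rb_init(ResultBuilder *rb, DataStack *ds) {
    rb->ds = ds;
    rb->base = ds->top;
    rb->direct = NULL;
    rb->len = 0;
    rb->cap = 0;
}

static void rb_append(ResultBuilder *rb, Cell v) {
    if (rb->direct == NULL && rb->ds->top - rb->base < DIRECT_BUILD_THRESHOLD) {
        ds_push(rb->ds, v);             /* small case: stay on the data stack */
        return;
    }
    if (rb->direct == NULL) {           /* crossing the threshold: migrate */
        rb->len = rb->ds->top - rb->base;
        rb->cap = rb->len * 2;          /* accept being "a little too big" */
        rb->direct = malloc(rb->cap * sizeof(Cell));
        memcpy(rb->direct, rb->ds->data + rb->base, rb->len * sizeof(Cell));
        rb->ds->top = rb->base;         /* give the shared stack its cells back */
    }
    if (rb->len == rb->cap) {
        rb->cap *= 2;
        rb->direct = realloc(rb->direct, rb->cap * sizeof(Cell));
    }
    rb->direct[rb->len++] = v;
}
```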

hostilefork commented 6 years ago

Ok, I think I have a general strategy for dealing with this, and some other things...like the performance questions that /INTO was trying to address.

Background

The point of using the data stack as an intermediate holding area for results is that Rebol tries very hard to allocate a "right-sized" array for something whose count it doesn't know in advance. If you say:

>> reduce [1 + 3]
== [4]

REDUCE is not psychic regarding the size of the output array. It doesn't know, for instance, whether you have done +: 2, in which case it will produce [1 2 3]. The only thing it knows today is that you won't wind up with a longer array than you put in. (Though with new language constructs like macros, that might not stay a guarantee. And things like COMPOSE already can't assume the result will be smaller...compose [a (b) c] can wind up splicing an arbitrary amount of material in.)

To avoid creating a wrong-sized array it uses a "fast" available place (the data stack) to accumulate the values. Since the data stack is always getting pushed and popped and reused for many purposes, its size isn't usually something to worry about.

When the operation is finished, the number of pushes to the data stack can be counted. REDUCE asks for a precisely-sized block for that much, and copies the results into it. Then it drops the stack back down so it balances, and those cells can be reused for the next stack client.

As it happens, just because it asks for a precisely-sized block doesn't mean it gets one. The memory pools are optimized to hand back only certain sizes. If it gets a memory block that's bigger than it needs, it can do something it couldn't do with plain malloc...which is to glom onto that overage as usable capacity if the block gets appended to.
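
A simplified model of that pooling behavior (bucket sizes and names are made up, and Cell is from the earlier toy sketch): the request gets rounded up to a pool bucket, and the series remembers the whole bucket as capacity so later appends can use the slack.

```c
/* Toy model of pooled series allocation.  Illustration only. */
typedef struct {
    Cell *head;
    size_t length;    /* cells in use */
    size_t capacity;  /* cells the pool actually handed back */
} Array;

static size_t pool_bucket(size_t cells) {
    static const size_t buckets[] = { 8, 16, 32, 64, 128, 256 };
    for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); ++i)
        if (cells <= buckets[i])
            return buckets[i];
    return cells;     /* big requests fall through to an exact allocation */
}

static Array make_array(size_t cells_needed) {
    Array a;
    a.capacity = pool_bucket(cells_needed);  /* may be more than asked for */
    a.length = cells_needed;
    a.head = malloc(a.capacity * sizeof(Cell));
    return a;         /* the slack is free room for later appends */
}
```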

New Plan

I've been hammering on the array allocation logic to try and get it to be faster. Whether or not it's as fast as it can possibly be, it certainly can be pretty fast. So I think we should gamble: have operations like REDUCE and COMPOSE pick a guessed size for the output array up front and build directly into it, and when the guess comes up short, patch the overage in through a special "splice" cell instead of going through the data stack.

This will create discontiguous arrays, which most of Rebol is not designed to process. But we could make it transparent to ordinary operations by saying that if an array op asks for the head of the data (with the intent of skipping around in it via pointer increments and direct indexing), the consolidation happens at that point. Special operations could instead be aware of the splices and integrate them as they enumerate.

So imagine something like compose [a (reduce [1 2 3]) b]. Let's say it guesses wrong, and allocates a 2-unit array for the reduce when it needed at least 3. Instead of putting the 2 in the second slot, it puts a special splice cell there...and let's say that points to a new 2-element array allocated to hold the overage, into which it puts 2 3.

Then COMPOSE can be one of the smart operations that when it is splicing in arrays, is able to handle the discontiguousness. As it walks it can be sensitive to the splices and follow them.
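
To make that concrete, here's a small standalone sketch (invented types, not a proposal for the actual cell layout) of a "splice" cell plus a walker that follows splices transparently, the way a splice-aware COMPOSE or FOR-EACH might:

```c
/* Sketch of a "splice cell" and an enumerator that follows it.  A cell is
 * either an ordinary value or a pointer to an overflow array holding the
 * values that didn't fit.  Types and names are invented for illustration. */
#include <stdio.h>
#include <stddef.h>

typedef struct SpliceCell SpliceCell;
struct SpliceCell {
    int kind;                  /* 0 = ordinary value, 1 = splice */
    long value;                /* payload when kind == 0 */
    const SpliceCell *splice;  /* overflow array when kind == 1 */
    size_t splice_len;
};

/* Visit every logical element, transparently following splice cells.
 * (Recursion means an overflow array could itself end in another splice.) */
static void walk(const SpliceCell *arr, size_t len, void (*visit)(long)) {
    for (size_t i = 0; i < len; ++i) {
        if (arr[i].kind == 1)
            walk(arr[i].splice, arr[i].splice_len, visit);
        else
            visit(arr[i].value);
    }
}

static void print_value(long v) { printf("%ld\n", v); }

int main(void) {
    /* Models the example above: a 2-slot array for reduce [1 2 3] where the
     * guess was too small, so the second slot is a splice pointing at the
     * overage array holding 2 and 3. */
    SpliceCell overage[] = { {0, 2, NULL, 0}, {0, 3, NULL, 0} };
    SpliceCell guessed[] = { {0, 1, NULL, 0}, {1, 0, overage, 2} };
    walk(guessed, 2, print_value);   /* prints 1, 2, 3 */
    return 0;
}
```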

I'm not proposing that it become overcomplicated to the point where you see everything done with abstract iterators. Most of the system will ask for ARR_HEAD(...) of an array, and at that time splices would be reconciled. But I imagine a handful of operations being aware would make a big difference, just think INSERT/CHANGE/APPEND, REDUCE/COMPOSE, and the XXX-EACH operations.

Hopefully, between that and the gamble on picking a right size usually working out, this would be a large win. And it should solve the data stack size issue.