justinethier / cyclone

:cyclone: A brand-new compiler that allows practical application development using R7RS Scheme. We provide modern features and a stable system capable of generating fast native binaries.
http://justinethier.github.io/cyclone/
MIT License
823 stars 42 forks source link

Memory usage and tail calls #534

Open yorickhardy opened 5 months ago

yorickhardy commented 5 months ago

Hello,

I am trying out (srfi 18) in cyclone and noticed that memory was being "consumed" very quickly by a simple monitoring thread, the following small example seems to exhibit this behaviour:

(define (busy x) (busy (+ x 1)))
(busy 0)

Similarly, the cyclone compiler can be encouraged to use excessive memory (and compile time) by:

(define (compile-forever x) x (compile-forever x))
(compile-forever 0)

Both examples work as expected in icyc (with regards to memory use).

Am I doing something unreasonable here? I am using cyclone-0.36.0 on NetBSD 10.99.10 amd64 (current-ish).

justinethier commented 5 months ago

Hi @yorickhardy. First of all, thanks for the report! These both seem like real issues, let me find some time this week to take a closer look.

justinethier commented 4 months ago

Regarding:

(define (busy x) (busy (+ x 1)))
(busy 0)

Suspect this is related to the amount of memory added to a thread after major GC. Need to look into this more.

Regarding the compile-forever example, there is a problem with the CPS optimization beta expansion phase where it is failing to detect a recursive call, and executing forever. Looks like the problem (or at least part of it) is that analyze:find-recursive-calls does not properly handle edge cases where an ast:lambda call is made, or later on when the optimizer changes such calls to use Cyc-seq, EG:

(analyze:find-recursive-calls                                                    
  scan                                                                           
  app                                                                            
  (Cyc-seq x$1$2 (compile-forever k$5 x$1$2))) 
yorickhardy commented 4 months ago

Thanks!

I am sorry that I have not contributed any fixes, I am going to (eventually) try to put more time into understanding cyclone and help where/if I can.

justinethier commented 4 months ago

No worries, and bug reports are always appreciated! I would welcome fixes as well, however, these two issues in particular are in areas that would be difficult to track down... and I still need to investigate the first one :)

justinethier commented 4 months ago

As this runs, + will eventually cause a conversion from fixnum integer to a bignum. I suspect that cyclone is inefficiently allocating memory for bignums in this use case and that is the cause of the high memory usage.

Consider the memory usage when using fixnums or doubles:

(import (srfi 143))                                                             
(define (busy x) (busy (fx+ x 1)))                                              
(busy 0)    
(define (busy x) (busy (+ x 1.0)))                                              
(busy 0.0)    

That said, I suspect we can do better here, especially since the interpreters can. Will need to spend time looking into this further.

justinethier commented 4 months ago

Snippet of code from Cyc_sum:

    } else if (tx == bignum_tag && ty == -1) { \                                 
        BIGNUM_CALL(mp_init(&bn_tmp2)); \                                        
        Cyc_int2bignum(obj_obj2int(y), &bn_tmp2); \                               
        BIGNUM_CALL(BN_OP(&(x->bignum_t.bn), &bn_tmp2, &(x->bignum_t.bn))); \    
        mp_clear(&bn_tmp2); \                                                    

Compare with code from Cyc_fast_sum used in compiled code:

  if (is_object_type(x) && type_of(x) == bignum_tag) {                           
    if (obj_is_int(y)) {                                                         
      mp_int bny;                                                                
      BIGNUM_CALL(mp_init(&bny));                                                
      Cyc_int2bignum(obj_obj2int(y), &bny);                                      
      alloc_bignum(data, bn);                                                    
      BIGNUM_CALL(mp_add(&bignum_value(x), &bny, &bignum_value(bn)));            
      mp_clear(&bny);                                                            
      return bn;

Questions: Why are we doing an allocation here but not above, and can we safely speed up / optimize the latter code?

yorickhardy commented 4 months ago

I did not realize that it had switched over to bignum! Thanks. I have more or less isolated the memory usage problems that I was encountering (firstly, I was using many short lived threads instead of a thread pool and assumed that thread-join would (eventually) garbage collect the thread: that assumption is false as far as I can tell). Here is another example with bounded(?) but large memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 9949 yorick    25    0   556M  425M CPU/1       0:17 97.91% 56.10% test
(define (loop-test)
  (let ( (o (open-output-string)) )
    (display "abc" o)
    (close-output-port o)
    (loop-test) ) )

(loop-test)

and similarly with constantly growing memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 3563 yorick    26    0    13G   10G CPU/0       0:49   164%   128% test

(define output-port (make-parameter #f))

(define (loop-test)
  (parameterize ( (output-port (open-output-string)) )
    (display "abc" (output-port))
    (close-output-port (output-port))
    (loop-test) ) )

(loop-test)

which surprised me! This simple example grows a bit slower but in the same way:

(define a-string (make-parameter #f))

(define (loop-test)
 (parameterize ( (a-string "123") )
  (loop-test) ) )

(loop-test)

Strangely, icyc manages fine. I guess the use of parameterize here is not great since one can easily re-write this example to avoid the nested parameterizations.

I am not sure whether these examples are helpful, or just examples of poorly written scheme!

At this point I am not convinced that the remaining part is a valid issue, or poor programming on my part - so please close the issue if you are satisfied.

yorickhardy commented 4 months ago

Hello again,

The memory consumption that motivated this issue is all due to the use of many threads, via srfi-18 and Cyc_scm_call. I have switched to thread pools and Cyc_scm_call_no_gc and the memory consumption is now normal.

I was very surprised that terminated threads consume memory. Nevertheless, I don't think the issue is entirely valid as reported, since the underlying problem was threads (not GC and tail calls). Apologies if I have wasted too much of your time.

(I do appreciate that you have looked into the reported examples and that you are willing to investigate improvements.)

Thanks!

justinethier commented 4 months ago

Glad you got it working @yorickhardy!

I appreciate your feedback and think there are genuine issues that are being raised here, though I have not dug into your latest loop-test examples. Let me spend some time looking into everything you brought up to see what can be improved.

justinethier commented 4 months ago

@yorickhardy Do you have an example of thread-join not freeing up memory?

yorickhardy commented 4 months ago

Sure! I hope I have not missed anything obvious ...

(import (scheme base) (scheme write) (srfi 18))

(define (monitor n)
  (thread-yield!)
  n)

(define (wait-for-monitor next)
  (let ( (t (make-thread (lambda () (monitor next)))) )
    (thread-start! t)
    (thread-yield!)
    (thread-join! t)
    (wait-for-monitor (+ next 1.0)) ) )

(wait-for-monitor 0.0)