part-cw / lambdanative

LambdaNative is a cross-platform development environment written in Scheme, supporting Android, iOS, BlackBerry 10, OS X, Linux, Windows, OpenBSD, NetBSD, FreeBSD and OpenWrt.
http://www.lambdanative.org
Other
1.4k stars 86 forks source link

REDCAP parsing of json likely alters unicode #382

Closed karliwalti closed 4 years ago

karliwalti commented 4 years ago

When grabbing redcap data, unicode info is altered in one of the redcap functions, likely in (redcap:jsonstr->list str), more specifically in (json-decode str)

(redcap:data->string) is verbose with displaying redcap:data and str, add redcap to modules in DemoConsole : (display (redcap-export-metadata "redcap.ethz.ch" "XXX" 'format "json" 'forms "beweglichkeit"))

#u8(72 84 84 80 47 49 46 49 32 50 48 48 32 79 75 13 10 68 97 116 101 58 32 84 104 117 44 32 50 50 32 79 99 116 32 50 48 50 48 32 49 52 58 52 53 58 52 56 32 71 77 84 13 10 83 101 114 118 101 114 58 32 65 112 97 99 104 101 47 50 46 52 46 54 32 40 67 101 110 116 79 83 41 32 79 112 101 110 83 83 76 47 49 46 48 46 50 107 45 102 105 112 115 32 80 72 80 47 55 46 50 46 51 49 13 10 69 120 112 105 114 101 115 58 32 48 13 10 99 97 99 104 101 45 99 111 110 116 114 111 108 58 32 110 111 45 115 116 111 114 101 44 32 110 111 45 99 97 99 104 101 44 32 109 117 115 116 45 114 101 118 97 108 105 100 97 116 101 13 10 80 114 97 103 109 97 58 32 110 111 45 99 97 99 104 101 13 10 88 45 88 83 83 45 80 114 111 116 101 99 116 105 111 110 58 32 49 59 32 109 111 100 101 61 98 108 111 99 107 13 10 88 45 67 111 110 116 101 110 116 45 84 121 112 101 45 79 112 116 105 111 110 115 58 32 110 111 115 110 105 102 102 13 10 65 99 99 101 115 115 45 67 111 110 116 114 111 108 45 65 108 108 111 119 45 79 114 105 103 105 110 58 32 42 13 10 82 69 68 67 97 112 45 82 97 110 100 111 109 45 84 101 120 116 58 32 114 68 98 70 78 89 110 116 105 72 78 74 100 72 74 86 104 111 69 86 56 118 65 114 54 119 119 82 50 65 65 77 101 113 51 50 71 55 105 81 90 50 85 74 120 13 10 67 111 110 116 101 110 116 45 76 101 110 103 116 104 58 32 57 51 49 13 10 67 111 110 116 101 110 116 45 84 121 112 101 58 32 97 112 112 108 105 99 97 116 105 111 110 47 106 115 111 110 59 32 99 104 97 114 115 101 116 61 117 116 102 45 56 13 10 13 10 91 123 34 102 105 101 108 100 95 110 97 109 101 34 58 34 114 101 99 111 114 100 95 105 100 34 44 34 102 111 114 109 95 110 97 109 101 34 58 34 98 101 119 101 103 108 105 99 104 107 101 105 116 34 44 34 115 101 99 116 105 111 110 95 104 101 97 100 101 114 34 58 34 34 44 34 102 105 101 108 100 95 116 121 112 101 34 58 34 116 101 120 116 34 44 34 102 105 101 108 100 95 108 97 98 101 108 34 58 34 82 101 99 111 114 100 32 73 68 34 44 34 115 101 108 101 99 116 95 99 104 111 105 99 101 115 95 111 114 95 99 97 108 99 117 108 97 116 105 111 110 115 34 58 34 34 44 34 102 105 101 108 100 95 110 111 116 101 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 116 121 112 101 95 111 114 95 115 104 111 119 95 115 108 105 100 101 114 95 110 117 109 98 101 114 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 105 110 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 97 120 34 58 34 34 44 34 105 100 101 110 116 105 102 105 101 114 34 58 34 34 44 34 98 114 97 110 99 104 105 110 103 95 108 111 103 105 99 34 58 34 34 44 34 114 101 113 117 105 114 101 100 95 102 105 101 108 100 34 58 34 34 44 34 99 117 115 116 111 109 95 97 108 105 103 110 109 101 110 116 34 58 34 34 44 34 113 117 101 115 116 105 111 110 95 110 117 109 98 101 114 34 58 34 34 44 34 109 97 116 114 105 120 95 103 114 111 117 112 95 110 97 109 101 34 58 34 34 44 34 109 97 116 114 105 120 95 114 97 110 107 105 110 103 34 58 34 34 44 34 102 105 101 108 100 95 97 110 110 111 116 97 116 105 111 110 34 58 34 34 125 44 10 123 34 102 105 101 108 100 95 110 97 109 101 34 58 34 116 101 115 116 102 105 101 108 100 34 44 34 102 111 114 109 95 110 97 109 101 34 58 34 98 101 119 101 103 108 105 99 104 107 101 105 116 34 44 34 115 101 99 116 105 111 110 95 104 101 97 100 101 114 34 58 34 34 44 34 102 105 101 108 100 95 116 121 112 101 34 58 34 116 101 120 116 34 44 34 102 105 101 108 100 95 108 97 98 101 108 34 58 34 69 115 32 103 105 98 116 32 118 101 114 115 99 104 105 101 100 101 110 101 32 92 117 48 48 100 99 98 117 110 103 101 110 32 119 101 108 99 104 101 32 100 117 114 99 104 103 101 102 92 117 48 48 102 99 104 114 116 32 119 101 114 100 101 110 46 34 44 34 115 101 108 101 99 116 95 99 104 111 105 99 101 115 95 111 114 95 99 97 108 99 117 108 97 116 105 111 110 115 34 58 34 34 44 34 102 105 101 108 100 95 110 111 116 101 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 116 121 112 101 95 111 114 95 115 104 111 119 95 115 108 105 100 101 114 95 110 117 109 98 101 114 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 105 110 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 97 120 34 58 34 34 44 34 105 100 101 110 116 105 102 105 101 114 34 58 34 34 44 34 98 114 97 110 99 104 105 110 103 95 108 111 103 105 99 34 58 34 34 44 34 114 101 113 117 105 114 101 100 95 102 105 101 108 100 34 58 34 34 44 34 99 117 115 116 111 109 95 97 108 105 103 110 109 101 110 116 34 58 34 34 44 34 113 117 101 115 116 105 111 110 95 110 117 109 98 101 114 34 58 34 34 44 34 109 97 116 114 105 120 95 103 114 111 117 112 95 110 97 109 101 34 58 34 34 44 34 109 97 116 114 105 120 95 114 97 110 107 105 110 103 34 58 34 34 44 34 102 105 101 108 100 95 97 110 110 111 116 97 116 105 111 110 34 58 34 34 125 93 72 84 84 80 47 49 46 49 32 52 48 48 32 66 97 100 32 82 101 113 117 101 115 116 13 10 68 97 116 101 58 32 84 104 117 44 32 50 50 32 79 99 116 32 50 48 50 48 32 49 52 58 52 53 58 52 56 32 71 77 84 13 10 83 101 114 118 101 114 58 32 65 112 97 99 104 101 47 50 46 52 46 54 32 40 67 101 110 116 79 83 41 32 79 112 101 110 83 83 76 47 49 46 48 46 50 107 45 102 105 112 115 32 80 72 80 47 55 46 50 46 51 49 13 10 67 111 110 116 101 110 116 45 76 101 110 103 116 104 58 32 50 50 54 13 10 67 111 110 110 101 99 116 105 111 110 58 32 99 108 111 115 101 13 10 67 111 110 116 101 110 116 45 84 121 112 101 58 32 116 101 120 116 47 104 116 109 108 59 32 99 104 97 114 115 101 116 61 105 115 111 45 56 56 53 57 45 49 13 10 13 10 60 33 68 79 67 84 89 80 69 32 72 84 77 76 32 80 85 66 76 73 67 32 34 45 47 47 73 69 84 70 47 47 68 84 68 32 72 84 77 76 32 50 46 48 47 47 69 78 34 62 10 60 104 116 109 108 62 60 104 101 97 100 62 10 60 116 105 116 108 101 62 52 48 48 32 66 97 100 32 82 101 113 117 101 115 116 60 47 116 105 116 108 101 62 10 60 47 104 101 97 100 62 60 98 111 100 121 62 10 60 104 49 62 66 97 100 32 82 101 113 117 101 115 116 60 47 104 49 62 10 60 112 62 89 111 117 114 32 98 114 111 119 115 101 114 32 115 101 110 116 32 97 32 114 101 113 117 101 115 116 32 116 104 97 116 32 116 104 105 115 32 115 101 114 118 101 114 32 99 111 117 108 100 32 110 111 116 32 117 110 100 101 114 115 116 97 110 100 46 60 98 114 32 47 62 10 60 47 112 62 10 60 47 98 111 100 121 62 60 47 104 116 109 108 62 10)

HTTP/1.1 200 OK
Date: Thu, 22 Oct 2020 14:45:48 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.31
Expires: 0
cache-control: no-store, no-cache, must-revalidate
Pragma: no-cache
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Access-Control-Allow-Origin: *
REDCap-Random-Text: rDbFNYntiHNJdHJVhoEV8vAr6wwR2AAMeq32G7iQZ2UJx
Content-Length: 931
Content-Type: application/json; charset=utf-8

[{"field_name":"record_id","form_name":"beweglichkeit","section_header":"","field_type":"text","field_label":"Record ID","select_choices_or_calculations":"","field_note":"","text_validation_type_or_show_slider_number":"","text_validation_min":"","text_validation_max":"","identifier":"","branching_logic":"","required_field":"","custom_alignment":"","question_number":"","matrix_group_name":"","matrix_ranking":"","field_annotation":""},
{"field_name":"testfield","form_name":"beweglichkeit","section_header":"","field_type":"text","field_label":"Es gibt verschiedene \u00dcbungen welche durchgef\u00fchrt werden.","select_choices_or_calculations":"","field_note":"","text_validation_type_or_show_slider_number":"","text_validation_min":"","text_validation_max":"","identifier":"","branching_logic":"","required_field":"","custom_alignment":"","question_number":"","matrix_group_name":"","matrix_ranking":"","field_annotation":""}]HTTP/1.1 400 Bad Request
Date: Thu, 22 Oct 2020 14:45:48 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.31
Content-Length: 226
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
</body></html>

(((field_name . record_id) (form_name . beweglichkeit) (section_header . ) (field_type . text) (field_label . Record ID) (select_choices_or_calculations . ) (field_note . ) (text_validation_type_or_show_slider_number . ) (text_validation_min . ) (text_validation_max . ) (identifier . ) (branching_logic . ) (required_field . ) (custom_alignment . ) (question_number . ) (matrix_group_name . ) (matrix_ranking . ) (field_annotation . )) ((field_name . testfield) (form_name . beweglichkeit) (section_header . ) (field_type . text) (field_label . Es gibt verschiedene �bungen welche durchgef�hrt werden.) (select_choices_or_calculations . ) (field_note . ) (text_validation_type_or_show_slider_number . ) (text_validation_min . ) (text_validation_max . ) (identifier . ) (branching_logic . ) (required_field . ) (custom_alignment . ) (question_number . ) (matrix_group_name . ) (matrix_ranking . ) (field_annotation . )))

This means \u00fc is not converted correctly

Maybe a function like (u8vector->unicode u) is needed instead of (u8vector->string u ) ?

mgorges commented 4 years ago

What you are really asking is for something that changes escaped unicode characters into unicode characters, i.e. to replace \u00fc or (u8vector 92 117 48 48 100 99) with Ü or (u8vector 220). I believe the UTF8 implementation we have does things like (utf8string->unicode "\xc3;\xbc;") is (252) and conversely (unicode->utf8string (list 252)) is "\xc3;\xbc;"

karliwalti commented 4 years ago

yes indeed it goes into this direction. I had originally written code to convert them to escaped characters but this is as described in #361 no longer supported.

mgorges commented 4 years ago

This may work?

(define (u8vector->unicode-vector invec)
  (let* ((inveclen (u8vector-length invec))
         (outvec (make-vector inveclen))
         (outveclen 0)
         (armed? #f))
    (let loop ((i 0))
      (if (fx= i inveclen)
        (subvector outvec 0 outveclen)
        (begin
          (if armed?
            (case (u8vector-ref invec i)
              ((117)
                (let loop ((k 0) (val 0))
                  (if (fx= k 4)
                    (begin
                      (vector-set! outvec outveclen val)
                      (set! outveclen (fx+ outveclen 1))
                      (set! i (fx+ i 4))
                    )
                    (loop (fx+ k 1) (fx+ val (* (expt 16 (fx- 3 k))
                      (let ((v (u8vector-ref invec (fx+ k i 1))))
                        (if (fx>= v 97) (fx- v 87) (fx- v 48))))))
                  )))
               ((92)
                 (vector-set! outvec outveclen (u8vector-ref invec i))
                 (set! outveclen (fx+ outveclen 1))
                 (vector-set! outvec outveclen (u8vector-ref invec i))
                 (set! outveclen (fx+ outveclen 1))
               )
               (else
                 (vector-set! outvec outveclen 92)
                 (set! outveclen (fx+ outveclen 1))
                 (vector-set! outvec outveclen (u8vector-ref invec i))
                 (set! outveclen (fx+ outveclen 1))
               )
            )
            (if (not (fx= (u8vector-ref invec i) 92)) (begin
              (vector-set! outvec outveclen (u8vector-ref invec i))
              (set! outveclen (fx+ outveclen 1))
            ))
          )
          (set! armed? (fx= (u8vector-ref invec i) 92))
          (loop (fx+ i 1))
        )
      )
    )))

(define (u8vector->unicode-string vec)
  (unicode->utf8string (vector->list (u8vector->unicode-vector (subu8vector vec 0
    (let loop ((i 0))
      (if (or (fx= i (u8vector-length vec)) (fx= (u8vector-ref vec i) 0)) i
      (loop (+ i 1)))
    ))))))
karliwalti commented 4 years ago

yes, it works!

#u8(72 84 84 80 47 49 46 49 32 50 48 48 32 79 75 13 10 68 97 116 101 58 32 70 114 105 44 32 50 51 32 79 99 116 32 50 48 50 48 32 48 57 58 52 50 58 48 50 32 71 77 84 13 10 83 101 114 118 101 114 58 32 65 112 97 99 104 101 47 50 46 52 46 54 32 40 67 101 110 116 79 83 41 32 79 112 101 110 83 83 76 47 49 46 48 46 50 107 45 102 105 112 115 32 80 72 80 47 55 46 50 46 51 49 13 10 69 120 112 105 114 101 115 58 32 48 13 10 99 97 99 104 101 45 99 111 110 116 114 111 108 58 32 110 111 45 115 116 111 114 101 44 32 110 111 45 99 97 99 104 101 44 32 109 117 115 116 45 114 101 118 97 108 105 100 97 116 101 13 10 80 114 97 103 109 97 58 32 110 111 45 99 97 99 104 101 13 10 88 45 88 83 83 45 80 114 111 116 101 99 116 105 111 110 58 32 49 59 32 109 111 100 101 61 98 108 111 99 107 13 10 88 45 67 111 110 116 101 110 116 45 84 121 112 101 45 79 112 116 105 111 110 115 58 32 110 111 115 110 105 102 102 13 10 65 99 99 101 115 115 45 67 111 110 116 114 111 108 45 65 108 108 111 119 45 79 114 105 103 105 110 58 32 42 13 10 82 69 68 67 97 112 45 82 97 110 100 111 109 45 84 101 120 116 58 32 52 72 97 105 104 90 84 119 69 112 78 73 120 115 86 118 115 78 114 99 57 122 54 101 65 87 103 86 83 90 90 71 67 73 118 84 75 73 56 81 67 82 68 117 106 74 69 56 72 77 71 85 75 88 115 86 57 77 109 103 97 56 80 13 10 67 111 110 116 101 110 116 45 76 101 110 103 116 104 58 32 57 51 49 13 10 67 111 110 116 101 110 116 45 84 121 112 101 58 32 97 112 112 108 105 99 97 116 105 111 110 47 106 115 111 110 59 32 99 104 97 114 115 101 116 61 117 116 102 45 56 13 10 13 10 91 123 34 102 105 101 108 100 95 110 97 109 101 34 58 34 114 101 99 111 114 100 95 105 100 34 44 34 102 111 114 109 95 110 97 109 101 34 58 34 98 101 119 101 103 108 105 99 104 107 101 105 116 34 44 34 115 101 99 116 105 111 110 95 104 101 97 100 101 114 34 58 34 34 44 34 102 105 101 108 100 95 116 121 112 101 34 58 34 116 101 120 116 34 44 34 102 105 101 108 100 95 108 97 98 101 108 34 58 34 82 101 99 111 114 100 32 73 68 34 44 34 115 101 108 101 99 116 95 99 104 111 105 99 101 115 95 111 114 95 99 97 108 99 117 108 97 116 105 111 110 115 34 58 34 34 44 34 102 105 101 108 100 95 110 111 116 101 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 116 121 112 101 95 111 114 95 115 104 111 119 95 115 108 105 100 101 114 95 110 117 109 98 101 114 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 105 110 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 97 120 34 58 34 34 44 34 105 100 101 110 116 105 102 105 101 114 34 58 34 34 44 34 98 114 97 110 99 104 105 110 103 95 108 111 103 105 99 34 58 34 34 44 34 114 101 113 117 105 114 101 100 95 102 105 101 108 100 34 58 34 34 44 34 99 117 115 116 111 109 95 97 108 105 103 110 109 101 110 116 34 58 34 34 44 34 113 117 101 115 116 105 111 110 95 110 117 109 98 101 114 34 58 34 34 44 34 109 97 116 114 105 120 95 103 114 111 117 112 95 110 97 109 101 34 58 34 34 44 34 109 97 116 114 105 120 95 114 97 110 107 105 110 103 34 58 34 34 44 34 102 105 101 108 100 95 97 110 110 111 116 97 116 105 111 110 34 58 34 34 125 44 10 123 34 102 105 101 108 100 95 110 97 109 101 34 58 34 116 101 115 116 102 105 101 108 100 34 44 34 102 111 114 109 95 110 97 109 101 34 58 34 98 101 119 101 103 108 105 99 104 107 101 105 116 34 44 34 115 101 99 116 105 111 110 95 104 101 97 100 101 114 34 58 34 34 44 34 102 105 101 108 100 95 116 121 112 101 34 58 34 116 101 120 116 34 44 34 102 105 101 108 100 95 108 97 98 101 108 34 58 34 69 115 32 103 105 98 116 32 118 101 114 115 99 104 105 101 100 101 110 101 32 92 117 48 48 100 99 98 117 110 103 101 110 32 119 101 108 99 104 101 32 100 117 114 99 104 103 101 102 92 117 48 48 102 99 104 114 116 32 119 101 114 100 101 110 46 34 44 34 115 101 108 101 99 116 95 99 104 111 105 99 101 115 95 111 114 95 99 97 108 99 117 108 97 116 105 111 110 115 34 58 34 34 44 34 102 105 101 108 100 95 110 111 116 101 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 116 121 112 101 95 111 114 95 115 104 111 119 95 115 108 105 100 101 114 95 110 117 109 98 101 114 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 105 110 34 58 34 34 44 34 116 101 120 116 95 118 97 108 105 100 97 116 105 111 110 95 109 97 120 34 58 34 34 44 34 105 100 101 110 116 105 102 105 101 114 34 58 34 34 44 34 98 114 97 110 99 104 105 110 103 95 108 111 103 105 99 34 58 34 34 44 34 114 101 113 117 105 114 101 100 95 102 105 101 108 100 34 58 34 34 44 34 99 117 115 116 111 109 95 97 108 105 103 110 109 101 110 116 34 58 34 34 44 34 113 117 101 115 116 105 111 110 95 110 117 109 98 101 114 34 58 34 34 44 34 109 97 116 114 105 120 95 103 114 111 117 112 95 110 97 109 101 34 58 34 34 44 34 109 97 116 114 105 120 95 114 97 110 107 105 110 103 34 58 34 34 44 34 102 105 101 108 100 95 97 110 110 111 116 97 116 105 111 110 34 58 34 34 125 93 72 84 84 80 47 49 46 49 32 52 48 48 32 66 97 100 32 82 101 113 117 101 115 116 13 10 68 97 116 101 58 32 70 114 105 44 32 50 51 32 79 99 116 32 50 48 50 48 32 48 57 58 52 50 58 48 50 32 71 77 84 13 10 83 101 114 118 101 114 58 32 65 112 97 99 104 101 47 50 46 52 46 54 32 40 67 101 110 116 79 83 41 32 79 112 101 110 83 83 76 47 49 46 48 46 50 107 45 102 105 112 115 32 80 72 80 47 55 46 50 46 51 49 13 10 67 111 110 116 101 110 116 45 76 101 110 103 116 104 58 32 50 50 54 13 10 67 111 110 110 101 99 116 105 111 110 58 32 99 108 111 115 101 13 10 67 111 110 116 101 110 116 45 84 121 112 101 58 32 116 101 120 116 47 104 116 109 108 59 32 99 104 97 114 115 101 116 61 105 115 111 45 56 56 53 57 45 49 13 10 13 10 60 33 68 79 67 84 89 80 69 32 72 84 77 76 32 80 85 66 76 73 67 32 34 45 47 47 73 69 84 70 47 47 68 84 68 32 72 84 77 76 32 50 46 48 47 47 69 78 34 62 10 60 104 116 109 108 62 60 104 101 97 100 62 10 60 116 105 116 108 101 62 52 48 48 32 66 97 100 32 82 101 113 117 101 115 116 60 47 116 105 116 108 101 62 10 60 47 104 101 97 100 62 60 98 111 100 121 62 10 60 104 49 62 66 97 100 32 82 101 113 117 101 115 116 60 47 104 49 62 10 60 112 62 89 111 117 114 32 98 114 111 119 115 101 114 32 115 101 110 116 32 97 32 114 101 113 117 101 115 116 32 116 104 97 116 32 116 104 105 115 32 115 101 114 118 101 114 32 99 111 117 108 100 32 110 111 116 32 117 110 100 101 114 115 116 97 110 100 46 60 98 114 32 47 62 10 60 47 112 62 10 60 47 98 111 100 121 62 60 47 104 116 109 108 62 10)
HTTP/1.1 200 OK
Date: Fri, 23 Oct 2020 09:42:02 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.31
Expires: 0
cache-control: no-store, no-cache, must-revalidate
Pragma: no-cache
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Access-Control-Allow-Origin: *
REDCap-Random-Text: 4HaihZTwEpNIxsVvsNrc9z6eAWgVSZZGCIvTKI8QCRDujJE8HMGUKXsV9Mmga8P
Content-Length: 931
Content-Type: application/json; charset=utf-8

[{"field_name":"record_id","form_name":"beweglichkeit","section_header":"","field_type":"text","field_label":"Record ID","select_choices_or_calculations":"","field_note":"","text_validation_type_or_show_slider_number":"","text_validation_min":"","text_validation_max":"","identifier":"","branching_logic":"","required_field":"","custom_alignment":"","question_number":"","matrix_group_name":"","matrix_ranking":"","field_annotation":""},
{"field_name":"testfield","form_name":"beweglichkeit","section_header":"","field_type":"text","field_label":"Es gibt verschiedene Übungen welche durchgeführt werden.","select_choices_or_calculations":"","field_note":"","text_validation_type_or_show_slider_number":"","text_validation_min":"","text_validation_max":"","identifier":"","branching_logic":"","required_field":"","custom_alignment":"","question_number":"","matrix_group_name":"","matrix_ranking":"","field_annotation":""}]HTTP/1.1 400 Bad Request
Date: Fri, 23 Oct 2020 09:42:02 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.31
Content-Length: 226
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
</body></html>
(((field_name . record_id) (form_name . beweglichkeit) (section_header . ) (field_type . text) (field_label . Record ID) (select_choices_or_calculations . ) (field_note . ) (text_validation_type_or_show_slider_number . ) (text_validation_min . ) (text_validation_max . ) (identifier . ) (branching_logic . ) (required_field . ) (custom_alignment . ) (question_number . ) (matrix_group_name . ) (matrix_ranking . ) (field_annotation . )) ((field_name . testfield) (form_name . beweglichkeit) (section_header . ) (field_type . text) (field_label . Es gibt verschiedene Übungen welche durchgeführt werden.) (select_choices_or_calculations . ) (field_note . ) (text_validation_type_or_show_slider_number . ) (text_validation_min . ) (text_validation_max . ) (identifier . ) (branching_logic . ) (required_field . ) (custom_alignment . ) (question_number . ) (matrix_group_name . ) (matrix_ranking . ) (field_annotation . )))
karliwalti commented 4 years ago

So this would be the new function in redcap.scm https://github.com/part-cw/lambdanative/blob/a665d263d7e4efc43d4a7d0a313c33e3c46f9eb2/modules/redcap/redcap.scm#L53

(define (redcap:data->string)
  (let ((str (u8vector->unicode-string (subu8vector redcap:data 0 redcap:datalen))))
    (redcap:data-clear!)
    str))
mgorges commented 4 years ago

Please check if f06959c works or breaks in some other place. Thanks.

karliwalti commented 4 years ago

Unfortunately we are facing following issue now:

[ERROR] 2020-10-23 13:25:41: primordial: (string-append "H" "T" "T" "P" "/" "1" "." "1" " " "2" "0" "0" " " "O" "K" ...): Number of arguments exceeds implementation limit
[ERROR] 2020-10-23 13:25:41: trace: /home/walterk/SoftwareRepo/lambdanative/modules/redcap/redcap.scm line=54 col=14
[ERROR] 2020-10-23 13:25:41: trace: /home/walterk/SoftwareRepo/lambdanative/modules/redcap/redcap.scm line=201 col=60
karliwalti commented 4 years ago

if it helps, this is the strategy I took:

(define (handle-unicode-batch str)
  (log-status "Start converting new unicode as batch")
  (let* ((lstr (string->list str))
         (istr (map char->integer lstr))
         (isuni (map (lambda (m) (if (fx> m 127) 1 0)) istr)))
   (let loop ((prev 0) (next (list-pos isuni 1)) (is isuni) (output ""))
      (if (and is next)
        (let* ((charcode (list-ref istr (+ prev next)))
               (oct (number->octal charcode))
               (ab (substring  oct 1 3))
               (lowchar (octal->number ab))
               (tail (if (fx> (+ next 1) (length is)) '() (list-tail is (+ next 1))))
               (chunk (list->string (sublist lstr prev (+ prev next)))))
              (loop (+ prev next 1) (list-pos tail 1) tail (string-append output chunk (string (integer->char 92)) "30" (substring oct 0 1) (integer->utf8char  lowchar))))
        ;; When done the string, set back into it
        (begin (display prev)  (set! str (string-append output (list->string (list-tail lstr prev ))))))))  
   (log-status "Finished converting new unicode as batch") 
  str
)
mgorges commented 4 years ago

The error is in the call (apply string-append (map unicode->utf8string src))) - modules/ln_core/utf8string.scm#L291 will look into it, but it comes from the layer above. It passes the unit test, but I guess the string is too long.

mgorges commented 4 years ago

I have no idea on the performance impact of this change, but according to the gambit-list a way to work around the 'Number of arguments exceeds implementation limit' problem is to change apply to fold-right, i.e. make modules/ln_core/utf8string.scm#L291 into (else (fold-right string-append "" (map unicode->utf8string src))) - I believe it is quite bad, in a quick test of 100,000 iterations on a 1500 word string, it makes it about 33 times slower: 204sec vs. 6 sec.

mgorges commented 4 years ago

@0-8-15 might have some ideas with respect to the performance of changing the apply string-append into fold-right string-append "" or we need to implement this in C? @karliwalti on the REDCap side, we could also see if this can be applied only to the body responses, not the header, and need to verify that this is not applied to any binary files?

0-8-15 commented 4 years ago

@0-8-15 might have some ideas with respect to the performance of changing the apply string-append into fold-right string-append "" or we need to implement this in C?

NB: Just out of the head. Nothing tested, nor did I completely understand the issue.

As we are looking into a string-append situation, how about

(call-with-output-string
 (lambda (port)
  (for-each
    (lambda (element) (display (CONVERT element) port))
    INPUT-LIST))))

This should avoid C and still give a reasonable result.

To try with the example (still untested)

(call-with-output-string
 (lambda (port)
  (for-each
    (lambda (element) (display (unicode->utf8string element) port))
    src))))

Does this help?

mgorges commented 4 years ago

The call-with-output-string approach is a bit faster, 104.6sec vs. 5.8sec, but I haven't verified if the output is identical yet.

mgorges commented 4 years ago

To explain the issue a bit more the problem is that we do conversions of u8vectors to strings for very large strings as they are returned as part of a website response to a query. The problem occurs in modules/ln_core/utf8string.scm#L291. For me the limit is 8193 elements, but your milage may vary.

mgorges commented 4 years ago

Added in 6a63dea - the alternative is to make this conditional on (length src) and only for long strings to apply the new approach?

0-8-15 commented 4 years ago

Hm. 50% win over the worst case and still 20x the best case is not much of a win.

Late today, desktop is already off. The other alternative is roughly:

(define (concatenate-large-string-list lst)
 (let ((len (let loop ((i 0) (lst lst))
                  (if (null? lst) i
                      (loop
                        (+ i (string-length (car lst)))
                       (cdr lst)))))
          (result (make-string len)))
    (do ((lst lst (cdr lst)) (offset 0 (+ offset (string-length (car lst)))))
          ((null?  lst) result))
       (string-copy! result offset (car lst)))))

Note: that's just typed into the browser window here. Get the idea, but don't trust it for correctness.

mgorges commented 4 years ago

Putting some brackets in the right place this brings it down to 10.7 sec in my benchmark. Will do more testing but this looks promising.

mgorges commented 4 years ago

Not sure this is great code, but it is yet another 2sec faster at 8.1 sec in my benchmark:

(define (concatenate-large-string-list2 lst)
  (define len 0)
  (for-each (lambda (l) (set! len (fx+ len (string-length l)))) lst)
  (let ((offset 0)
        (result (make-string len)))
    (for-each (lambda (l)
       (string-copy! result offset l)
       (set! offset (fx+ offset (string-length l)))
    ) lst)
    result
  ))
mgorges commented 4 years ago

@karliwalti please check with 8a65b9a to see if it still works and has acceptable performance. If so, you can close this ticket.

karliwalti commented 4 years ago

I get a *** WARNING -- "string-copy!" is not defined, at compile under linux. Removing the exclamation mark leads to error: (string-copy: Wrong number of arguments passed to procedure

mgorges commented 4 years ago

Great - another one not defined in 4.7.9 ... it was added in https://github.com/gambit/gambit/commit/e2ef9dce0e95dd081752db8af423e94c1bfe66c5, which btw. seems to have unicode conversion already build in, as gambit/gambit:/lib/gambit/prim/string-r7rs%23.scm , introduced in https://github.com/gambit/gambit/commit/98ca3b41d2a9f86328f1136009f5bf965f1f0f58 and https://github.com/gambit/gambit/commit/29a8a4e1d923586a73a72fabdc1fb13543127ec6 which defines string->utf8 and utf8->string? Will need to find a different solution ...

mgorges commented 4 years ago

Fine, try again with 080e10c which replaced the string-copy! with another for-each using string-set! which exists in the older version too.

0-8-15 commented 4 years ago

Not sure this is great code, but it is yet another 2sec faster at 8.1 sec in my benchmark:

That's interesting. Now, a let-loop could reasonably be slower than a do. Out of curiosity, could you try your benchmark with this version:

(define (concatenate-large-string-list3 lst)
  (let* ((len (do ((i 0 (fx+ i (string-length (car lst))))
                   (lst lst (cdr lst)))
                  ((null? lst) i)))
         (result (make-string len)))
    (do ((lst lst (cdr lst))
         (offset 0 (fx+ offset (string-length (car lst)))))
        ((null? lst) result)
      (string-copy! result offset (car lst)))))

I'd like to learn whether or not the for-each is still faster tan the do.

mgorges commented 4 years ago

The new do approach you provided is 7.3 sec, while the for-each approach with string-set! is 8.5 sec. However, if I also replace the string-copy! with yet another do loop, as we have to use string-set! instead, it increases to 8.0 sec so a touch faster? But these timings do vary a little bit between runs on my CPU.

mgorges commented 4 years ago

@karliwalti other than considering to do another round of optimization of list->utf8string can you confirm that this works for you now?

karliwalti commented 4 years ago

I just finished my tests. It works as intended. I only tested with German texts.

The timing does not seem to be a serious issue in my applications

karliwalti commented 4 years ago

Thank you very much for changing this. I think it can be closed now

mgorges commented 4 years ago

ce98015 has the version with three dos in it; it's about 15% better than what we had previously.