stefan-m-lenz / JuliaConnectoR

A functionally oriented interface for calling Julia from R
Other
100 stars 6 forks source link

Detection of NAs in double is too strict, NA_real_ + 0 not recognized #10

Closed gmbecker closed 2 years ago

gmbecker commented 2 years ago

R defines a REALSXP (ie double) NA as any element with an NaN whose payload is 1957. This notably does not include a specification for the value at the signalling bit which differentiated quiet NaNs from signalling NaNs. In fact, which of these a double typed NA in R is is somewhat unpredictable credit to @brodieG:

example(numToBits) ## for bitC

(output omitted for brevity)

> bitC(NA_real_)
[1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
> bitC(NA_real_ + 0)
[1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010

Note that both of these values are completely valid NAs in R, but they are bitwise different. You currently only check for bitwise identicallity to the first "pristine" NA value, thus if you have "dirty" NAs in numeric vectors we do not get the right behavior:

> juliaLet("ismissing(x)", x = NA_real_ + 0)
[1] FALSE

This is because it is coming through as an NaN float64 value, rather than being flagged as needing the union with missing and having the missing value.

stefan-m-lenz commented 2 years ago

Thanks, this is a very interesting observation! I could reproduce this on Linux (not on Windows, though). Could you point me to the documentation that contains this exact definition of what is exactly treated as NA? ( @brodieG ?) I could not find the information in the R language definition. I think it would suffice to change the line https://github.com/stefan-m-lenz/JuliaConnectoR/blob/608591ef02b4a8a845c4e1752707246e4d222200/inst/Julia/reading.jl#L94 and apply a bitmask that fits this definition instead of comparing only with the NA_real_ value, this time with the correct definition of NA values in R.

gmbecker commented 2 years ago

A very cursory check of the manuals isn't surfacing it for me either, but we can see it in the source code:

https://github.com/wch/r-source/blob/trunk/src/main/arithmetic.c#L126-L137

In particular, the ieee_double union defines the .word interpretation as interpreting the bits that make up a double as two unsigned ints, so the check linked above checks if the "second"/"lower" (taking endianness into account) int is equal to 1957. Note the bit difference, ie the signaling bit for NaNs, is in the "First" or "higher" int, so this check does not notice which type of NaN our NA is.

Hope that helps.

stefan-m-lenz commented 2 years ago

Thanks for pointing to the source code. But the magic number there is 1954, not 1957. After a Google search for "r realsexp na 1954" I found some more information: https://github.com/hadley/r-internals/blob/master/vectors.md#special-values I will try to figure out the appropriate Julia code to emulate the R_IsNA function. If you have a suggestion for this, please let me know.

brodieG commented 2 years ago

The canonical reference is in the R-data manual. It documents the "low" word. This is consistent with the semantics of R_IsNA. And yes, the payload is 1954.

If that is not possible, on all current platforms IEC 60559 (aka IEEE 754) arithmetic is used, so standard C facilities can be used to test for or set Inf, -Inf and NaN values. On such platforms NA is represented by the NaN value with low-word 0x7a2 (1954 in decimal).

gmbecker commented 2 years ago

Ah you're right I wrote that initial message too fast my apologies. Of course that is why neither of us could find 1957.

On Wed, Oct 13, 2021, 6:48 AM Brodie Gaslam @.***> wrote:

The canonical reference is in the R-data manual https://cran.r-project.org/doc/manuals/r-devel/R-data.html#Special-values. It documents the "low" word. This is consistent with the semantics of R_IsNA. And yes, the payload is 1954.

If that is not possible, on all current platforms IEC 60559 (aka IEEE 754) arithmetic is used, so standard C facilities can be used to test for or set Inf, -Inf and NaN values. On such platforms NA is represented by the NaN value with low-word 0x7a2 (1954 in decimal).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stefan-m-lenz/JuliaConnectoR/issues/10#issuecomment-942326461, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG53MOKFTQNY4BD5N4Z54TUGWEZ7ANCNFSM5F3LM5XA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

stefan-m-lenz commented 2 years ago

I fixed this issue in the branch https://github.com/stefan-m-lenz/JuliaConnectoR/tree/FixIssueNo2. You can try it out via

devtools::install_github("stefan-m-lenz/JuliaConnectoR", ref= "FixIssueNo10")

Now:

> juliaLet("ismissing(x)", x = NA_real_ + 0)
[1] TRUE

You can see the code that emulates the R_IsNA function here: https://github.com/stefan-m-lenz/JuliaConnectoR/blob/cbb5bfc0194082ed1eff213e50d888b1654a1cb5/inst/Julia/reading.jl#L60-L82 This is then used here: https://github.com/stefan-m-lenz/JuliaConnectoR/blob/cbb5bfc0194082ed1eff213e50d888b1654a1cb5/inst/Julia/reading.jl#L117-L124

I think that should cover it for double values. I wasn't entirely sure whether the low-word is always 32bit or could also be 16 bit. @brodieG Do you know?

Before I merge it into master, I have to fix my tests, which are now broken due to minor issues in dependencies in a test of an example.

I also would like to fix the same issue for complex numbers but I haven't really gotten my head around this. NAs are handled even more differently in complex numbers. I would need a similar function represents_na(re::Float64, im::Float64). If you have a suggestion how to do this, please let me know.

gmbecker commented 2 years ago

Complex values in R are, essentially, two parallel real vectors, each of which have the numeric vector behavior, internally. Note that while is.na returns TRUE if either the real or imaginary portion is NA, any non-na value of the other component is not lost:

> x <- complex(real = c(5, NA, NA, 5), imaginary=c(5, 5, NA,  NA))
> is.na(x)
[1] FALSE  TRUE  TRUE  TRUE
> Re(x)
[1]  5 NA NA  5
> Im(x)
[1]  5  5 NA NA

So your existing functionality, provided it works for double-valued numeric vectors, should work fine, I expect, for complexes. You'll have to decide whether you're going to make the complex have both components be a union any time is.na would return true, or only if that component needs it, I suppose, but thats to be expected anyway i'd think?

brodieG commented 2 years ago

I think that should cover it for double values. I wasn't entirely sure whether the low-word is always 32bit or could also be 16 bit. @brodieG Do you know?

That's a good question. I don't think there is a canonical answer. Certainly the code assumes 32 bit based on using two 32 bit integers to access a 64 bit double, but word size is variable and I think currently in most common R platforms are actually 64 bit. Perhaps at the time this code was written 32 bits was a common word size (seems likely). There is also the question of whether integers can be assumed to be 32 bits (I believe this assumption is documented) an doubles 64 bits (this assumption is not documented AFAICT, but the R internal code does assume it).

This to say I would just assume word == 32bit and hope that remains true in the future. It likely will.

I don't know Julia at all, but some comments on the implementation:

 const magicnumber1954shifted = (UInt(1954) << 32) 

Should that be Uint64? Again, I know nothing about Julia, but it seems UInt could be 32 bits in some cases.

Separately: do you need to do anything about endianess? IIRC your original code used ntoh or some such which would have handled it. I think you're okay here, if we assume that like in C the shift operator is operating on the concept of binary value, but this stuff always give me heartburn so I wanted to raise it on the off chance you hadn't considered it.

gmbecker commented 2 years ago

Yeah, i had the same 😬 reaction to using the bit shift. like, I figured it was "probably ok" but I wasn't going to dig down into Julia's spec to make sure.

In terms of ints being possibly 16 bits, I believe this would make R not compile/work. As @brodieG alluded to, it is documented that int (and thus unsigned int) are 32 bits on all platforms R suppports, in the R internals manual ( https://cran.r-project.org/doc/manuals/r-release/R-ints.html#The-_0027data_0027):

INTSXP length, truelength followed by a block of C ints (which are 32 bits on all R platforms).

So the upper an lower word will each be 32 bits/4 bytes, always.

stefan-m-lenz commented 2 years ago
 const magicnumber1954shifted = (UInt(1954) << 32) 

Should that be Uint64? Again, I know nothing about Julia, but it seems UInt could be 32 bits in some cases.

Absolutely it should! Thanks for spotting this! UInt is 32-bit on 32-bit systems, indeed!

Thank you also for the clarification about the question on 32-bit vs 16-bit word length.

It is also good that you mention the endianness. I think, however, the solution with the shift operator is independent of endianness. The documentation of the Julia << operator says:

Left bit shift operator, x << n. For n >= 0, the result is x shifted left by n bits, filling with 0s. This is equivalent to x * 2^n.

Concerning the complex numbers, I have to think a little bit more what the best way of handling the NAs is. Thanks for your thoughts so far.

stefan-m-lenz commented 2 years ago

Complex values in R are, essentially, two parallel real vectors, each of which have the numeric vector behavior, internally. Note that while is.na returns TRUE if either the real or imaginary portion is NA, any non-na value of the other component is not lost:

> x <- complex(real = c(5, NA, NA, 5), imaginary=c(5, 5, NA,  NA))
> is.na(x)
[1] FALSE  TRUE  TRUE  TRUE
> Re(x)
[1]  5 NA NA  5
> Im(x)
[1]  5  5 NA NA

So your existing functionality, provided it works for double-valued numeric vectors, should work fine, I expect, for complexes. You'll have to decide whether you're going to make the complex have both components be a union any time is.na would return true, or only if that component needs it, I suppose, but thats to be expected anyway i'd think?

Thanks for this interesting example.

In contrast to R, a Julia Complex value cannot have one component missing and one defined. Both components need to be of the same Real type. This means, a 1:1 translation is not possible in all cases. If the JuliaConnectoR translated x[2] or x[4] in your example above to a missing value in Julia, the value of the non-missing part would be lost. Therefore I think R complex values with only one NA value in either real or imaginary part should not be translated to missing values in Julia.

I have modified the code for the complex values now in a similar way to the code with the double values:

https://github.com/stefan-m-lenz/JuliaConnectoR/blob/668a90b1d849fbbe31d0cf245d9bedf287011613/inst/Julia/reading.jl#L101-L110

With this now:

> dirtyNA <- NA_real_ + 0
> dirtyComplexNA <- complex(real = dirtyNA, imaginary = dirtyNA)
> juliaCall("ismissing", dirtyComplexNA)
[1] TRUE

If you like, let me know your thoughts. I am planning to release a vrsion 1.0.1 on CRAN soon with these updates.

stefan-m-lenz commented 2 years ago

I've merged the changes into master. I am still integrating some other changes before I create a 1.0.1 and put it on CRAN.

brodieG commented 2 years ago

Great. I don't use complex values enough, or know enough about Julia and missing values vs NaN to have any useful feedback, so I won't comment on your implementation.

stefan-m-lenz commented 2 years ago

Thanks for you feedback anyway! I would be really surprised if someone turns up that uses complex numbers with missing values and wants to interface with Julia from R, so I guess there will be no complaints about this.

gmbecker commented 2 years ago

You might consider throwing a warning when someone comes to the JuliaConnectoR interface with the type of Complex partially missing values that Julia has no way of supporting, rather than silently doing a lossy conversion.

On Tue, Oct 26, 2021 at 11:46 AM Stefan Lenz @.***> wrote:

Thanks for you feedback anyway! I would be really surprised if someone turns up that uses complex numbers with missing values and wants to interface with Julia from R, so I guess there will be no complaints about this.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stefan-m-lenz/JuliaConnectoR/issues/10#issuecomment-952213471, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG53MNPTMP2343NLBS3DX3UI4APTANCNFSM5F3LM5XA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

stefan-m-lenz commented 2 years ago

I am not doing a lossy conversion. The only thing is that a value such as NA + 1i in R is translated to a value that is not treated differently than NaN + im in Julia. But when translated back, the real value in this example will also be again NA in R.

stefan-m-lenz commented 2 years ago

The changes are integrated in release 1.1.0, which is also available from CRAN.