Open wupz2k opened 1 year ago
I believe this is a consequence of fixing #119. `'c'` should be byte-sized, and now is.
It is reasonable to ask how the same effect (a bulk read of UTF-16 code units) should be achieved now.
I think there is a case for implementing the `'c'` type of array element with `char`, but it is difficult to predict the impact elsewhere. Previously it was implemented that way, but then the `array` module had to lie about its item size in order to meet expectations elsewhere that the array should contain bytes. I think they would have to read out as `unicode`.
If we read some non-ASCII text through a `FileReader` in 2.7.2, we get the file contents through the default encoding. On my machine (Windows) it is not UTF-8, and so I have to take the trouble to decode the file correctly. The characters arrive in the array as Java `char`s, but reading them out is problematic.
```
PS 236> jython
Jython 2.7.2 (v2.7.2:925a3cc3b49d, Mar 21 2020, 10:03:58)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_321
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.io import FileReader, FileInputStream, InputStreamReader
>>> from java.nio.charset import Charset
>>> import jarray
>>> reader = InputStreamReader(FileInputStream('greek.txt'), Charset.forName('UTF-8'))
>>> reader.read(chars)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'chars' is not defined
>>> chars = jarray.zeros(40, 'c')
>>> reader.read(chars)
11
>>> k = chars[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
java.lang.IllegalArgumentException: Cannot create PyString with non-byte value
```
In the particular application, I think the file should be treated as bytes.
```
PS Jython-2> dist\bin\jython
Jython 2.7.4a1-DEV (heads/master:c0a5d43f3, Jul 10 2023, 10:20:01)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_321
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.io import FileReader, FileInputStream, InputStreamReader
>>> from java.nio.charset import Charset
>>> import jarray
>>> reader = FileInputStream('greek.txt')
>>> buf = jarray.zeros(40, 'b')
>>> reader.read(buf)
21
```
And then optionally decoded to characters with the correct codec.
```
>>> from java.nio import ByteBuffer
>>> u = Charset.forName('UTF-8').decode(ByteBuffer.wrap(buf.tostring()))
>>> repr(u)
u'\u03ba\u03c5\u03b2\u03b5\u03c1\u03bd\u03ae\u03c4\u03b7\u03c2\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
```
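The trailing `\x00` characters in that result come from decoding the whole 40-byte buffer rather than only the 21 bytes that `read` reported. As a minimal sketch of the idea in plain Python (not Jython-specific): a zero-filled `bytearray` stands in for the `jarray` buffer and a slice assignment stands in for `reader.read(buf)`, both of which are assumptions made for illustration.

```python
# The sample text: ten Greek letters plus a newline (11 characters).
text = u"\u03ba\u03c5\u03b2\u03b5\u03c1\u03bd\u03ae\u03c4\u03b7\u03c2\n"
data = text.encode("utf-8")      # each Greek letter takes 2 bytes: 21 bytes in all

buf = bytearray(40)              # zero-filled, playing the role of the 40-byte buffer
count = len(data)                # plays the role of the value returned by read()
buf[:count] = data               # stand-in for filling the buffer from the file

# Decoding the whole buffer keeps the unused zero bytes as NUL characters.
whole = bytes(buf).decode("utf-8")

# Decoding only the bytes actually read gives back the text alone.
trimmed = bytes(buf[:count]).decode("utf-8")
```

On the Java side, `ByteBuffer.wrap(byte[] array, int offset, int length)` offers the same restriction to the bytes read. Whether Jython coerces the array argument as needed is not verified here.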
This would all be more usable if `PyArray` were coercible to `ByteBuffer`, but it's late to be adding features.
I have run into this same issue on 2.7.3 and have been trying to work around it. If I use FileInputStream, it works without any errors. If I use InputStreamReader(FileInputStream(...)), it throws the error. The same happens with BufferedReader. I then discovered that the error is thrown regardless of whether I use 'c' or 'b' in my buffer.
@ericmshort: you can read into a CharBuffer.
Perhaps there should be an element type in `array` that makes this usage possible.
'u' has meant UTF-32 for a while, so maybe 'U' for UTF-16? The other way round is more intuitive, but less compatible with previous behaviour. Bit of work to get right.
Given that the use case (reading a file through `java.io` to Python) can be served by other means, I don't think this is pressing for 2.7.4. Going via UTF-16 pseudo-characters in Python is not necessarily helpful.
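To illustrate why a UTF-16 element type would differ from a UTF-32 one: text within the Basic Multilingual Plane needs one unit per character in either encoding, but a supplementary character occupies two UTF-16 code units (a surrogate pair). A small sketch in plain Python, where the helper names `utf16_units` and `utf32_units` are hypothetical and not part of Jython or the `array` module:

```python
def utf16_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf32_units(text):
    """Count the 32-bit units (one per code point) needed as UTF-32."""
    return len(text.encode("utf-32-le")) // 4


# Ten Greek letters, all inside the Basic Multilingual Plane.
greek = u"\u03ba\u03c5\u03b2\u03b5\u03c1\u03bd\u03ae\u03c4\u03b7\u03c2"

# One code point outside the BMP (hypothetical sample input).
supplementary = u"\U0001F600"

greek_counts = (utf16_units(greek), utf32_units(greek))
supplementary_counts = (utf16_units(supplementary), utf32_units(supplementary))
```

For the Greek sample the two counts agree, so the distinction only shows up once text contains characters beyond the BMP.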
Works in 2.7.2 and before, but fails in 2.7.3.