jython / jython

Python for the Java Platform
https://www.jython.org

Filereader: TypeError: read(): 1st arg can't be coerced to java.nio.CharBuffer, char[] #236

Open wupz2k opened 1 year ago

wupz2k commented 1 year ago

Works in 2.7.2 and earlier, but fails in 2.7.3.

java -jar ./jython-standalone-2.7.3.jar
Jython 2.7.3 (tags/v2.7.3:5f29801fe, Sep 10 2022, 18:52:49)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_361
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.io import FileReader
>>> import jarray
>>> reader = FileReader('./test1.txt')
>>> chars = jarray.zeros(2048, 'c')
>>> num = reader.read(chars)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: read(): 1st arg can't be coerced to java.nio.CharBuffer, char[]
>>> exit()

jeff5 commented 1 year ago

I believe this is a consequence of fixing #119. 'c' should be byte-sized, and now is.

It is reasonable to ask how the same effect (a bulk read of UTF-16 code points) should be achieved now.

jeff5 commented 1 year ago

I think there is a case for implementing the 'c' array element type with Java char, but it is difficult to predict the impact elsewhere. Previously it was implemented that way, but then the array module had to lie about its item size in order to meet expectations elsewhere that the array should contain bytes. I think the elements would have to read out as unicode.

If we read some non-ASCII text through a FileReader in 2.7.2, we get the file contents through the default encoding. On my machine (Windows) that is not UTF-8, so I have to take the trouble to decode the file correctly. The characters arrive in the array as Java chars, but reading them out is problematic.

PS 236> jython
Jython 2.7.2 (v2.7.2:925a3cc3b49d, Mar 21 2020, 10:03:58)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_321
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.io import FileReader, FileInputStream, InputStreamReader
>>> from java.nio.charset import Charset
>>> import jarray
>>> reader = InputStreamReader(FileInputStream('greek.txt'), Charset.forName('UTF-8'))
>>> reader.read(chars)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'chars' is not defined
>>> chars =  jarray.zeros(40, 'c')
>>> reader.read(chars)
11
>>> k = chars[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
java.lang.IllegalArgumentException: Cannot create PyString with non-byte value

In the particular application, I think the file should be treated as bytes.

PS Jython-2> dist\bin\jython
Jython 2.7.4a1-DEV (heads/master:c0a5d43f3, Jul 10 2023, 10:20:01)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_321
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.io import FileReader, FileInputStream, InputStreamReader
>>> from java.nio.charset import Charset
>>> import jarray
>>> reader = FileInputStream('greek.txt')
>>> buf =  jarray.zeros(40, 'b')
>>> reader.read(buf)
21

The bytes can then optionally be decoded to characters with the correct codec.

>>> from java.nio import ByteBuffer
>>> u = Charset.forName('UTF-8').decode(ByteBuffer.wrap(buf.tostring()))
>>> repr(u)
u'\u03ba\u03c5\u03b2\u03b5\u03c1\u03bd\u03ae\u03c4\u03b7\u03c2\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'```

This would all be more usable if PyArray were coercible to ByteBuffer but it's late to be adding features.
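In the meantime, note that the session above decodes the whole 40-byte buffer, which is why the result ends in a run of NUL characters. A minimal sketch of decoding only the bytes actually read follows. The helper name `decode_prefix` is hypothetical, and the sketch assumes the buffer behaves like a signed-byte (`'b'`) array, as both `jarray.zeros(40, 'b')` and the standard `array` module provide.

```python
from array import array


def decode_prefix(buf, count, encoding='utf-8'):
    """Decode the first `count` signed bytes of `buf` to text.

    `buf` is assumed to be an array of type code 'b' (values -128..127),
    and `count` the number returned by the stream's read(buf) call.
    """
    # Mask each signed byte to 0..255 so bytearray accepts it.
    raw = bytearray((b & 0xFF) for b in buf[:count])
    return raw.decode(encoding)


# Hypothetical use under Jython, following the session above:
#   n = reader.read(buf)          # 21 in the example
#   text = decode_prefix(buf, n)  # no trailing NULs
```

Slicing before decoding keeps the unused tail of the buffer out of the result. A read that ends part-way through a multi-byte sequence would still fail to decode, so a real loop would need to carry the incomplete bytes over to the next read.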

ericmshort commented 9 months ago

I have run into this same issue on 2.7.3 and have been trying to work around it. If I use FileInputStream, it works without any errors. If I use InputStreamReader(FileInputStream(...)), it throws the error. The same happens with BufferedReader. I then discovered that the error is thrown regardless of whether I use 'c' or 'b' in my buffer.

jeff5 commented 8 months ago

@ericmshort: you can read into a CharBuffer.
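A rough sketch of that suggestion follows, assuming Jython selects the `Reader.read(CharBuffer)` overload when it is handed a `java.nio.CharBuffer`. The function name `read_text`, its parameters, and the default capacity are hypothetical. The Java imports sit inside the function so that the module can still be imported where the Java classes are unavailable.

```python
def read_text(path, encoding='UTF-8', capacity=2048):
    """Read a whole text file through java.io using a CharBuffer.

    Intended for Jython: the java.* imports below only resolve there.
    """
    from java.io import FileInputStream, InputStreamReader
    from java.nio import CharBuffer

    # Name the charset explicitly rather than rely on the platform default.
    reader = InputStreamReader(FileInputStream(path), encoding)
    try:
        buf = CharBuffer.allocate(capacity)
        parts = []
        # read(CharBuffer) returns -1 at end of stream.
        while reader.read(buf) != -1:
            buf.flip()                    # switch from filling to draining
            parts.append(buf.toString())  # remaining chars as a string
            buf.clear()                   # make the buffer ready to fill again
        return u''.join(parts)
    finally:
        reader.close()
```

Under Jython, `read_text('greek.txt')` would be expected to return the file contents as a unicode string, since the decoding is done by the `InputStreamReader` and no `'c'` array is involved.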

Perhaps there should be an element type in array that makes this usage possible.

'u' has meant UTF-32 for a while, so maybe 'U' for UTF-16? The other way round is more intuitive, but less compatible with previous behaviour. Bit of work to get right.

jeff5 commented 8 months ago

Given that the use case (reading a file through java.io to Python) can be served by other means, I don't think this is pressing for 2.7.4. Going via UTF-16 pseudo-characters in Python is not necessarily helpful.