Closed VolodyaCO closed 3 years ago
Do you know if you are using a version 2 or version 3 rdata file?
Unfortunately I don't
Could you show me the first bytes of the file? There should be a magic number there that can be used to infer the file format. Also, the Unix file command can sometimes tell you that, if the format is known.
head Consulta.rda
[binary output omitted: the raw bytes of the compressed file are not printable]
file Consulta.rda
Consulta.rda: gzip compressed data, from HPFS filesystem (OS/2, NT), original size modulo 2^32 67449241
Ok, and can you unzip it with gzip? (Note that rdata will do it automatically if the file is well formed; I only need you to do this manually because I want to inspect the header of the uncompressed file.)
After unzipping:
file Consulta
Consulta: data
head -3 Consulta
RDX3
X
[binary data omitted]
Ok, so the magic number RDX3 is for version 3 of the RData format, which is very recent. Currently, this package only supports version 2.
I have not had enough time to look at the changes that version 3 brings to the data format. As far as I know, the format is VERY similar, but the header has a new field containing the default encoding of the strings. It should not be very difficult to modify the code to skip that part (or to use it as intended, for strings without encoding info), but I do not have time for this right now.
So, two options here:
If you have access to an R installation, you can load the data and save it again specifying the version 2 format.
OR
You can try to understand the additions to the format and modify the parser to allow this format, maybe ignoring this new field as a first implementation. If you want to do that I will review your PR.
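For anyone who wants to check the format version without the shell, the inspection done above can be sketched in Python. This is only an illustration using the standard library, not the parser's actual code: the function name sniff_rdata_version is made up, and the sketch assumes the data is either gzip-compressed or uncompressed and only recognizes the binary magic strings RDX2 and RDX3.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"                    # first two bytes of any gzip stream
RDATA_MAGIC = {b"RDX2\n": 2, b"RDX3\n": 3}  # binary (XDR) RData headers


def sniff_rdata_version(data: bytes):
    """Return the RData format version (2 or 3), or None if unrecognized."""
    if data[:2] == GZIP_MAGIC:
        # Same manual step as running gzip on the file by hand.
        data = gzip.decompress(data)
    return RDATA_MAGIC.get(data[:5])


# Synthetic example: a gzip-compressed buffer starting with the version 3 magic.
sample = gzip.compress(b"RDX3\nX\n" + b"\x00" * 8)
print(sniff_rdata_version(sample))
```

For a real file, the bytes would come from something like open('Consulta.rda', 'rb').read().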
Ok, let me find if my team can export this to version 2 format. If not, I'll try to find the time to submit a PR.
I know I told you that I did not have time (and maybe I should have been doing other things), but I was bored tonight and... well, look at the feature/version3 branch ;-)
Please, tell me if that is enough for your use case, or you need full version 3 support.
After installing with pip install git+https://github.com/vnmabus/rdata.git@feature/version3, I obtained this:
>>> import rdata
>>> parsed = rdata.parser.parse_file('Consulta.rda')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 515, in parse_file
return parse_data(data)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 600, in parse_data
return parse_data(gzip.decompress(data))
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 607, in parse_data
raise NotImplementedError("Unknown file type")
NotImplementedError: Unknown file type
I renamed .rda to .gz, but the output didn't change.
Hmm, that should not happen. Can you verify that the installation has been done properly? To do that please tell me if the following code executes without errors:
import rdata
print(rdata.parser._parser.FileTypes.rdata_binary_v3)
Yeah I did not install it properly. I uninstalled it and then installed it again. Now it's in the right version, but I'm still getting an error:
>>> import rdata
>>> print(rdata.parser._parser.FileTypes.rdata_binary_v3)
FileTypes.rdata_binary_v3
>>> parsed = rdata.parser.parse_file('Consulta.rda')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 545, in parse_file
return parse_data(data)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 631, in parse_data
return parse_data(gzip.decompress(data))
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 636, in parse_data
return parse_rdata_binary(view)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 652, in parse_rdata_binary
return parser.parse_all()
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 267, in parse_all
obj = self.parse_R_object()
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 343, in parse_R_object
car = self.parse_R_object(reference_list)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 396, in parse_R_object
value[i] = self.parse_R_object(reference_list)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 396, in parse_R_object
value[i] = self.parse_R_object(reference_list)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 396, in parse_R_object
value[i] = self.parse_R_object(reference_list)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 355, in parse_R_object
f"Length of CHAR cannot be {length}")
NotImplementedError: Length of CHAR cannot be -1
Interesting. A few weeks ago, in response to #1, I changed the code to assume that empty strings always have length 0 (I had -1 before, but I did not remember why, and in my tests the empty string has length 0). I have no problem changing it again to accept strings with length -1 as empty, but I would like to know (if you have that info) how that string came to be, in order to add a proper test.
I don't have that info, sorry. It can be a null value of some field, maybe?
Well, that is unfortunate, but let's solve your problem first. I have changed the code to accept -1 length as the empty string. Please let me know if your data can be imported now, and if you miss some field. I will try to investigate the -1 length later.
It seems that you are right! R strings are nullable, and an NA string is stored as a string with length -1. I have changed the code in the branch to parse that as None in Python, and added a test for that. Please tell me if it works for you, so that I can merge the PR.
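The rule described here (length -1 means NA, length 0 means the empty string) can be sketched as follows. This is a simplified illustration, not the code in the branch: the function name read_char is hypothetical, and the sketch assumes the length is stored as a 4-byte big-endian signed integer, as in the XDR variant of the format.

```python
import io
import struct


def read_char(stream):
    """Read one length-prefixed string; a length of -1 is treated as NA (None)."""
    (length,) = struct.unpack(">i", stream.read(4))
    if length == -1:
        return None  # NA string
    if length < 0:
        raise ValueError(f"Length of CHAR cannot be {length}")
    return stream.read(length)


na = io.BytesIO(struct.pack(">i", -1))
empty = io.BytesIO(struct.pack(">i", 0))
word = io.BytesIO(struct.pack(">i", 3) + b"abc")
print(read_char(na), read_char(empty), read_char(word))
```

Keeping None distinct from b"" preserves the difference between a missing value and an empty string.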
The file parses now, but I cannot convert the result, apparently due to a Unicode issue:
>>> import rdata
>>> parsed = rdata.parser.parse_file('Consulta.rda')
>>> converted = rdata.conversion.convert(parsed)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 617, in convert
return SimpleConverter(*args, **kwargs).convert(data)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 454, in convert
return self._convert_next(data)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 486, in _convert_next
value = convert_list(obj, self._convert_next)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 75, in convert_list
return {tag: conversion_function(r_list.value[0]), **cdr}
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 521, in _convert_next
value = convert_vector(obj, self._convert_next, attrs=attrs)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in convert_vector
conversion_function(o) for o in r_vec.value
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in <listcomp>
conversion_function(o) for o in r_vec.value
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 521, in _convert_next
value = convert_vector(obj, self._convert_next, attrs=attrs)
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in convert_vector
conversion_function(o) for o in r_vec.value
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in <listcomp>
conversion_function(o) for o in r_vec.value
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 516, in _convert_next
value = [self._convert_next(o) for o in obj.value]
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 516, in <listcomp>
value = [self._convert_next(o) for o in obj.value]
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 502, in _convert_next
default_encoding=self.default_encoding_used,
File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 207, in convert_char
return r_char.value.decode("utf_8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: unexpected end of data
Ok, it seems that some of your strings are malformed (!?). I have changed the code to allow for those, only with a warning. Let me know if it works.
There are a lot of decoding errors, all over the data:
home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:180: UserWarning: Exception while decoding b'El entrevistado referencia sobre situaciones de mediaci\xf3n con LET ME MASK THIS para la entrega de LET ME MASK THIS para los cuales fue delegado por la Di\xf3cesis. Expresa sus percepciones sobre el I ALSO MASK THIS que tuvo oportunidad de conocer con ocasi\xf3n del ejercicio de su sacerdocio.': 'utf-8' codec can't decode byte 0xf3 in position 55: invalid continuation byte
f"Exception while decoding {byte_str!r}: {e}",
Sorry, I masked some info because it's confidential. It seems that the decoding problems are due to Spanish accents, as in "mediación" and so on.
Yes, it seems that the encoding really is "latin1" but is wrongly marked. May I ask you to print the extra field of the RData object returned by parse_file and tell me what it is?
BTW, maybe you can force latin1 encoding by calling convert with the additional argument default_encoding="latin1" (but even if that works I would like to know the encoding selected by default, in order to fix the problem).
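The mismatch can be reproduced in isolation with one of the words from the warning. This is a standalone sketch using only Python's built-in codecs; it does not involve rdata at all.

```python
raw = b"mediaci\xf3n"  # bytes taken from the warning message above

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xf3 starts a multi-byte sequence in UTF-8, but the next byte is plain "n".
    print("utf-8 failed:", exc.reason)

# In latin1, 0xf3 is the single character "ó".
text = raw.decode("latin1")
print(text)
```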
Sorry for the delay. Was too busy this weekend.
The extra field is this one:
>>> parsed.extra
RExtraInfo(encoding='CP1252')
With the latin1 encoding I get errors of this sort:
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:180: UserWarning: Exception while decoding b'C\xf3ndor': 'utf-8' codec can't decode byte 0xf3 in position 1: invalid continuation byte
f"Exception while decoding {byte_str!r}: {e}",
Ok, the only way I see for that to happen is if your strings are marked with the wrong encoding. I have added an option (force_default_encoding) to ignore the encoding of each particular string and always use the default encoding. Please try setting that to True and tell me if the problem persists.
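Roughly, the idea behind such an option can be sketched like this. It is not the library's implementation: the function decode_char and its parameters declared, default and force_default are hypothetical names chosen for the illustration.

```python
def decode_char(value, declared=None, default="utf-8", force_default=False):
    """Decode bytes with the per-string encoding unless forced to the default."""
    if force_default or declared is None:
        encoding = default
    else:
        encoding = declared
    return value.decode(encoding)


# A string wrongly flagged as UTF-8 inside a file whose default is CP1252:
print(decode_char(b"C\xf3ndor", declared="utf-8", default="cp1252",
                  force_default=True))
```

Without force_default=True, the same call would try UTF-8 first and raise a UnicodeDecodeError, which matches the warning shown above.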
Now I get this:
converted = rdata.conversion.convert(parsed, force_default_encoding=True)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:180: UserWarning: Exception while decoding b'R\xcd\x8dO BRAVO': 'charmap' codec can't decode byte 0x8d in position 2: character maps to <undefined>
f"Exception while decoding {byte_str!r}: {e}",
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "POSIXct". The constructor for class "POSIXt" will be used instead.
stacklevel=1)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "POSIXt". The underlying R object is returned instead.
stacklevel=1)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "tbl_df". The constructor for class "tbl" will be used instead.
stacklevel=1)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "tbl". The constructor for class "data.frame" will be used instead.
stacklevel=1)
However, the data can now be accessed. Thanks a lot for your help.
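The one remaining decoding warning seems to come from byte 0x8d, which Python's cp1252 codec leaves unmapped. The sketch below reproduces it with the bytes from the warning and shows one lenient fallback. The helper decode_lenient is a hypothetical name, and substituting a replacement character is just one possible policy, not necessarily what the library does.

```python
raw = b"R\xcd\x8dO BRAVO"  # bytes taken from the warning message above


def decode_lenient(value, encoding):
    """Decode strictly first; on failure, substitute U+FFFD for bad bytes."""
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        return value.decode(encoding, errors="replace")


lenient = decode_lenient(raw, "cp1252")
print(repr(lenient))

# latin1 maps every byte to some code point, so it never fails,
# but 0x8d then becomes an invisible control character.
print(repr(raw.decode("latin1")))
```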
If you want to implement a more precise conversion for a custom R class, you can use your own conversion routine. Check the documentation for that: https://rdata.readthedocs.io/en/latest/simpleusage.html#convert-custom-r-classes
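As an illustration of what such a routine might look like for the POSIXct class mentioned in the warnings, here is a minimal sketch. The (obj, attrs) signature and the assumption that obj is an iterable of seconds since the Unix epoch are mine; how the function is registered with rdata is described in the linked documentation and is not shown here.

```python
from datetime import datetime, timezone


def posixct_constructor(obj, attrs):
    """Convert seconds since the Unix epoch into timezone-aware UTC datetimes."""
    # attrs is unused in this sketch; a fuller version might read a time zone
    # from it. Missing values are assumed to arrive as None.
    return [
        None if seconds is None
        else datetime.fromtimestamp(seconds, tz=timezone.utc)
        for seconds in obj
    ]


print(posixct_constructor([0.0, 86400.0, None], {}))
```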
I have released a new version (0.3) including these changes, so you no longer need to install the development version. Thank you for helping me resolve these issues!
Hi, I'm trying to use your library. I failed with pyreadr because of a LibrdataError (there was an invalid byte sequence).
Now I'm doing this:
but it fails:
do you know why this might be happening?