vnmabus / rdata

Reader of R datasets in .rda format, in Python
https://rdata.readthedocs.io
MIT License

NotImplementedError: Unknown file type #6

Closed VolodyaCO closed 3 years ago

VolodyaCO commented 3 years ago

Hi, I'm trying to use your library. I failed with pyreadr because of a LibrdataError (there was an invalid byte sequence).

Now I'm doing this:

import rdata

parsed = rdata.parser.parse_file('../input/Consulta.rda')

but it fails:

parsed = rdata.parser.parse_file('../input/Consulta.rda')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
~/Documents/Projects/ComisionDeLaVerdad/label-descriptive-analysis/common_labels/src/delete_me.py in <module>
----> 7 parsed = rdata.parser.parse_file('../input/Consulta.rda')

~/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py in parse_file(file_or_path)
    513             binary_file = buffer
    514         data = binary_file.read()
--> 515     return parse_data(data)
    516 
    517 

~/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py in parse_data(data)
    598         return parse_data(bz2.decompress(data))
    599     elif filetype is FileTypes.gzip:
--> 600         return parse_data(gzip.decompress(data))
    601     elif filetype is FileTypes.xz:
    602         return parse_data(lzma.decompress(data))

~/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py in parse_data(data)
    605         return parse_rdata_binary(view)
    606     else:
--> 607         raise NotImplementedError("Unknown file type")
    608 
    609 

NotImplementedError: Unknown file type

do you know why this might be happening?

vnmabus commented 3 years ago

Do you know if you are using a version 2 or version 3 rdata file?

VolodyaCO commented 3 years ago

Unfortunately I don't

vnmabus commented 3 years ago

Could you show me the first bytes of the file? There should be a magic number there that can be used to infer the file format. Also, sometimes the file Unix command can tell you that, if the format is known.
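
A minimal sketch of how the first bytes can be checked from Python instead of the shell. The helper names `sniff_container` and `sniff_path` are hypothetical and not part of rdata; the magic numbers are the standard ones for gzip, bzip2 and xz.

```python
# Standard magic numbers of the compression containers that appear
# in the traceback above (gzip, bz2, xz).
CONTAINER_MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def sniff_container(first_bytes: bytes) -> str:
    """Return the container name for a known magic number, else 'unknown'."""
    for magic, name in CONTAINER_MAGIC.items():
        if first_bytes.startswith(magic):
            return name
    return "unknown"


def sniff_path(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8))
```

For example, `sniff_path("Consulta.rda")` would be expected to report the same container as the `file` command.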

VolodyaCO commented 3 years ago

head Consulta.rda
[binary output omitted: gzip-compressed bytes, not printable as text]

file Consulta.rda
Consulta.rda: gzip compressed data, from HPFS filesystem (OS/2, NT), original size modulo 2^32 67449241

vnmabus commented 3 years ago

Ok, and can you unzip it with gzip? (Note that rdata will do it automatically if the file is well formed, I only need you to do this manually because I want to inspect the header of the uncompressed file).
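
If unzipping by hand is inconvenient, the header of the uncompressed stream can also be read with the standard library. This is only a sketch; `header_from_gzip_bytes` is a hypothetical helper, not an rdata function.

```python
import gzip
import io


def header_from_gzip_bytes(raw: bytes, n: int = 16) -> bytes:
    """Return the first n bytes of the gzip-decompressed payload.

    Reading through GzipFile avoids decompressing the whole file
    just to look at its header.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
        return f.read(n)
```

With a file on disk, `gzip.open(path, "rb").read(16)` does the same job.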

VolodyaCO commented 3 years ago

After unzipping:

file Consulta
Consulta: data

head -3 Consulta
RDX3
X
[binary output omitted: serialized R data, not printable as text]

vnmabus commented 3 years ago

Ok, so the magic number RDX3 is for version 3 of the RData format, which is very recent. Currently, this package only supports version 2.

I have not had enough time to look at the changes that version 3 brings to the data format. As far as I know, the format is VERY similar, but the header has a new field containing the default encoding of the strings. It should not be very difficult to modify the code to skip that part (or to use it as intended, for strings without encoding info), but I currently do not have time for this.
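
To make the header difference concrete, here is a rough sketch of reading it. The layout is an assumption based on my understanding of R's XDR serialization (magic line, format line, three big-endian integers, and in version 3 a length-prefixed encoding name); it is not copied from the rdata parser, and `parse_rdata_header` is a hypothetical name.

```python
import struct


def parse_rdata_header(data: bytes) -> dict:
    """Parse the start of an uncompressed, XDR-format RData stream.

    Assumed layout: b"RDX2\n" or b"RDX3\n", then b"X\n", then three
    big-endian 32-bit ints (format version, writer R version, minimum
    reader R version). Version 3 adds a length-prefixed encoding name.
    """
    magic = data[:5]
    if magic not in (b"RDX2\n", b"RDX3\n"):
        raise ValueError(f"Not an RData stream: {magic!r}")
    if data[5:7] != b"X\n":
        raise ValueError("Only the XDR binary format is handled here")
    pos = 7
    version, _writer, _min_reader = struct.unpack_from(">iii", data, pos)
    pos += 12
    encoding = None
    if version >= 3:
        # The new version 3 field: default encoding of the strings.
        (length,) = struct.unpack_from(">i", data, pos)
        pos += 4
        encoding = data[pos:pos + length].decode("ascii")
        pos += length
    return {"version": version, "encoding": encoding, "body_offset": pos}
```

A parser that only wants to skip the new field can jump straight to `body_offset` and continue as it did for version 2.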

So, two options here:

If you have access to an R installation, you can load the data and save it again specifying the version 2 format.

OR

You can try to understand the additions to the format and modify the parser to allow this format, maybe ignoring this new field as a first implementation. If you want to do that I will review your PR.

VolodyaCO commented 3 years ago

Ok, let me find out whether my team can export this to version 2 format. If not, I'll try to find the time to submit a PR.

vnmabus commented 3 years ago

I know I told you that I did not have time (and maybe I should have been doing other things), but I was bored tonight and... well, look at the feature/version3 branch ;-)

Please tell me if that is enough for your use case, or if you need full version 3 support.

VolodyaCO commented 3 years ago

After installing with pip install git+https://github.com/vnmabus/rdata.git@feature/version3 I obtained this:

>>> import rdata
>>> parsed = rdata.parser.parse_file('Consulta.rda')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 515, in parse_file
    return parse_data(data)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 600, in parse_data
    return parse_data(gzip.decompress(data))
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 607, in parse_data
    raise NotImplementedError("Unknown file type")
NotImplementedError: Unknown file type

I renamed .rda to .gz but the output didn't change

vnmabus commented 3 years ago

Hmm, that should not happen. Can you verify that the installation has been done properly? To do that please tell me if the following code executes without errors:

import rdata

print(rdata.parser._parser.FileTypes.rdata_binary_v3)
VolodyaCO commented 3 years ago

Yeah I did not install it properly. I uninstalled it and then installed it again. Now it's in the right version, but I'm still getting an error:

>>> import rdata
>>> print(rdata.parser._parser.FileTypes.rdata_binary_v3)
FileTypes.rdata_binary_v3
>>> parsed = rdata.parser.parse_file('Consulta.rda')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 545, in parse_file
    return parse_data(data)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 631, in parse_data
    return parse_data(gzip.decompress(data))
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 636, in parse_data
    return parse_rdata_binary(view)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 652, in parse_rdata_binary
    return parser.parse_all()
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 267, in parse_all
    obj = self.parse_R_object()
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 343, in parse_R_object
    car = self.parse_R_object(reference_list)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 396, in parse_R_object
    value[i] = self.parse_R_object(reference_list)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 396, in parse_R_object
    value[i] = self.parse_R_object(reference_list)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 396, in parse_R_object
    value[i] = self.parse_R_object(reference_list)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/parser/_parser.py", line 355, in parse_R_object
    f"Length of CHAR cannot be {length}")
NotImplementedError: Length of CHAR cannot be -1
vnmabus commented 3 years ago

Interesting. A few weeks ago, in response to #1, I changed the code to assume that empty strings always have length 0 (I had -1 before, but I did not remember why, and in my tests the empty string has length 0). I have no problem changing it again to accept strings with length -1 as empty, but I would like to know (if you have that info) how that string came to be, in order to add a proper test.

VolodyaCO commented 3 years ago

I don't have that info, sorry. It can be a null value of some field, maybe?

vnmabus commented 3 years ago

Well, that is unfortunate, but let's solve your problem first. I have changed the code to accept length -1 as the empty string. Please let me know if your data can be imported now, and if any field is missing. I will try to investigate the -1 length later.

vnmabus commented 3 years ago

> I don't have that info, sorry. It can be a null value of some field, maybe?

It seems that you are right! R strings are nullable, and an NA string is stored as a string with length -1. I have changed the code in the branch to parse that as None in Python, and added a test for that. Please tell me if it works for you, so that I can merge the PR.
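
The rule can be sketched as follows. This is an illustration rather than the code in the branch: `read_char_payload` is a hypothetical helper, and it assumes the length is stored as a big-endian 32-bit integer immediately before the string bytes.

```python
import struct
from typing import Optional, Tuple

NA_STRING_LENGTH = -1  # length that marks an NA (missing) string


def read_char_payload(data: bytes, pos: int) -> Tuple[Optional[bytes], int]:
    """Read one length-prefixed string starting at pos.

    Returns (value, new_pos). An NA string yields None, an empty
    string yields b"", and any other negative length is an error.
    """
    (length,) = struct.unpack_from(">i", data, pos)
    pos += 4
    if length == NA_STRING_LENGTH:
        return None, pos
    if length < 0:
        raise ValueError(f"Length of CHAR cannot be {length}")
    return data[pos:pos + length], pos + length
```

Keeping NA (`None`) distinct from the empty string (`b""`) preserves the difference between a missing value and an empty one.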

VolodyaCO commented 3 years ago

It is parsing, but I cannot convert them apparently due to a unicode issue:

>>> import rdata
>>> parsed = rdata.parser.parse_file('Consulta.rda')
>>> converted = rdata.conversion.convert(parsed)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 617, in convert
    return SimpleConverter(*args, **kwargs).convert(data)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 454, in convert
    return self._convert_next(data)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 486, in _convert_next
    value = convert_list(obj, self._convert_next)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 75, in convert_list
    return {tag: conversion_function(r_list.value[0]), **cdr}
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 521, in _convert_next
    value = convert_vector(obj, self._convert_next, attrs=attrs)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in convert_vector
    conversion_function(o) for o in r_vec.value
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in <listcomp>
    conversion_function(o) for o in r_vec.value
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 521, in _convert_next
    value = convert_vector(obj, self._convert_next, attrs=attrs)
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in convert_vector
    conversion_function(o) for o in r_vec.value
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 161, in <listcomp>
    conversion_function(o) for o in r_vec.value
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 516, in _convert_next
    value = [self._convert_next(o) for o in obj.value]
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 516, in <listcomp>
    value = [self._convert_next(o) for o in obj.value]
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 502, in _convert_next
    default_encoding=self.default_encoding_used,
  File "/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py", line 207, in convert_char
    return r_char.value.decode("utf_8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: unexpected end of data
vnmabus commented 3 years ago

Ok, it seems that some of your strings are malformed (!?). I have changed the code to allow for those, only with a warning. Let me know if it works.

VolodyaCO commented 3 years ago

There are decoding errors (a lot of them, everywhere):

home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:180: UserWarning: Exception while decoding b'El entrevistado referencia sobre situaciones de mediaci\xf3n con LET ME MASK THIS para la entrega de LET ME MASK THIS para los cuales fue delegado por la Di\xf3cesis. Expresa sus percepciones sobre  el I ALSO MASK THIS que tuvo oportunidad de conocer con ocasi\xf3n del ejercicio de su sacerdocio.': 'utf-8' codec can't decode byte 0xf3 in position 55: invalid continuation byte
  f"Exception while decoding {byte_str!r}: {e}",

Sorry, I masked some info, but it's confidential. It seems that the decoding problems are due to Spanish accents, as in "mediación" and so on.

vnmabus commented 3 years ago

Yes, it seems that the encoding really is "latin1" but is wrongly marked. May I ask you to print the extra field of the RData object returned by parse_file and tell me what it is?
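
The symptom is easy to reproduce with a fragment of the byte string from the warning above. This small sketch only uses the standard `bytes.decode`; `try_decode` is a hypothetical helper.

```python
from typing import Optional

# A fragment of the byte string shown in the warning: "mediación"
# with the accented letter stored as the single byte 0xf3.
SAMPLE = b"mediaci\xf3n"


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Decode raw with the given encoding, or return None on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None
```

Here `try_decode(SAMPLE, "utf-8")` fails, while `"latin1"` and `"cp1252"` both recover the accented word, which is consistent with the strings being single-byte encoded but marked as UTF-8.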

vnmabus commented 3 years ago

BTW, maybe you can force latin1 encoding by calling convert with the additional argument default_encoding="latin1" (but even if that works I would like to know the encoding selected by default, in order to fix the problem).

VolodyaCO commented 3 years ago

Sorry for the delay. Was too busy this weekend.

The extra field is this one:

>>> parsed.extra
RExtraInfo(encoding='CP1252')

With the latin1 encoding I get errors of this sort:

/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:180: UserWarning: Exception while decoding b'C\xf3ndor': 'utf-8' codec can't decode byte 0xf3 in position 1: invalid continuation byte
  f"Exception while decoding {byte_str!r}: {e}",
vnmabus commented 3 years ago

Ok, the only way I see for that to happen is if your strings are marked with the wrong encoding. I have added an option (force_default_encoding) to ignore the encoding of each particular string and always use the default encoding. Please try setting that to True and tell me if the problem persists.

VolodyaCO commented 3 years ago

Now I get this:

converted = rdata.conversion.convert(parsed, force_default_encoding=True)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:180: UserWarning: Exception while decoding b'R\xcd\x8dO BRAVO': 'charmap' codec can't decode byte 0x8d in position 2: character maps to <undefined>
  f"Exception while decoding {byte_str!r}: {e}",
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "POSIXct". The constructor for class "POSIXt" will be used instead.
  stacklevel=1)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "POSIXt". The underlying R object is returned instead.
  stacklevel=1)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "tbl_df". The constructor for class "tbl" will be used instead.
  stacklevel=1)
/home/vladimir/.local/share/virtualenvs/ComisionDeLaVerdad-FivqEOe7/lib/python3.7/site-packages/rdata/conversion/_conversion.py:590: UserWarning: Missing constructor for R class "tbl". The constructor for class "data.frame" will be used instead.
  stacklevel=1)

However, the data can now be accessed. Thanks a lot for your help.
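
For the few strings that still fail (such as the one with byte 0x8d, which CP1252 leaves undefined), one possible workaround is to post-process the raw bytes with a fallback chain. This is a sketch of that idea, not something rdata does; `decode_with_fallback` is a hypothetical helper.

```python
from typing import Sequence


def decode_with_fallback(
    raw: bytes,
    encodings: Sequence[str] = ("cp1252", "latin1"),
) -> str:
    """Try each encoding in order and return the first successful result.

    latin1 maps every byte to a character, so with the default chain the
    final replacement step is never reached; it only matters when a
    caller passes a stricter list of encodings.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the ASCII part and mark everything else.
    return raw.decode("ascii", errors="replace")
```

Note that a latin1 fallback always succeeds but may produce control characters for bytes such as 0x8d, so the result should still be reviewed.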

vnmabus commented 3 years ago

If you want to implement a more precise conversion for a custom R class, you can use your own conversion routine. Check the documentation for that: https://rdata.readthedocs.io/en/latest/simpleusage.html#convert-custom-r-classes
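
As an illustration for the "POSIXct" warning above, a conversion routine could look roughly like this. The `(obj, attrs)` signature and the way the function is registered are assumptions on my side, so check the linked documentation for the exact interface. The sketch also assumes the values are seconds since the Unix epoch and ignores any time zone attribute.

```python
import math
from datetime import datetime, timezone


def posixct_constructor(obj, attrs):
    """Turn POSIXct values (seconds since the epoch) into UTC datetimes.

    obj is assumed to be an iterable of floats and attrs a mapping of
    R attributes, which this sketch ignores. Missing values, stored as
    NaN, become None.
    """
    return [
        None if math.isnan(seconds)
        else datetime.fromtimestamp(seconds, tz=timezone.utc)
        for seconds in obj
    ]
```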

vnmabus commented 3 years ago

I have released a new version (0.3) including these changes, so you no longer need to install the development version. Thank you for helping me resolve these issues!