pattersonkl / google-refine

Automatically exported from code.google.com/p/google-refine

Using a URL with the importer should look at the file extension if mime-type is application/octet-stream #297

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Use a simple web server, such as `python -m SimpleHTTPServer`, to serve files 
from a directory over HTTP on localhost.
2. Try importing the URL http://localhost:8000/filename.tar.gz where 
filename.tar.gz is an archive file of any type.

What is the expected output? What do you see instead?

You expect the archive type to be detected, and the files within to be 
extracted and displayed properly. Instead, you see a number of columns/rows 
filled with random characters like this:

���"-VOVP��n��k�u�q�>3�g��ާ��A�}}mJ#���
[further lines of similar binary data omitted]

I suspect the reason is that many simple HTTP servers (I tried two) send 
application/octet-stream as the MIME type for archive files. It would be better 
if the importer fell back to the file extension to detect the file type in this 
case.
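
A rough sketch of the fallback being requested, written in Python for brevity. It is not Refine's actual importer code: the function name `guess_format`, the suffix table, and the format labels are all hypothetical, chosen only to illustrate the idea.

```python
import posixpath
from urllib.parse import urlparse

# MIME types that say nothing useful about the payload.
GENERIC_TYPES = {"", "application/octet-stream"}

# Hypothetical suffix table; a real importer would have its own list.
ARCHIVE_SUFFIXES = {
    ".tar.gz": "tar+gzip",
    ".tgz": "tar+gzip",
    ".tar.bz2": "tar+bzip2",
    ".tar": "tar",
    ".zip": "zip",
    ".gz": "gzip",
}


def guess_format(url, content_type):
    """Trust a specific Content-Type; otherwise fall back to the URL's extension."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime not in GENERIC_TYPES:
        return mime
    name = posixpath.basename(urlparse(url).path).lower()
    # Check the longest suffixes first so ".tar.gz" wins over ".gz".
    for suffix in sorted(ARCHIVE_SUFFIXES, key=len, reverse=True):
        if name.endswith(suffix):
            return ARCHIVE_SUFFIXES[suffix]
    return None  # nothing recognizable; the caller decides what to do
```

With this sketch, `guess_format("http://localhost:8000/filename.tar.gz", "application/octet-stream")` falls through to the extension check, while a specific type such as `text/csv` is returned unchanged.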

What version of the product are you using? On what operating system?

Refine 2.0 on CentOS 5.5.

Original issue reported on code.google.com by ehsanul...@gmail.com on 29 Dec 2010 at 3:34

GoogleCodeExporter commented 8 years ago
The problem is more likely simple webmasters than simple web servers. Even 
Python's SimpleHTTPServer takes a file extension -> MIME type map: 
http://docs.python.org/library/simplehttpserver.html
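
For illustration, here is roughly how that map could be extended on the server side. This is a sketch assuming Python 3, where the module is `http.server` rather than `SimpleHTTPServer`. The class name `ArchiveAwareHandler`, the `serve` helper, and the choice of MIME strings are assumptions made for the example, and the thread does not confirm which types the importer recognizes.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class ArchiveAwareHandler(SimpleHTTPRequestHandler):
    # Copy the stock map so the base class is left untouched,
    # then add explicit entries for common archive extensions.
    # Only the last extension is looked up, so "filename.tar.gz" maps via ".gz".
    extensions_map = dict(SimpleHTTPRequestHandler.extensions_map)
    extensions_map.update({
        ".gz": "application/gzip",
        ".tgz": "application/gzip",
        ".zip": "application/zip",
    })


def serve(port=8000):
    """Serve the current directory on localhost until interrupted."""
    HTTPServer(("localhost", port), ArchiveAwareHandler).serve_forever()
```

Calling `serve()` from the directory that holds the archive would then send the mapped type instead of the generic fallback.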

I'm not sure what the best solution is here.  One could go from file 
extensions, to trying to sniff the beginning bytes of the file, to ...  It 
seems like a slippery slope, but perhaps one worth venturing a little ways down.
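
As a sketch of what sniffing the first bytes could look like, the snippet below checks a few well-known signatures: `1f 8b` for gzip, `BZh` for bzip2, `PK\x03\x04` for a zip local file header, and `ustar` at offset 257 for POSIX tar. The function name `sniff_format` and the returned labels are hypothetical, and the signature list is deliberately incomplete (an empty zip file, for instance, starts with a different marker).

```python
# Leading-byte signatures for a few common archive formats.
MAGIC_PREFIXES = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"PK\x03\x04", "zip"),
]


def sniff_format(head):
    """Guess an archive format from the first bytes of a file, or return None."""
    for magic, name in MAGIC_PREFIXES:
        if head.startswith(magic):
            return name
    # POSIX tar stores the string "ustar" at byte offset 257 of the first header.
    if head[257:262] == b"ustar":
        return "tar"
    return None
```

Reading the first 512 bytes of the download is enough for all four checks, so this could run before the importer commits to treating the data as text.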

Original comment by tfmorris on 29 Dec 2010 at 6:53