swcarpentry / shell-novice

The Unix Shell
http://swcarpentry.github.io/shell-novice/

Ep6: inaccurate description of docx files in the text file example #1000

Closed MontrealSergiy closed 5 years ago

MontrealSergiy commented 5 years ago

The lesson describes ".docx files" as storing "not only text, but also formatting information about fonts, headings, and so on. This extra information isn’t stored as characters, and doesn’t mean anything to tools like head".

This is not completely accurate. Docx files do store that formatting information as characters, but the files are zipped: a .docx file is XML, zipped to take less space. I think it would be better to use executable binaries, photos, video, or audio as the example.

MontrealSergiy commented 5 years ago

https://www.forensicswiki.org/wiki/Word_Document_(DOCX)

gcapes commented 5 years ago

Thanks for the issue.

I'm not sure there's anything wrong with the current explanation. A zip file is not the same as a plain text file. Certainly the key points are that these docx files don't mean anything to commands like head, and as such you shouldn't use them to write shell scripts. Try this:

Create a test .docx file, then run head on it.

Downloads $  head test.docx 
PK���N
      _rels/.rels��MKA
                       ���C��l+����"Bo"�������3i���A
�P��Ǽy���m���N���AêiAq0Ѻ0jx�=/`�/�W>��J�\*�ބ�aI���L��41q��!fOR�<b"���qݶ��2��1��j�[���H�76z�$�&f^�\��8.Nyd�`�y�q�j4�
                                         x]h�{�8
                                                ��S4G�A�y�Y8X���(�[Fw�i4o|˼�l�^��͢����P��#�=PK���NdocProps/app.xml���n� ��}

�C@1KK�"��Q��˘'�&g�B/1��P��ONz�Ho+(��=��\��#�����/�Wg�x�W��NN��ʡ�*Ӄ`�x����(�i#bc>��T�"g�]>�l�=���eQ�l���@!e�g�����t��Ƿ��(f�^�sAG���:��V�0�����Y����IL�h|��eT׮i��oݠ�V-�P��P"PK���NdocProps/core.xml�R]O�0}�W,}ߺ
O�R�-�9#i�2��V%�AL�ґ�Jsz�W��^������P��I+�e����¿E��D2RV2t�f�UJUB+
����0�̙Rp{PpQړ�zo� ��:�G�������s;��e�*%�X�C_
;�G����b������dH5�,��Kw�56?8��$'����z�vq���=30Tse�
;���<n����l?>��n���r[B���?�1�P��:�c�PK���Nword/_rels/document.xml.rels��M
�0���"�ަU��nDp+�1���6 �(z{�Z(����}�1/__��]�m��,I��Q�Ҧp(��%��I��NR\  �v���������������?@��������I��wP�/0��PK���Nword/settings.xmlEO�N�@��>.?��|m
                                                                  ����tC���T\z�F�7qӕv���4��c��C3��=��BU��w��I�������#4j�3&a��v�M�vJf���X�u��Y�B��Lu#�ص�Ԍ��.a�:�*�z4��mۇ�12�^�)���+T'b�9m
                                    �[���o�x|1)n�`�~�'O�
                                                         �؇�v�+h/l���5՞4�/;�9:
?����
     P�|�>�PK���Nword/fontTable.xml�PAN�0��
�w�����U%�  �@��d�#�I��q�DB�B�f�����r����0��T����I���S%�����Rp:����<#��:��o�V�� �o8�އc�F����ZŃr`H2�72����'��R����ܘ��jH'�����{���{��P&���Ӂ�dQH���{�!�3К��q�A0p�x���������Nz-n��I�i�ɳ�7��z1
                                              �l��`��
6n:�|�[M%�ߺ��ɀx*ص��3<x�    P��J�UPK���Nword/document.xml�T�n�0��+�mIM�B�\�=40`�h���r�pd����-
`����[F�_�NN�r�d�:c���U�6%�u���gI@n+���%;��7_��r�3�bB
                                                          6�d�"�VVF p�ո�����A�peg�<�"p��;�5�����[#�\���ƚl�ӻ�_���۷�˅�����OlF\��p���81OO�����%�b*�����3�?�x~�(~4�?jD�DwlCw���s\��k�e�g-��8q]���Z�4"
                                               �e���9�*$��  ʀ¡`h�EA
��~V��w����o�ӟ���|����w�7�S�O���5Ƣ��X�i_m[�+I�6�t~Aj�pF�љl:�Q��3��jm���B�Ӊ�c�5�a
�����iT�����p
          s��5�4FZY�S(b���Q���
                               9�J�9��M����/Pj��X�*PK���Nword/styles.xmlŕaO�0���WD�^b�"b놺i�pu.��c{�C(�~��tmӎ�1�K�׳�����]<U<zDm�)9<HH��ʜ�iJ���
                                                                      NId,����rq���;�h"�/̰IIi�Ʊ�%V`�B�b��X��Ӹ�:WZR4�m_��(IN�
� �6�ǽ�*F�4��TV�,

What you get is mostly not ASCII characters, and certainly isn't the first ten lines of what you wrote.
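
The leading PK in that output is the signature that ZIP archives start with, so one way to tell such a container apart from plain text before running head on it is to check the first four bytes. A minimal sketch, assuming a POSIX-like shell with head, od, and tr available; the function name is_zip_container is made up for illustration and is not part of the lesson:

```shell
# Hypothetical helper: succeed if the file starts with the ZIP
# local-file-header signature (bytes 50 4b 03 04, i.e. "PK\003\004").
is_zip_container() {
    # Render the first four bytes as a hex string such as "504b0304".
    sig=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "504b0304" ]
}

# Demo on a stand-in file that contains only the signature bytes.
demo_file=$(mktemp)
printf 'PK\003\004' > "$demo_file"
if is_zip_container "$demo_file"; then
    echo "zip container"
else
    echo "not a zip container"
fi
rm -f "$demo_file"
```

Running is_zip_container test.docx on a real Word file should succeed in the same way, while a plain text file should fail the check.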

Thanks again for raising the issue, and do let me know if I've missed something.

gdevenyi commented 5 years ago

I agree with @gcapes here. For the purposes of this lesson, the description is accurate: the tools we are covering can't handle the file as text.

Thanks for your suggestions!

MontrealSergiy commented 5 years ago

I disagree with the statement that formatting details such as font sizes are NOT encoded as characters. They are encoded as characters, even though the whole Docx (Office Open XML) document is additionally compressed into zip format. In fact, some advanced text editors such as vim can edit zipped text documents. head, cat, tail, sed, and similar tools can be used too, with the aid of zip/unzip filters or pipes.
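
A rough sketch of the kind of pipe meant here, assuming the unzip command is installed and that the main text lives in the archive member word/document.xml (the name visible in the head output earlier in this thread). The function name docx_head and the 300-byte default are made up for illustration:

```shell
# Hypothetical helper: print the first bytes of the main XML part of a .docx.
# $1 is the .docx file; $2 is an optional byte count (defaults to 300).
docx_head() {
    # unzip -p writes the named archive member to standard output,
    # so ordinary text tools can read it through a pipe.
    unzip -p "$1" word/document.xml | head -c "${2:-300}"
}

# Example usage, with test.docx being the file created earlier in the thread:
# docx_head test.docx
# docx_head test.docx 1000
```

The sketch uses head -c rather than a line count because the XML in such files is often written as a few very long lines, so counting lines may print far more than intended.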