Preview for text files in version 1.6.0

NataliaBondarenko commented 4 years ago

I propose that we fully resolve the text file preview issue in this version and leave the preview for binaries until the next version of the package.

Option 1
1) Try to open all files in text mode.
2) Show a preview of the file's text, if possible. Otherwise, show an informational or error message.

Option 2: skip some known binaries, for example files with known signatures.

# short list of known binaries
binaries = ['gz', 'jpg', 'png', 'mp3', ...]

def generate_preview(filepath: str, max_size: int = 390) -> str:
    extension = get_file_extension(filepath, case_sensitive=False).lower()

    if extension in binaries:
        # skip the extension in version 1.6.0
        return "[A preview of this file type is not yet implemented.]"
    else:
        # try to open other files in text mode
        excerpt = generic_text_preview(filepath, max_size)
        if excerpt:
            # excerpt holds either the text preview or an error string
            return excerpt
        else:
            return "[This file may be empty.]"
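Since Option 2 mentions files with known signatures, the check could also look at the first bytes of a file instead of relying only on its extension. The following is a minimal sketch under that assumption; `has_known_binary_signature` and the short `KNOWN_SIGNATURES` table are hypothetical names for illustration and are not part of Count-files.

```python
# Hypothetical helper, not part of Count-files: detect a few well-known
# binary formats by their leading "magic" bytes.
KNOWN_SIGNATURES = {
    b'\x1f\x8b': 'gz',
    b'\xff\xd8\xff': 'jpg',
    b'\x89PNG\r\n\x1a\n': 'png',
    b'ID3': 'mp3',  # only MP3 files that start with an ID3v2 tag
}


def has_known_binary_signature(filepath: str) -> bool:
    """Return True if the file starts with one of the known signatures."""
    longest = max(len(signature) for signature in KNOWN_SIGNATURES)
    try:
        with open(filepath, mode='rb') as f:
            header = f.read(longest)
    except OSError:
        # unreadable files are not classified here; the caller reports the error
        return False
    return any(header.startswith(signature) for signature in KNOWN_SIGNATURES)
```

A signature check like this still misses many formats, so it would complement the extension list rather than replace it.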

victordomingos commented 4 years ago

I would rather release 1.6 with the current feature set and then make 1.7 the one with the file previews, if that's OK with you. I still need to check what's missing from the documentation with regard to the latest changes, and make sure the translations are in sync.

NataliaBondarenko commented 4 years ago

Hello! The CLI help was updated in the previous PR. The English docs were last updated for search by pattern (the --filename-match argument). This is a priority task. The other docs are more outdated, but few people will notice: in terms of traffic, people do not often view these pages.

TODO:

add tags to generate documentation on Read the Docs. Tags are needed in this repository on commits https://github.com/victordomingos/Count-files/pull/86/commits/f9559b6e0969ca43320f0773e2bf306c58a5e85e (1.4.0) and https://github.com/victordomingos/Count-files/pull/107/commits/e94d15229a4761ba1795446e2cbe282c3c73914c (1.5.0). Could you add tags for the corresponding versions?
run tests for this version and update the list of tested operating systems

NataliaBondarenko commented 4 years ago

Also:

add version branches, similar to Django, so that we can fix errors for active versions (https://github.com/victordomingos/Count-files/issues/118) and support incompatible active versions/extensions.

NataliaBondarenko commented 4 years ago

Previewing text files is an old issue. Features that were not even planned were added to this version. Why postpone the preview solution?

victordomingos commented 4 years ago

Ok. Let’s improve the preview for text files for 1.6 and leave binary formats for later. Special care must be taken in deciding which files are binary or text, and in handling any exceptions properly.

Regarding branches, until now all versions were intended to be backwards compatible, so it made some sense to fix any bugs in the next update within the same branch. Our public releases are published on PyPI, not on GitHub’s development repo. When we decide to switch to v2.x, then yes, we must keep a separate v1.x branch for bug fixes.

I believe I have added tags for all previous releases, could you please check again? I missed one release, so I added a new tag recently. Maybe that’s the one you were referring to?

With regard to tests, I can test on macOS Catalina, iOS/Pythonista, Haiku R1/beta2 and maybe a few virtual machines. The last time I tested on macOS, I got one failing test. I believe it has something to do with the creation of a comparison file. You have already explained that to me, but I confess I can’t remember the details. I will submit an issue to see if you are able to help, ok?

Finally, keeping documentation in sync across different languages can easily become a mess. I would like to find some sort of technical solution to help keep them synchronized, but I am not sure what the best solution is. I know there are some specialized web apps, like Pootle, which I have used for Haiku, but that would require setting up a server and would probably incur some costs. I have heard of GlobalSight and OmegaT, but I haven’t tried either of those yet.

NataliaBondarenko commented 4 years ago

Hello! I have updated the preview for text files.
New branch: https://github.com/NataliaBondarenko/Count-files/tree/textpreview/count_files
This version allows us to extend the preview capabilities without external dependencies.

I am proposing this version for discussion; it has its pros and cons. What do you think?

Updated `def generic_text_preview`

https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_preview.py#L12

Added encoding in open(filepath, mode='r', encoding='utf-8').

UTF-8 is one of the most commonly used encodings (see the w3techs.com stats). It also has several convenient properties, which are described in the Unicode HOWTO on docs.python.org.

Also, this encoding renders text with mixed characters (like Cyrillic and Latin) quite correctly. I tried this with the README files in the repository as well as a Japanese text file.

The previous version of this function used open(filepath, mode='r'). From the docs: "If encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding." This option is kept as a fallback for opening files. First, we try to open the file with encoding='utf-8'. If this fails with UnicodeDecodeError, we try to open the file with the user's preferred encoding.
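The fallback order described above can be sketched roughly as follows. This is an illustration, not the code from the textpreview branch: the function name `read_text_excerpt` and the bracketed messages are assumptions made for the example.

```python
import locale


def read_text_excerpt(filepath: str, max_size: int = 390) -> str:
    """Return up to max_size characters of the file, or a bracketed message."""
    # try UTF-8 first, then the platform's preferred encoding as a fallback
    for encoding in ('utf-8', locale.getpreferredencoding(False)):
        try:
            with open(filepath, mode='r', encoding=encoding) as f:
                return f.read(max_size)
        except UnicodeDecodeError:
            continue  # decoding failed, try the next encoding
        except OSError as error:
            return f'[Cannot open this file: {error}]'
    return '[This file cannot be previewed as text.]'
```

Reading inside the `try` block matters here, because a decoding error is raised when the data is read, not when the file is opened.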

Added new `shell-command` argument to Search group

https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/help_text.py#L397

The idea is to use the Unix "file" command, a file type detector (wiki). Using this program allows the CLI to detect text files with or without an extension and display a preview of those files. The file type is determined by running this command through the subprocess module; in essence, the CLI reads the output of $ file /path/to/file.ext.
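As a rough sketch of that idea, the output of the "file" command can be captured with `subprocess.run`. The helper names `describe_file_type` and `looks_like_text` are hypothetical, and checking for the word "text" in the description is a simple heuristic assumed for this example, not necessarily what the branch does.

```python
import subprocess
from typing import Optional


def describe_file_type(filepath: str, command: str = 'file') -> Optional[str]:
    """Return the output of `file -b <filepath>`, or None on failure."""
    try:
        completed = subprocess.run(
            [command, '-b', filepath],  # -b: brief mode, omit the file name
            capture_output=True, text=True, timeout=5, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # command missing, not executable, timed out, or exited with an error
        return None
    return completed.stdout.strip()


def looks_like_text(description: Optional[str]) -> bool:
    """Heuristic: treat a description that mentions 'text' as a text file."""
    return description is not None and 'text' in description.lower()
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.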

Depending on whether this program is available, we can create a preview with different functions: def generate_preview_with_file (https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_preview.py#L91) or def generate_preview (https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_preview.py#L61). With generate_preview, the preview is only available for files with certain extensions, but this function can be used on all operating systems.

I have added two functions to check if the Unix "file" command is available and works as expected. https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_handlers.py#L78
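A check of this kind might run the command against a small file whose type is known in advance. The sketch below is only an assumption about how such a check could look; `file_command_works` is a hypothetical name, and the functions in the linked file_handlers.py may work differently.

```python
import os
import subprocess
import tempfile


def file_command_works(command: str = 'file') -> bool:
    """Run the command on a small known text file and check the answer."""
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as sample:
        sample.write('plain text sample\n')
    try:
        completed = subprocess.run(
            [command, '-b', sample.name],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return False  # command not found, not executable, or timed out
    finally:
        os.remove(sample.name)  # always clean up the sample file
    return completed.returncode == 0 and 'text' in completed.stdout.lower()
```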

This utility processes files quite quickly. The "file" command is a standard program on Unix and Unix-like operating systems. It has also been ported to Windows and can be used there, for example, if the user installed it along with Git (https://git-scm.com) and added it to the PATH environment variable. I don't see anything like this for Haiku and StaSh. Thus, the use of this argument is limited to desktop operating systems such as Linux, macOS, and Windows.

If this version is appropriate, there will be no more significant changes for v1.6.

TODO:

make some tests
update documentation
update supported types https://github.com/victordomingos/Count-files/issues/123

NataliaBondarenko commented 4 years ago

> Ok. Let’s improve the preview for text files for 1.6 and leave binary formats for later.

I think we can preview some binaries using the Python standard library, for example when you want a list of files with the same extension: count-files --file-extension ext_name --preview. For all files in a directory (--file-extension ..), choosing the correct function and processing the files can slow down the program.

> Special care must be taken in deciding which files are binary or text, and in handling any exceptions properly.

Determining which files are binary or text files is difficult. To increase the likelihood of correctly detecting the file type, we can use OS utilities. I already mentioned the "file" command in the comment above.

> Regarding branches, until now all versions were intended to be backwards compatible, so it made some sense to fix any bugs in the next update within the same branch. Our public releases are published on PyPI, not on GitHub’s development repo. When we decide to switch to v2.x, then yes, we must keep a separate v1.x branch for bug fixes.

Ok. It makes sense to me.

> I believe I have added tags for all previous releases, could you please check again? I missed one release, so I added a new tag recently. Maybe that’s the one you were referring to?

The documentation for those versions changed after these tags were created: I later made small clarifications to the documentation text, not to the code itself. The existing tags do not cover several pull requests.

> With regard to tests, I can test on macOS Catalina, iOS/Pythonista, Haiku R1/beta2 and maybe a few virtual machines.

I have Windows and Linux.

> The last time I tested on macOS, I got one failing test. I believe it has something to do with the creation of a comparison file. You have already explained that to me, but I confess I can’t remember the details. I will submit an issue to see if you are able to help, ok?

Comparison files are generated automatically in the latest tests. It might be an old test file.

> Finally, keeping documentation in sync across different languages can easily become a mess. I would like to find some sort of technical solution to help keep them synchronized, but I am not sure what the best solution is. I know there are some specialized web apps, like Pootle, which I have used for Haiku, but that would require setting up a server and would probably incur some costs. I have heard of GlobalSight and OmegaT, but I haven’t tried either of those yet.

I suggest maintaining only English documentation (Read The Docs and README) after v1.6.

victordomingos commented 4 years ago

Hi! I had a quick look over your new branch and it seems a nice improvement indeed. Thanks.

As usual, documentation must be clear about availability issues and IMO it should also include some guidance on how to get it to work on Windows.

> This utility processes files quite quickly. The "file" command is a standard program on Unix and Unix-like operating systems. It has also been ported to Windows and can be used there, for example, if the user installed it along with Git (https://git-scm.com) and added it to the PATH environment variable. I don't see anything like this for Haiku and StaSh. Thus, the use of this argument is limited to desktop operating systems such as Linux, macOS, and Windows.

Actually, I believe we can also count on the "file" command being available on Haiku.

iOS/StaSh has no file binary, so in this case we must make sure that a proper message is given to the user.

With regard to multilingual documentation, I haven't given up on it yet. The English version will be the master, but any changes should be properly identified so that the translators know where to look. I intend to keep maintaining at least the Portuguese translation (it can be kept in that single markdown file).

NataliaBondarenko commented 4 years ago

> As usual, documentation must be clear about availability issues and IMO it should also include some guidance on how to get it to work on Windows.

> Actually, I believe we can also count on the "file" command being available on Haiku.

Currently, the command availability check is limited to specific operating systems (win, linux, darwin); see https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_handlers.py#L130
This limitation can be removed, so that we can try the "file" command on any operating system.

The --shell-command argument can take either a command name or the path to an executable file.

--shell-command file or --shell-command /path/to/file
This can be useful on systems where the "file" command is not standard: you can install the program and use it without adding it to your PATH environment variable.
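For illustration, an argument that accepts either form could be declared with argparse roughly like this. The metavar, default, and help text are assumptions made for the sketch; the actual definition in the textpreview branch may differ.

```python
import argparse

# Sketch only: the real parser in Count-files defines many more arguments.
parser = argparse.ArgumentParser(prog='count-files')
parser.add_argument(
    '--shell-command',
    metavar='COMMAND',
    default=None,
    help='name of, or path to, a file type detector such as "file"',
)

args = parser.parse_args(['--shell-command', '/path/to/file'])
print(args.shell_command)  # prints "/path/to/file"
```

Because the value is kept as a plain string, the same code path can later treat it as a command name or as a path to an executable.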

> With regard to multilingual documentation, I haven't given up on it yet. The English version will be the master, but any changes should be properly identified so that the translators know where to look. I intend to keep maintaining at least the Portuguese translation (it can be kept in that single markdown file).

A shorter version of the documentation in one markdown file for each language?

victordomingos commented 4 years ago

> Currently, the command availability check is limited to specific operating systems (win, linux, darwin); see https://github.com/NataliaBondarenko/Count-files/blob/textpreview/count_files/utils/file_handlers.py#L130
> This limitation can be removed, so that we can try the "file" command on any operating system.
>
> The --shell-command argument can take either a command name or the path to an executable file.
>
> --shell-command file or --shell-command /path/to/file
> This can be useful on systems where the "file" command is not standard: you can install the program and use it without adding it to your PATH environment variable.

This information may be useful, especially the shutil.which(command) part:

https://stackoverflow.com/questions/11210104/check-if-a-program-exists-from-a-python-script
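A minimal sketch of how `shutil.which` could back such a check, including a message for systems without the command. The names `resolve_command` and `availability_message`, and the message wording, are hypothetical and only serve the example.

```python
import shutil
from typing import Optional


def resolve_command(command: str) -> Optional[str]:
    """Return the path of an executable, or None if it is not usable."""
    # shutil.which searches PATH for a bare name; a value that contains a
    # directory component is checked directly instead
    return shutil.which(command)


def availability_message(command: str) -> str:
    """Build a short status message about the requested command."""
    path = resolve_command(command)
    if path is None:
        return f'[The "{command}" command is not available on this system.]'
    return f'Using {path}'
```

This covers both forms of the --shell-command value, a command name or a path, without any platform-specific branching.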

> With regard to multilingual documentation, I haven't given up on it yet. The English version will be the master, but any changes should be properly identified so that the translators know where to look. I intend to keep maintaining at least the Portuguese translation (it can be kept in that single markdown file).

> A shorter version of the documentation in one markdown file for each language?

I am not sure if we can make it much shorter without leaving some features undocumented, but we may consider keeping it in a single file if it helps. At this time, we have that situation in Portuguese (a short Readme and a longer single-file documentation). The simplest workflow (not necessarily the best one, though) would be to go back to a single file per language, merging the readme and the documentation. That would leave us with a single documentation file for each language.

Now, the most important bit, IMHO, is to establish a workflow. For instance, whenever the user interface changes (e.g. a feature is added or removed, or gets a new behaviour), the developer could also open a new issue listing the changes that need to be reflected in the documentation. If possible, the English version should be updated together with the code pull request itself, so that at least the English documentation is always up to date. The issue tracker would let us keep track of any sections that need to have their translation updated. What do you think?
