victordomingos / Count-files

A CLI utility written in Python to help you count files, grouped by extension, in a directory. By default, it will count files recursively in the current working directory and all of its subdirectories, and will display a table showing the frequency of each file extension (e.g. .txt, .py, .html, .css) and the total number of files found.
https://no-title.victordomingos.com/projects/count-files/
MIT License
23 stars 9 forks

Suggestion: add sorting of file extensions by type to Count group #117

Open NataliaBondarenko opened 4 years ago

NataliaBondarenko commented 4 years ago

Sort extensions in Counter by type, for example:

image - jpg, png, gif, bmp...
video - mp4, avi, flv, 3gp...
audio - mp3, wav, ogg...
archive - zip, tar, rar, gz...
and so on

This can be either a table with sections, or for each type separately.

victordomingos commented 4 years ago

The group by type is interesting. Doing tables without external dependencies can be a[n unnecessary] challenge. I wouldn't mind having an optional pure-python dependency, if it helps.

NataliaBondarenko commented 4 years ago

We already have a table.

Here are two options for displaying data:

from collections import Counter

# create types map, common types
types2 = {
    'jpg': 'images',
    'png': 'images',
    'mp4': 'videos',
    'gz': 'archives',
    'zip': 'archives',
    'py': 'python files',
    'pyc': 'python files',
    'pyw': 'python files',
    'woff': 'fonts',
    'bat': 'executables',
    'txt': 'documents',
    'md': 'documents',
    'xml': 'documents',
    'rst': 'documents',
    'json': 'documents',
    'html': 'documents',
    'css': 'documents',
    'xlsx': 'documents'
    }

# input data
c = Counter(
    {'PYC': 38, 'TXT': 27, 'PY': 23, 'MD': 15,
     '[no extension]': 9, 'PNG': 8, 'XML': 5,
     'RST': 4, 'GZ': 3, 'IML': 1, 'BAT': 1,
     'HTML': 1, 'JSON': 1, 'CSS': 1,
     'WOFF': 1, 'DAT': 1, 'XLSX': 1}
    )
# result template
result = {
    'archives': [],
    'documents': [],
    'executables': [],
    'fonts': [],
    'images': [],
    'python files': [],
    'videos': [],
    'other': []
    }
# update result template with (extension, freq)  
for k, v in c.items():
    item_type = types2.get(k.lower(), 'other')
    result[item_type].append((k, v))

#import pprint
#pprint.pprint(result, indent=2)
#print()

print('------------ COMPACT ------------')
for group, items in result.items():
    if not items:
        continue  # skip groups with no matching extensions
    total_occurrences = sum(freq for _, freq in items)
    print()
    print(f'{group.upper()}({total_occurrences})')
    for ext, freq in items:
        print(f'{ext}: {freq}')
print(f'\nFound {sum(c.values())} file(s).\n')

print('------------- TABLES ------------\n')
from typing import List, Optional
from textwrap import wrap
from count_files.settings import TERM_WIDTH, DEFAULT_EXTENSION_COL_WIDTH, DEFAULT_FREQ_COL_WIDTH, MAX_TABLE_WIDTH

def show_2columns(data: List[tuple],
                  max_word_width: int, total_occurrences: int,
                  term_width: int = TERM_WIDTH, title: Optional[str] = None):
    """Displays a sorted table with file extensions.

    :param data: list with tuples
    default in uppercase: [('TXT', 24), ('PY', 17), ('PYC', 13), ...]
    with --case-sensitive as is: [('txt', 23), ('py', 17), ('pyc', 13), ...]
    :param max_word_width: the longest extension name
    :param total_occurrences: total number of files found
    :param term_width: the size of the terminal window
    :param title: optional group name; if given, a titled per-group table
    is printed without the TOTAL row
    :return: None; the processed data is printed to the screen.
    """
    if not data:
        print("Oops! We have no data to show...\n")
        return

    max_word_width = max(DEFAULT_EXTENSION_COL_WIDTH, max_word_width)
    freq_col_width = max(DEFAULT_FREQ_COL_WIDTH, len(str(total_occurrences)))
    ext_col_width = min((term_width - freq_col_width - 5),
                        max_word_width,
                        MAX_TABLE_WIDTH)
    header = f" {'EXTENSION'.ljust(ext_col_width)} | {'FREQ.'.ljust(freq_col_width)} "
    sep_left = (ext_col_width + 2) * '-'
    sep_center = "+"
    sep_right = (freq_col_width + 2) * '-'
    sep = sep_left + sep_center + sep_right
    if title:  # start of type table
        print(f'{title.upper()}({total_occurrences})'.ljust(ext_col_width))
        print(sep)
    print(header)
    print(sep)

    for word, freq in data:
        if len(word) <= ext_col_width:
            print(f" {word.ljust(ext_col_width)} | {str(freq).rjust(freq_col_width)} ")
        else:
            head = f" {word[0: ext_col_width]} | {str(freq).rjust(freq_col_width)}"
            word_tail = wrap(word[ext_col_width:],
                             width=ext_col_width,
                             initial_indent=' ' * 2,
                             subsequent_indent=' ' * 2)
            print(head)
            for line in word_tail:
                print(f" {line.ljust(ext_col_width)} | {' '.rjust(freq_col_width)}")
    if title:  # end of type table
        print(sep + "\n")
        return
    print(sep)
    line = f" {'TOTAL:'.ljust(ext_col_width)} | {str(total_occurrences).rjust(freq_col_width)} "
    print(line)
    print(sep + "\n")
    return

max_word_width = max(map(len, c.keys()))  # same for all tables (ext column width)
for k, v in result.items():
    if v:
        total_occurrences = sum(x[1] for x in v)  # for each table (freq column width)
        show_2columns(data=v, max_word_width=max_word_width, total_occurrences=total_occurrences, term_width=TERM_WIDTH, title=k)
print(f'Found {sum(c.values())} file(s).')

victordomingos commented 4 years ago

The first one looks a bit cleaner, IMHO. I would add a 2 or 4 spaces indentation level for extensions. It needs to use text justification for better alignment.
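
A minimal sketch of what the indentation and justification suggested above might look like for the compact listing. The function name `format_compact` and the sample data are illustrative assumptions, not code from the count-files repository.

```python
def format_compact(groups, indent=4):
    """Return the compact listing as a list of lines.

    groups maps a group name to a list of (extension, frequency) tuples.
    """
    rows = [row for items in groups.values() for row in items]
    if not rows:
        return []
    # Widths are computed over all groups so the columns line up everywhere.
    ext_width = max(len(ext) for ext, _ in rows)
    freq_width = max(len(str(freq)) for _, freq in rows)
    lines = []
    for name, items in groups.items():
        if not items:
            continue  # skip empty groups
        total = sum(freq for _, freq in items)
        lines.append(f'{name.upper()} ({total})')
        for ext, freq in items:
            # Extensions are left-justified, frequencies right-justified.
            lines.append(f"{' ' * indent}{ext.ljust(ext_width)}  {str(freq).rjust(freq_width)}")
        lines.append('')
    return lines


# Hypothetical sample data in the same shape as the `result` dictionary above.
groups = {
    'documents': [('TXT', 27), ('MD', 15)],
    'python files': [('PYC', 38), ('PY', 23)],
    'videos': [],
}
print('\n'.join(format_compact(groups)))
```

Returning lines instead of printing directly keeps the formatting easy to test and leaves the choice of indentation (2 or 4 spaces) to the caller.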

NataliaBondarenko commented 4 years ago

Yes, the first list is easier to read. I can try text justification, textwrap, indentation, and see what looks better.

NataliaBondarenko commented 4 years ago

Hello! I made a prototype of sorting extensions by type. This option allows users to create, rename, update, and delete custom extension lists for sorting by type. It can be extensions related to each other (python files) or any extensions. Lists are stored in a file, and the utility provides only an interface for sorting.

The --sort-type argument accepts an arbitrary list of types. The reserved default value creates or restores a configuration file with default values, description and examples:

count-files --sort-type default

I made some predefined lists: archives, audio, videos, images, python files, ... However, they can also be renamed, changed, or deleted. Could you take a look at this? If you have any suggestions or ideas, please write.


I also found that the table is not displayed if the terminal width is too small and the frequency is too long. In textwrap, wrap(width=n) does not accept zero or negative values and raises ValueError.

ext_col_width = min((term_width - freq_col_width - 5),  # <- negative value here
                    max_word_width,
                    MAX_TABLE_WIDTH)

If devices with a small screen can have many files, we must deal with this. total_occurrences = 10000000000000000000000000000 :D
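
One possible way to deal with this, sketched under assumptions: clamp the computed width to a minimum of 1 so that `textwrap.wrap` never receives a non-positive width. The helper name `safe_ext_col_width` and the sample numbers are hypothetical; the real limits live in `count_files.settings`.

```python
from textwrap import wrap


def safe_ext_col_width(term_width, freq_col_width, max_word_width,
                       max_table_width, min_width=1):
    """Compute the extension column width, never going below min_width."""
    # 5 = padding and separator characters, as in the original expression.
    available = term_width - freq_col_width - 5
    return max(min_width, min(available, max_word_width, max_table_width))


freq = 10000000000000000000000000000
width = safe_ext_col_width(term_width=20, freq_col_width=len(str(freq)),
                           max_word_width=14, max_table_width=80)
print(width)
# With a positive width, wrap() no longer raises ValueError.
print(wrap('[no extension]', width=width))
```

A width of 1 is still unreadable, so a real fix might instead fall back to a simpler, non-tabular output when the terminal is too narrow; the clamp only prevents the crash.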

victordomingos commented 4 years ago

Hi!

I made a prototype of sorting extensions by type.

Maybe we should call this feature "group by type" (-g/--group-by-type). "Grouping" is more specific and more explicit of what we are trying to accomplish here, in my opinion, since we already have a "sorting" option that is related to alphabetic order.

This option allows users to create, rename, update, and delete custom extension lists for sorting by type. It can be extensions related to each other (python files) or any extensions. Lists are stored in a file, and the utility provides only an interface for sorting.

This is a good idea, since it allows some degree of personalisation that may make sense in a number of use cases. However, in a utility like this I would rather provide a default implementation that does not rely on the creation of an additional file. By using -g/--group-by-type, the user would get the default grouping (we must ponder what should be the best default categories, before releasing this feature, I will think more about it later). That default configuration file should live in the application folder itself, in order to avoid unnecessary file system pollution.

Then, the more advanced user, the one who would want to use a custom configuration, could simply use another CLI argument (e.g.: -cfg/--config-file) to specify a customised configuration file. I see two possible scenarios here:

  1. there is one global configuration file that lives in the corresponding user settings folder according to the OS convention (this could be useful: Config-path)

or:

  2. a configuration file is created for each folder/project, so it lives in the folder being scanned.

I believe option number 1 (global configuration file) is the best one for the scope of this application. Config Path would help in order to ensure that files are saved in the correct location for each platform, but I did not test it yet. Also, in this application I would prefer to stick to the original requirement of keeping external dependencies to the minimum, preferably just pure-python and optional packages. Features depending on third-party packages should be treated as "extras" and be able to display a nice and brief informative message when the required package is not installed.

The use of a user-level customised configuration file (`-cfg`/`--config-file`) 
requires a Python package that is currently not installed on your system. You 
can easily install it by using the following command (replace `python3.8` with 
the intended Python interpreter):

    python3.8 -m pip install config-path
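
A hedged sketch of how such an "extras" check could work without importing the optional package up front. The function name `missing_dependency_message` is hypothetical, and `config_path` as the import name of the config-path package is an assumption that has not been verified here.

```python
import importlib.util

INSTALL_HINT = (
    "The use of a user-level customised configuration file requires a Python "
    "package that is currently not installed on your system.\n"
    "You can install it with:\n\n"
    "    {python} -m pip install {package}\n"
)


def missing_dependency_message(module_name, package_name, python='python3'):
    """Return an install hint if module_name cannot be found, else None."""
    # find_spec() returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is None:
        return INSTALL_HINT.format(python=python, package=package_name)
    return None


# Assumption: the config-path package is imported as `config_path`.
message = missing_dependency_message('config_path', 'config-path')
if message:
    print(message)
```

Checking with `find_spec` keeps the core utility free of hard third-party imports, so the message is only shown when the user actually asks for the optional feature.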

The --sort-type argument accepts an arbitrary list of types. The reserved default value creates or restores a configuration file with default values, description and examples. count-files --sort-type default I made some predefined lists: archives, audio, videos, images, python files, ... However, they can also be renamed, changed, or deleted.

I am not sure that accepting a list of types is the best behaviour. Following the CLI options I have described above, I would say that the first time the application is run with the -cfg/--config-file option, it should tell the user something like this:

You haven't created a customised configuration file yet. In case you want to 
use the default configuration, please use the command `-g`/`--group-by-type` 
instead. Do you want to create the configuration file now (you can later 
customise it using a text editor)? [Y/n]

If the user presses Enter or enters Y/y, the file is created:

A custom configuration file has been created. To make any changes, just edit  
the file /home/username/xxxxxxxxxx/count-files.ini in your text editor.

The correct path should be displayed, of course.
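
The first-run flow described above could be sketched roughly as follows. The names `ensure_settings_file` and `TEMPLATE` are hypothetical, the template content is only a placeholder, and Enter is treated as "yes" to match the `[Y/n]` prompt in this comment.

```python
from pathlib import Path

# Placeholder template; the real file would carry examples and instructions.
TEMPLATE = """\
# count-files user settings (illustrative template)
[DEFAULT]
images = bmp, png, svg
"""


def ensure_settings_file(path, ask=input):
    """Create the settings file at path if it is missing and the user agrees.

    Returns True if the file exists when the function returns.
    `ask` is injectable so the prompt can be replaced in tests.
    """
    path = Path(path)
    if path.exists():
        return True
    answer = ask(f'No settings file found at {path}. Create it now? [Y/n] ')
    if answer.strip().lower() not in ('', 'y', 'yes'):
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(TEMPLATE, encoding='utf-8')
    # Show the real path so the user knows which file to edit.
    print(f'A custom configuration file has been created: {path}')
    return True
```

Because the question is only asked when the file is missing, the user sees it once; later runs find the file and skip the prompt.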

What do you think?

NataliaBondarenko commented 4 years ago

Hi! There are a few thoughts.

Naming

Yes, it can be called "Grouping." The name -type/--sort-type was chosen because it is in the same style as -alpha/--sort-alpha. The name of the argument may be shorter:

-g/--group-by-type
-g/--groups
-g/--group

Default implementation that does not rely on the creation of an additional file

I agree, the default configuration can be permanent. In this case, we can just make a dictionary with the necessary groups and extensions in the source code. In any case, the user can choose to use the default settings or not.

The best default categories

A few short lists with well-known file extensions, plus several lists of development-specific extensions, because the intended audience is developers.

One global configuration file that lives in the corresponding user settings

In the prototype, the configuration file is created in the same folder as settings.py. I planned to choose another folder for it later. For this file, we can create a folder in the user's home directory:

os.environ['HOME']/count_files/1.6.0/count_files.ini

This will allow us to keep user preferences when the package is updated or reinstalled.

We can use one file in which the user can specify settings for a specific path.

import configparser

config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())

# one global configuration file
# any list/lists in [DEFAULT]

# you can use it all at once
# without sections
simple_config = """
[DEFAULT]
audio = mp3, mid, midi
videos = avi, flv
images = bmp, png, svg
misc = other, ext
"""
# or with sections
advanced_config = """
[DEFAULT]
audio = mp3, mid, midi
videos = avi, flv
images = bmp, png, svg
misc = other, ext
[home]
path = /home/username
extensions = py, pyc, pyw, ${DEFAULT:audio}
[folder]
path = /home/username/folder
extensions = txt, db, md, jpg
"""

args_path = '/home/username'

config.read_string(advanced_config)

res = {}

# search for selected extensions in Counter
# one header, e.g. 'selected'
for section in config.sections():  # sections() never includes DEFAULT
    if config[section]['path'] == args_path:
        res['selected'] = config.get(section, 'extensions')
        break

# search for selected extension groups in Counter
# keys are needed for headings
if not res:
    for k, v in config['DEFAULT'].items():
        res[k] = v
print(res)
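
To connect this with the Counter data from the earlier example, here is a small sketch that turns the comma-separated values from `[DEFAULT]` into an extension-to-group lookup. The helper names `load_groups` and `group_counter` and the sample counts are illustrative assumptions.

```python
import configparser
from collections import Counter


def load_groups(config_text):
    """Map each group name in [DEFAULT] to a list of lowercase extensions."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {name: [ext.strip().lower() for ext in value.split(',') if ext.strip()]
            for name, value in parser.defaults().items()}


def group_counter(counter, groups):
    """Split a Counter of extensions into {group: [(ext, freq), ...]}."""
    ext_to_group = {ext: name for name, exts in groups.items() for ext in exts}
    result = {name: [] for name in groups}
    result['other'] = []  # catch-all for extensions not listed in any group
    for ext, freq in counter.most_common():
        result[ext_to_group.get(ext.lower(), 'other')].append((ext, freq))
    return result


config_text = """
[DEFAULT]
audio = mp3, mid, midi
images = bmp, png, svg
"""
counts = Counter({'PNG': 8, 'MP3': 2, 'SVG': 1, 'DAT': 1})
print(group_counter(counts, load_groups(config_text)))
```

Inverting the groups into a flat `ext_to_group` dictionary keeps the lookup per extension constant-time, at the cost of silently letting the last group win if an extension is listed twice.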

Possible solution

A custom configuration file can be created upon request. For example:

count-files -g create
creates a basic template with examples and a description. If for some reason we cannot create the file, we can keep an example count_files.ini in the repository on GitHub.

count-files -g read
reads and displays the default dictionary from the source code and the custom configuration file (if it exists).

count-files -g

if a custom configuration file exists:
    use extensions/groups from this file via ConfigParser module
else:
    use dictionary from source code with default groups

In addition, we can add a value that allows the default groups and the user groups to be used together.
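
The fallback logic and the combined mode could be sketched like this. `DEFAULT_GROUPS`, `read_user_groups`, `select_groups`, and the `combine` flag are hypothetical names, and the rule that user groups override default groups of the same name is an assumption, not something decided in this thread.

```python
import configparser
from pathlib import Path

# Stand-in for the default dictionary kept in the source code (illustrative values).
DEFAULT_GROUPS = {'images': ['jpg', 'png'], 'archives': ['zip', 'gz']}


def read_user_groups(path):
    """Read groups from the [DEFAULT] section of an ini file."""
    parser = configparser.ConfigParser()
    parser.read(path, encoding='utf-8')
    return {name: [e.strip() for e in value.split(',') if e.strip()]
            for name, value in parser.defaults().items()}


def select_groups(config_path, combine=False):
    """Use the user's groups if the file exists, otherwise the defaults."""
    path = Path(config_path)
    if not path.is_file():
        return dict(DEFAULT_GROUPS)
    user_groups = read_user_groups(path)
    if combine:
        # Assumption: user groups override default groups with the same name.
        return {**DEFAULT_GROUPS, **user_groups}
    return user_groups
```

Only `configparser` and `pathlib` from the standard library are needed, which fits the goal of avoiding third-party packages.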

The purpose of using the custom config file:

External dependencies

The configparser and os modules are sufficient for this option. I do not really want to use third-party libraries. Have you heard about the left-pad (JS library) incident?


If the user presses Enter

Could it be better to use explicit confirmation of the action? Everything except answer.lower() in ['y', 'yes'] counts as No.

I am not sure that accepting a list of types is the best behaviour.

Yes, it’s better to type less text.

I would say that on the first time the application is run with the -cfg/--config-file option the application tell the user something like this:

Is it possible to ask this question once (technically) and without adding another argument?

Edit: -g and -cfg

Also, what are the other advantages of having two separate arguments for sorting?

victordomingos commented 4 years ago

I am not sure about it, since it may not follow the platform convention on where configuration files should be placed. If a user has 100 applications and each one adds a configuration folder at ~/, it gets kind of polluted. The Config-path package I mentioned above was an attempt to create a cross-platform abstraction that would make it easier to respect each platform's convention. Reading through its description, it seems that on Windows it's not just a matter of specifying a path, since the correct path must be obtained from a Windows API call. I agree that in this application we should try to stay away from third-party packages (yes, I have read about that left-pad incident, and it's precisely what we should try to prevent). Currently, I am not able to dig into this subject as much as it deserves, but I will review any PRs as usual, as soon as I manage to do so.
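
For illustration only, a stdlib-only approximation of a per-platform configuration directory might look like the sketch below. The function name `user_config_dir` is hypothetical, and the chosen locations are common conventions rather than a verified implementation: on Windows it reads the `APPDATA` environment variable instead of making the API call mentioned above.

```python
import os
import sys
from pathlib import Path


def user_config_dir(app_name, platform=sys.platform, env=os.environ):
    """Return a per-user configuration directory for app_name (not created)."""
    home = Path.home()
    if platform.startswith('win'):
        # Simplification: environment variable instead of a Windows API call.
        base = Path(env.get('APPDATA') or home / 'AppData' / 'Roaming')
    elif platform == 'darwin':
        base = home / 'Library' / 'Application Support'
    else:
        # XDG convention used by many Linux/Unix desktop environments.
        base = Path(env.get('XDG_CONFIG_HOME') or home / '.config')
    return base / app_name


print(user_config_dir('count-files') / 'count-files.ini')
```

Passing `platform` and `env` as parameters makes each branch testable on any machine, and keeps the files out of the top level of the home directory.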

-g / --group -> use the default group, loaded from our dictionary in the source code.
-s / --settings -> use the user's settings file, if it exists at PATH. If it doesn't exist, ask the user if a new settings file, containing some examples/instructions, should be created at that location.

So, two separate arguments, one for default grouping, and the other for loading user settings from file. These are two distinct operations, and using separate arguments allows for future expansion on user settings. For instance, the settings file could include some configurations for tables or for other default parameters.
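
A minimal argparse sketch of the two arguments, showing only the options discussed here and none of the utility's existing ones. Letting `-s` take the settings path as its value is an assumption; the thread only says the file is looked up "at PATH".

```python
import argparse


def build_parser():
    """Build a parser with only the two grouping-related options."""
    parser = argparse.ArgumentParser(prog='count-files')
    parser.add_argument('path', nargs='?', default='.',
                        help='folder to scan (defaults to the current directory)')
    parser.add_argument('-g', '--group', action='store_true',
                        help='group extensions using the built-in default groups')
    # Assumption: the settings file location is passed as the option value.
    parser.add_argument('-s', '--settings', metavar='PATH',
                        help='group extensions using the settings file at PATH')
    return parser


args = build_parser().parse_args(['-s', 'my-settings.ini', 'some/folder'])
print(args.group, args.settings, args.path)
```

Keeping `--settings` separate from `--group` leaves room for the settings file to carry other preferences later, such as table options, without changing what `-g` means.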

[Y/n] -> My idea was a simple y/Y or N/n for yes or no, but we may also accept yes/no. The upper case Y implies that it is the default option (the one that is assumed when the user just presses ENTER). We can opt to make No the default option, though, or even add some more explanatory text that clearly states what happens if the user presses Enter with no answer. Maybe No would be a better default option indeed.
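
Both variants of the prompt can be covered by one small helper, sketched here with hypothetical names `confirm` and `prompt_suffix`. With `default=False`, anything other than y/yes counts as No, including a bare Enter.

```python
def confirm(answer, default=False):
    """Interpret a yes/no answer; an empty answer returns the default."""
    text = answer.strip().lower()
    if not text:
        return default
    return text in ('y', 'yes')


def prompt_suffix(default=False):
    """Return '[Y/n]' or '[y/N]' so the capital letter marks the default."""
    return '[Y/n]' if default else '[y/N]'


question = f'Create the settings file now? {prompt_suffix(default=False)} '
print(question)
print(confirm(''), confirm('YES'), confirm('maybe'))
```

Deriving the suffix from the same `default` value keeps the displayed hint and the actual behaviour from drifting apart.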

NataliaBondarenko commented 4 years ago

User settings and Platform convention on where configuration files should be placed

-s/--settings

The future expansion of user preferences is an interesting option. The settings file may also contain user-defined supported types. On Windows, os.environ['HOME'] seems to be the usual place for such files. I have settings, logs, and caches from different applications there. The file location can be selected depending on which OS the program is installed on. We can think about it later. Custom settings may be done in the next version. Now I want to start doing at least something with a file preview.

victordomingos commented 4 years ago

Seems ok to me.