sflanaga / du2

du2 0.2.2 Fast parallel file system lister / usage statistics summary

Latency vs throughput: The theory here is that parallel listing overcomes latency on remote file systems by keeping multiple requests in flight at once. Remote file systems capable of good throughput usually still have higher latency than local file systems, largely because the OS keeps a faster, exclusive cache of local file system metadata.
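
As a rough illustration of the latency argument, the sketch below uses an idealized model that is an assumption of this example, not a measurement of du2: every directory listing costs one round trip, and requests in flight overlap perfectly. The function name and parameters are hypothetical.

```python
import math

def estimated_listing_seconds(num_dirs, latency_s, in_flight):
    """Idealized estimate: one round trip per directory, `in_flight` of them overlapping.

    Assumption: ignores throughput limits, server contention, and the fact that
    subdirectories only become known after their parent has been listed.
    """
    return math.ceil(num_dirs / in_flight) * latency_s

# Hypothetical numbers: 100,000 directories at 2 ms per round trip.
serial = estimated_listing_seconds(100_000, 0.002, in_flight=1)
parallel = estimated_listing_seconds(100_000, 0.002, in_flight=16)
print(f"serial ~{serial:.1f}s, 16 in flight ~{parallel:.1f}s")
```

Under these assumptions the serial walk takes roughly 200 seconds and 16 overlapping requests take roughly 12.5 seconds. Real gains depend on how well the remote file system handles concurrent requests.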

Each "opendir" is read to completion before any of its children are visited, so the time a directory stays open is minimized, at the cost of more memory than straight recursion. This may also improve performance by reducing contention on the remote file system, compared with holding the directory open while recursing into its children. Subdirectories found are queued for other threads to list to completion. Because the number of directories may be large, the queue can grow without bound. It must be unbounded: each worker is also a master (a creator of new work), so a bounded queue could fill up while every worker is blocked trying to add to it, which is a deadlock.
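
The queueing scheme described above can be sketched in Python as follows. This is a minimal illustration, not du2's actual implementation: `parallel_walk` and its parameters are hypothetical names, and it records only the bytes of the files directly inside each directory.

```python
import os
import queue
import threading

def parallel_walk(root, num_workers=4):
    """Map each directory under `root` to the total size of its direct files."""
    pending = queue.Queue()  # maxsize=0 means unbounded, so put() never blocks
    sizes = {}
    lock = threading.Lock()

    def worker():
        while True:
            path = pending.get()
            try:
                if path is None:  # sentinel: no more work
                    return
                subdirs, total = [], 0
                try:
                    # Read the whole directory, closing the handle before descending.
                    with os.scandir(path) as entries:
                        for entry in entries:
                            if entry.is_symlink():
                                continue  # symbolic links are not followed
                            if entry.is_dir(follow_symlinks=False):
                                subdirs.append(entry.path)
                            elif entry.is_file(follow_symlinks=False):
                                total += entry.stat(follow_symlinks=False).st_size
                except OSError:
                    pass  # unreadable directory: keep whatever was counted
                with lock:
                    sizes[path] = total
                # The worker is also a producer: children go back on the queue.
                for sub in subdirs:
                    pending.put(sub)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pending.put(root)
    pending.join()  # returns once every queued directory has been processed
    for _ in threads:
        pending.put(None)
    for t in threads:
        t.join()
    return sizes
```

If the queue had a fixed `maxsize`, every worker could block in `put()` at the same moment with nobody left to call `get()`, which is the deadlock the unbounded queue avoids.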

Because directories are evaluated in no particular order, usage summaries require aggregating each directory's totals into its ancestors, using a tree that contains ALL directories. This tree accounts for the bulk of the memory used, and its size is proportional to the number of directories.
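
One way to picture the roll-up is the sketch below. It assumes each directory's own byte count is already known, in any order, and adds it to the directory and to every ancestor up to the search root. The function name `roll_up` and the flat dictionary standing in for the tree are assumptions of this example, not du2's data structures.

```python
import os
from collections import defaultdict

def roll_up(direct_sizes, root):
    """Return cumulative sizes: each directory's bytes plus those of all descendants."""
    totals = defaultdict(int)
    for path, size in direct_sizes.items():  # arrival order does not matter
        node = path
        while True:
            totals[node] += size
            if node == root:
                break
            parent = os.path.dirname(node)
            if parent == node:  # hit the file system root without meeting `root`
                break
            node = parent
    return dict(totals)

# Hypothetical per-directory byte counts, listed in no particular order.
direct = {"/r/a/b": 5, "/r": 3, "/r/c": 1, "/r/a": 10}
print(roll_up(direct, "/r"))
```

The result holds one entry per directory, which is why memory use grows with the directory count rather than the file count.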

Symbolic links are not followed.

USAGE:
    du2 [OPTIONS] <DIRECTORY>

OPTIONS:
    -d, --delimiter <delimiter>                
            Disk usage mode - do not write the files found [default: |]

        --die-in <die-in>                      
            write cpu time consumed by each thread

        --exclude-re <exclude-re>              
            Exclude paths that match this RE

            RE matching is done on the whole path for both directories and files.  Paths are not canonicalized.
        --file-newer-than <file-newer-than>    
            Only count/sum entries newer than this age

        --file-older-than <file-older-than>    
            Only count/sum entries older than this age

    -h, --help                                 
            Prints help information

    -l                                         
            Write file list

        --extra                                

    -t, --worker-threads <no-threads>          
            Number worker threads

            defaults to 0 which means # of cpus or at least 4 [default: 0]
        --progress                             
            Writes progress stats on every ticker interval

        --re <re>                              
            Keep only paths that match this RE

            Note that this can be used with the exclude_re, but this one is checked first and then the other if set. RE
            matching is done on the whole path for both directories and files.  Paths are not canonicalized.
        --write_thread_status                  
            Writes thread status every ticker interval - used to debug things

        --t-status-on-key                      
            Writes thread status when stdin sees a line entered by user

    -i, --ticker-interval <ticker-interval>    
            Interval at which stats are written - 0 means no ticker is run [default: 200]

    -n <top-n-limit>                           
            Report top usage limit [default: 10]

    -u, --usage-trees                          
            Write disk usage summary

    -V, --version                              
            Prints version information

    -v                                         
            Verbosity - use more than one v for greater detail

        --write-thread-cpu-time                
            write cpu time consumed by each thread

ARGS:
    <DIRECTORY>    
            Directory to search
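
The order in which `--re` and `--exclude-re` are applied, as described in the option help, can be sketched like this. The function `keep_path` is hypothetical, and the sketch assumes an unanchored search over the full path string. The help text does not say whether du2 anchors the pattern.

```python
import re

def keep_path(path, include_re=None, exclude_re=None):
    """Apply the include pattern first, then the exclude pattern, to the whole path."""
    if include_re is not None and not re.search(include_re, path):
        return False  # an include pattern is set and the path does not match it
    if exclude_re is not None and re.search(exclude_re, path):
        return False  # the path matched the exclude pattern
    return True

# The path is used as given; it is not canonicalized first.
print(keep_path("/data/logs/app.log", include_re=r"\.log$"))
print(keep_path("/data/logs/app.log", include_re=r"\.log$", exclude_re=r"/logs/"))
```

Because matching runs against the whole path, a pattern such as `/logs/` excludes everything beneath a `logs` directory, not just entries named `logs`.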