swcarpentry / shell-novice

The Unix Shell
http://swcarpentry.github.io/shell-novice/

Does `ls -s` depend on the version of bash? #1302

Closed nbehrnd closed 2 years ago

nbehrnd commented 2 years ago

observation:

This question concerns running ls -s, as introduced in episode 2, on the pristine, uncompressed test data fetched from here (attached again below). In contrast to the output the page currently reports

$ cd ~/Desktop/shell-lesson-data
$ ls -s exercise-data
total 28
 4 animal-counts   4 creatures  12 numbers.txt   4 proteins   4 writing

I observe an output of

~/Desktop/shell-lesson-data$ ls -s exercise-data/
total 20
4 animal-counts  4 creatures  4 numbers.txt  4 proteins  4 writing

where the size reported for numbers.txt differs. Interestingly, the thunar file manager reports a size of 13 bytes for this very file.

question:

Is the difference due to a different numbers.txt file, and/or to an update of bash released after the lesson was designed?

system used:

The observation refers to bash (GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)) as provided by the repositories of Debian 12/bookworm (branch testing), on an ext4 partition, with thunar 4.16.10 (Xfce 4.16). The most recent system upgrade ran earlier today, 2022-05-27. ls --version reports ls (GNU coreutils) 8.32, dated 2020.

shell-lesson-data.zip

gcapes commented 2 years ago

Thanks for the issue!

This gets computer-science complicated beyond my understanding quite quickly, but I'll make a start on things.

From the man pages, we discover that ls -s displays size in blocks.

   -s, --size
         print the allocated size of each file, in blocks

OK, so what's a block? The best explanation I have found is here on reddit, which basically says that a block is the smallest unit of storage the file system allocates, and that if your file is smaller than the block size, the rest of the block is left unused.

So why is it different on your system? Well, it appears that the block size is system-dependent: https://askubuntu.com/questions/641900/how-file-system-block-size-works

FWIW the file size is shown as 13 bytes on my system too:

$ ls -lh exercise-data/
total 28K
drwxrwxr-x 2 mbexegc2 mbexegc2 4.0K Sep 16  2021 animal-counts
drwxrwxr-x 2 mbexegc2 mbexegc2 4.0K Sep 16  2021 creatures
-rw-rw-r-- 1 mbexegc2 mbexegc2   13 Sep 16  2021 numbers.txt
drwxrwxr-x 2 mbexegc2 mbexegc2 4.0K Sep 16  2021 proteins
drwxrwxr-x 2 mbexegc2 mbexegc2 4.0K Sep 28  2021 writing
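
The difference between the 13 bytes of content and the size shown by ls -s can be inspected directly. The following is a minimal sketch, assuming bash and GNU coreutils (stat -c, stat -f, ls --block-size); the helper names apparent_size and allocated_size and the stand-in file demo.txt are made up for illustration and are not part of the lesson. It also prints where ls lives, since ls is a program of its own rather than a part of bash.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU coreutils; helper names and demo.txt are hypothetical.
set -eu

# Number of bytes of content in the file (what `ls -l` shows).
apparent_size() { stat -c %s "$1"; }

# Bytes allocated on disk: block count (%b) times the unit of that count (%B).
allocated_size() {
  local blocks unit
  blocks=$(stat -c %b "$1")
  unit=$(stat -c %B "$1")
  echo $(( blocks * unit ))
}

workdir=$(mktemp -d)
demo="$workdir/demo.txt"
printf '10\n2\n19\n22\n6\n' > "$demo"   # 13 bytes of content

echo "ls is provided by: $(type -P ls)"   # an executable on disk, not a bash builtin
echo "apparent size:     $(apparent_size "$demo") bytes"
echo "allocated size:    $(allocated_size "$demo") bytes"
echo "file system block: $(stat -f -c %S "$workdir") bytes"
ls -s --block-size=1K "$demo"             # allocated size in units of 1024 bytes

rm -r "$workdir"
```

On a file system with 4096-byte blocks, a 13-byte file would typically be listed as 4 by ls -s, because GNU ls counts in units of 1024 bytes by default; other file systems may report different values.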

nbehrnd commented 2 years ago

Thank you. When asking the question, I was thinking not in terms of blocks (individual storage containers of fixed capacity each) but in terms of the amount of information, which may be split over multiple blocks (on occasion only partly filled).

For me, this is a twofold reminder. First, of how hash sums like md5sum and sha256sum are computed; though there, if a block is not filled to the rim with original data "to hash", the algorithms pad it (with a single 1 bit, zeroes, and the length of the message). The result has a fixed length, too: even submitting the single character A yields a digest of 64 hexadecimal characters before the separator (wc -c reports 65 because it also counts the trailing newline), running

$ echo A | sha256sum
06f961b802bc46ee168555f066d28f4f0e9afdf3f88174c1ee6f9de004fc30a0  -
$ echo A | sha256sum | cut -d " " -f 1 | wc -c
65
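
The 65 reported by wc -c is the 64 hexadecimal characters of the digest plus the newline that ends the line. A small sketch, assuming sha256sum from GNU coreutils and a made-up helper digest_length, suggests that the digest length does not depend on the length of the input:

```shell
#!/usr/bin/env bash
# Sketch only: assumes sha256sum from GNU coreutils; digest_length is a hypothetical helper.
set -eu

# Count the hexadecimal characters of the SHA-256 digest of standard input,
# dropping the trailing newline before counting.
digest_length() {
  sha256sum | cut -d ' ' -f 1 | tr -d '\n' | wc -c
}

printf 'A' | digest_length                             # 64
printf 'a considerably longer input' | digest_length   # 64
: | digest_length                                      # empty input, still 64
```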

Second, thinking in blocks bears similarities to how, e.g., Fortran vectorizes operations after data pass via RAM and caches into the registers of a CPU. For an addition of two one-dimensional arrays, for example, instead of the individual computation on a(1) and b(1) to yield c(1) in one processing cycle (blue frame), the vectorized approach picks up multiple elements of array a and the corresponding elements of array b (red frame) for multiple computations performed simultaneously. The pick for the next cycle of computation would then advance not by one pair of elements (blue frame), but by the width of the wider block (red frame):

Fortran_01

The width of these «blocks of advancement» depends on the processor's architecture (size/capacity of the registers available) and on the floating-point representation (single, double, quadruple precision) used during the computation. Yet overall, this block-wise progression may accelerate the computation beyond the one-by-one approach, independent of parameters like processor frequency, use of multiple cores per CPU, etc.

Fortran_02
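
How the block width follows from register size and precision is simple arithmetic. Here is a minimal sketch, assuming a hypothetical 256-bit vector register (the actual width depends on the processor) and the made-up helper names lanes and steps:

```shell
#!/usr/bin/env bash
# Sketch only: the 256-bit register width is an assumption, not taken from the thread.
set -eu

# Number of elements that fit side by side into one register.
lanes() {   # usage: lanes REGISTER_BITS ELEMENT_BITS
  echo $(( $1 / $2 ))
}

# Number of block-wise steps needed to process N elements (rounded up).
steps() {   # usage: steps N LANES
  echo $(( ($1 + $2 - 1) / $2 ))
}

register_bits=256
n=1000
for element_bits in 32 64; do   # single and double precision
  l=$(lanes "$register_bits" "$element_bits")
  echo "${element_bits}-bit elements: $l per register, $(steps "$n" "$l") steps for $n additions"
done
```

Under these assumptions, 1000 additions take 125 steps in single precision and 250 in double precision, compared with 1000 steps for the one-by-one approach.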

Both images are (now edited) screen photos taken during the class Fortran for Scientific Computing, led by Geert Jan Bex, Mag Selwa and Jan Ooghe.