sharkdp / hexyl

A command-line hex viewer
Apache License 2.0

Units of measurement of bytes? #44

Closed ErichDonGubler closed 4 years ago

ErichDonGubler commented 5 years ago
Unit: Decimal
  Implemented? [x]
  Description: A decimal integer, which is equivalent to specifying a single byte unit for the count.
  Examples: 23, 1024
  Suggested implementation: u64::from_str(...)

Unit: Hex
  Implemented? [x] Implemented in #45.
  Description: A hexadecimal integer. Specified with a leading 0x.
  Examples: 0x17, 0x100
  Suggested implementation: u64::from_str_radix(...)

Unit: Blocks
  Implemented? [ ]
  Description: A single block, which is by default 512 bytes but configurable via config flag.
  Examples: -b 512 -n 1block
  N.B.: one cannot use a block unit to define the block size.
  Suggested implementation: Add a flag to optionally define block size, then check for a trailing block when parsing numbers. Multiply by block size.

Unit: Bytes
  Implemented? [ ]
  Description: A byte size familiar to most IT professionals. Specified by B at the end of the count, and can include an optional magnitudinal spec like kilobytes (K) or megabytes (M).
  Examples:
    • 23B: 23 bytes
    • 9KB: 9 kilobytes
  Suggested implementation: Implement a regex of the form (?P<count>\d+)(?P<magnitude_unit>[KM]?)B.
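
For illustration, the Decimal and Hex entries could be handled roughly as follows. This is only a sketch, not the code from #45: `parse_count` is a hypothetical helper name, and it assumes the whole argument is either plain decimal digits or hex digits after a leading 0x.

```rust
/// Parse a count given either as decimal ("1024") or as hexadecimal with a
/// leading "0x" ("0x100"). `parse_count` is a hypothetical name used only
/// for this sketch.
fn parse_count(input: &str) -> Option<u64> {
    match input.strip_prefix("0x") {
        // Hex: parse the digits after the "0x" prefix in base 16.
        Some(hex_digits) => u64::from_str_radix(hex_digits, 16).ok(),
        // Decimal: `str::parse::<u64>` goes through `u64::from_str`.
        None => input.parse::<u64>().ok(),
    }
}
```

With this sketch, `parse_count("0x17")` and `parse_count("23")` both yield `Some(23)`, while an input such as `"oops"` yields `None`.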

Other open questions

ErichDonGubler commented 5 years ago

I've added a description to this issue! :)

Looks like #45 handles hex, so I'll note that.

sharkdp commented 5 years ago

Using b to mean block could be confusing if we also add (explicit) support for byte/B, kilobyte/kB, etc. I'd rather go with block as a unit.

ErichDonGubler commented 5 years ago

@sharkdp: Okay, so when you say "b" are you referring to the short name of the flag specifying block size, the unit used in a size-typed option, or both? I'm going to assume the second, but just wanted to make sure I'm not missing an ambiguity there. :)

sharkdp commented 5 years ago

Okay, so when you say "b" are you referring to the short name of the flag specifying block size,

no, using -b to mean --block-size is fine for me.

the unit used in a size-typed option

yes. I think we should reserve --length 64b to mean "64 bytes" instead of "64 blocks" because we might also add --length 64kB or --length 64kb.

ErichDonGubler commented 5 years ago

So, if I were to model the byte units with a regex, I'm thinking something like:

(?P<count>\d+)(?P<magnitude_unit>[kKmM]?)(?P<bits_or_bytes>[bB])

Questions/clarifications about the above:


For block, I'm just thinking of accepting:

(?P<count>\d+)blocks?
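
To make the two patterns concrete, here is a rough hand-rolled equivalent that avoids pulling in a regex engine. It is a sketch under several assumptions: the whole argument must match, only ASCII digits are accepted, both b and B are read as bytes (following the --length 64b comment above), and the magnitude is returned as-is rather than multiplied out. The names `Magnitude`, `Unit`, and `parse_quantity` are made up for this example.

```rust
/// Optional magnitude prefix in front of the byte unit.
#[derive(Debug, PartialEq)]
enum Magnitude {
    One,
    Kilo,
    Mega,
}

/// Unit that follows the count.
#[derive(Debug, PartialEq)]
enum Unit {
    Bytes(Magnitude),
    Blocks,
}

/// Split `input` into a count and a unit, or return `None` if it does not
/// match either pattern. Hypothetical helper, not part of hexyl.
fn parse_quantity(input: &str) -> Option<(u64, Unit)> {
    // Leading ASCII digits form the count; everything after is the suffix.
    let digits_end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    let (digits, suffix) = input.split_at(digits_end);
    // An empty or overflowing digit string fails to parse and yields None.
    let count = digits.parse::<u64>().ok()?;
    let unit = match suffix {
        "block" | "blocks" => Unit::Blocks,
        // Assumption: lower-case "b" also means bytes, never bits.
        "b" | "B" => Unit::Bytes(Magnitude::One),
        "kb" | "kB" | "Kb" | "KB" => Unit::Bytes(Magnitude::Kilo),
        "mb" | "mB" | "Mb" | "MB" => Unit::Bytes(Magnitude::Mega),
        _ => return None,
    };
    Some((count, unit))
}
```

Here `parse_quantity("64kB")` gives `Some((64, Unit::Bytes(Magnitude::Kilo)))` and `parse_quantity("2blocks")` gives `Some((2, Unit::Blocks))`. A bare `"64"` is rejected, because both patterns require a unit.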

Questions about this one:

ErichDonGubler commented 5 years ago

I've updated the OP with conservative requirements for now.

aswild commented 4 years ago

@sharkdp - it's been over a year, but would you be interested in a PR for some flavor of this feature?

I propose supporting roughly the same set of suffixes as GNU coreutils programs like head or dd:

Custom block sizes could be added as discussed above. I don't see much use adding bits as a unit since a hexdump is fairly byte-centric.

ErichDonGubler commented 4 years ago

@aswild: I started working on this yesterday, I have a branch that I'll be sending as a PR either today or tomorrow hopefully.

aswild commented 4 years ago

ah cool, I wrote my own version in https://github.com/aswild/hexyl/commit/1c116b0be764080835b7911c69e77fcd8e4e0bfb and https://github.com/aswild/hexyl/commit/9191489ec2412554927111bdba15ae64936d4f81 too, just haven't squashed them to a PR branch

sharkdp commented 4 years ago

@aswild @ErichDonGubler Thank you very much for your work on this.

A few comments / questions:

ErichDonGubler commented 4 years ago
  • Allowing case-insensitive parsing seems okay to me, as there is no room for ambiguity (I think) - because we only care about the "byte" unit. In my Insect general purpose scientific calculator, I would not allow things like mB (would be parsed as "millibyte") or KB (could be confused with Kelvin · Byte). But for hexyl, I guess it's fine.

This doesn't seem worth discussing to death to me, so no pressure from me one way or another here. I did it because it was a straightforward thing to add and the user experience seemed to outweigh that of case sensitivity.

  • How about the k, M, G, and T shorthand notation for multiples of 2^10? Isn't it really confusing that k = 2^10 byte but kB = 10^3 byte? Can we leave this notation out? Or is it too common in other Unix tools?

How about we pull this into a separate PR for discussion? To kickstart that discussion: We could defer in forwards-compatible ways. I see a few strategies that may actually be orthogonal approaches:

I think in general hexyl doesn't have very good diagnostics for its interface right now. Invalid units, for instance, get totally thrown away if they don't parse correctly, which I consider a subpar user experience at best. Mind if I open another issue about diagnostics in general?

  • #90 proposes to add a -b short option. I didn't think about this previously, but hexdump has a -b option that has a completely different meaning. Should we maybe keep -b free for now?

Sure, I can pull this out.

sharkdp commented 4 years ago

This doesn't seem worth discussing to death to me, so no pressure from me one way or another here. I did it because it was a straightforward thing to add and the user experience seemed to outweigh that of case sensitivity.

:+1:

  • We could start with the most strict user interface and relax as we decide what's acceptable and give ourselves enough time to consider the design space? Perhaps I've been lurking on `rustc` development for too long, heh...

Sounds good to me. Let's leave these shorthand notations out for now.
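
For reference, the ambiguity behind leaving them out is easy to show numerically. The sketch below assumes the reading from the earlier comment, where a bare k means 2^10 bytes and kB means 10^3 bytes. The constant and function names are invented for the example.

```rust
/// Multiplier for the bare `k` shorthand, assumed to mean 2^10 bytes.
const K_SHORTHAND: u64 = 1 << 10;
/// Multiplier for `kB`, meaning 10^3 bytes.
const KB: u64 = 1_000;

/// Byte count for `<count>k`; `None` if the product overflows a u64.
fn bytes_with_k(count: u64) -> Option<u64> {
    count.checked_mul(K_SHORTHAND)
}

/// Byte count for `<count>kB`; `None` if the product overflows a u64.
fn bytes_with_kb(count: u64) -> Option<u64> {
    count.checked_mul(KB)
}
```

Under these assumptions 64k would be 65536 bytes and 64kB would be 64000 bytes, a difference of 1536 bytes for two spellings that look nearly identical.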

Sure, I can pull this out.

:+1:

I think in general hexyl doesn't have very good diagnostics for its interface right now. Invalid units, for instance, get totally thrown away if they don't parse correctly, which I consider a subpar user experience at best. Mind if I open another issue about diagnostics in general?

Absolutely. I didn't know that and would consider it a bug. Looks like the error handling can be improved a lot.
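
One possible direction for better diagnostics is to have the parser return a descriptive error instead of dropping invalid input. The sketch below is not hexyl's actual code: `ParseError` and `parse_length` are hypothetical names, and the accepted unit set (no unit, or B) is deliberately minimal so the error paths stay visible.

```rust
use std::fmt;

/// Reasons a length argument can be rejected (hypothetical error type).
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingCount,
    InvalidCount(String),
    UnknownUnit(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::MissingCount => write!(f, "expected a number before the unit"),
            ParseError::InvalidCount(digits) => write!(f, "count {:?} does not fit in 64 bits", digits),
            ParseError::UnknownUnit(unit) => write!(f, "unknown unit {:?}", unit),
        }
    }
}

/// Parse `<count>[B]` into a byte count, reporting why invalid input failed.
fn parse_length(input: &str) -> Result<u64, ParseError> {
    // Leading ASCII digits form the count; the remainder is the unit.
    let digits_end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    let (digits, unit) = input.split_at(digits_end);
    if digits.is_empty() {
        return Err(ParseError::MissingCount);
    }
    // Digits are ASCII-only here, so the only parse failure is overflow.
    let count = digits
        .parse::<u64>()
        .map_err(|_| ParseError::InvalidCount(digits.to_string()))?;
    match unit {
        "" | "B" => Ok(count),
        other => Err(ParseError::UnknownUnit(other.to_string())),
    }
}
```

A caller could then print the error and exit, so that something like `64furlongs` produces a message naming the bad unit rather than being silently ignored.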

ErichDonGubler commented 4 years ago

Let's leave these shorthand notations out for now.

Sounds good. Did you want to create a separate issue for {k,m,g,t} to be discussed?

sharkdp commented 4 years ago

Sounds good. Did you want to create a separate issue for {k,m,g,t} to be discussed?

I'd rather wait until someone complains that they are missing :smile:. I personally like it better without them.

sharkdp commented 4 years ago

closed via #90 by @ErichDonGubler

sharkdp commented 4 years ago

Released in v0.8.0.