python / cpython

The Python programming language
https://www.python.org
Other
62.82k stars 30.09k forks source link

long <-> byte-string conversion #40083

Closed bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 closed 16 years ago

bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago
BPO 923643
Nosy @josiahcarlson, @mdickinson, @pitrou, @tiran
Superseder
  • bpo-1023290: Conversion of longs to bytes and vice-versa.
  • Files
  • base256.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/mdickinson' closed_at = created_at = labels = ['interpreter-core'] title = 'long <-> byte-string conversion' updated_at = user = 'https://bugs.python.org/trevp' ``` bugs.python.org fields: ```python activity = actor = 'mark.dickinson' assignee = 'mark.dickinson' closed = True closed_date = closer = 'mark.dickinson' components = ['Interpreter Core'] creation = creator = 'trevp' dependencies = [] files = ['5897'] hgrepos = [] issue_num = 923643 keywords = ['patch'] message_count = 19.0 messages = ['45671', '45672', '45673', '45674', '45675', '45676', '45677', '45678', '45679', '45680', '45681', '45682', '45683', '45684', '59827', '59868', '60134', '61809', '62878'] nosy_count = 6.0 nosy_names = ['phr', 'josiahcarlson', 'mark.dickinson', 'trevp', 'pitrou', 'christian.heimes'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = None status = 'closed' superseder = '1023290' type = None url = 'https://bugs.python.org/issue923643' versions = ['Python 2.6'] ```

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Sometimes you want to turn a long into a byte-string, or vice-versa. This is useful in cryptographic protocols, and probably other network protocols where you need to exchange large integers.

    In 2.4, you can handle unsigned longs with this:

    def stringToLong(s):
        return long(binascii.hexlify(s), 16)
    
    def longToString(n):
        return binascii.unhexlify("%x" % n)

    However, these functions are slower than they need to be, they're kinda kludgey, and they don't handle negative values.

    So here's a proposal:

    def stringToLong(s):
        return long(s, 256)
    
    def longToString(n):
        return n.tostring()

    These functions operate on big-endian, 2's-complement byte-strings. If the value is positive but has its most- significant-bit set, an extra zero-byte will be prepended. This is the same way OpenSSL and (I think) GMP handle signed numbers.

    These functions are \~5x faster than the earlier ones, they're cleaner, and they work with negative numbers.

    If you only want to deal with unsigned positive numbers, you'll have to do some adjustments:

    def stringToLong(s):
        return long('\0'+s, 256)
    
    def longToString(n):
        s = n.tostring()
        if s[0] == '\0' and s != '\0':
            s = s[1:]
        return s

    That's not ideal, but it seems better than any interface change I could think of.

    Anyways, the patch adds this to longs. It should be added to ints too, and I guess it needs tests etc.. I can help with that, if the basic idea is acceptable.

    Trevor

    91e69f45-91d9-4b12-87db-a02908296c81 commented 20 years ago

    Logged In: YES user_id=72053

    I think those funcs should take an optional extra arg to say you want unsigned. That's cleaner than prepending '0'. In cryptography you usually do want unsigned.

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Logged In: YES user_id=973611

    You're right, we should support unsigned strings somehow.
    Adding another argument to the int() and long() constructors would be messy, though. How about:

    n = long(s, 256) #unsigned
    n = long(s, -256) #signed
    
    n.tounsignedstring()
    n.tosignedstring()

    The "-256" thing is a hack, I admit.. but it kinda grows on you, if you stare it at awhile :-)...

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Logged In: YES user_id=973611

    You're right, we should support unsigned strings somehow.
    Adding another argument to the int() and long() constructors would be messy, though. How about:

    n = long(s, 256) #unsigned
    n = long(s, -256) #signed
    
    n.tounsignedstring()
    n.tosignedstring()

    The "-256" thing is a hack, I admit.. but it kinda grows on you, if you stare it at awhile :-)...

    91e69f45-91d9-4b12-87db-a02908296c81 commented 20 years ago

    Logged In: YES user_id=72053

    How about just punting signed conversion. I don't think two's complement makes much sense for arbitrary precision longs. Have some separate representation for negative longs if needed. If you call hex() on a large negative number, you get a hex string with a leading minus sign. For base 256, you can't reserve a char like that, so I guess you have to just throw an error if someone tries to convert a negative long to a string. If you want a representation for signed longs, ASN1 DER is probably an ok choice. I agree with Guido that the binascii module is a good place to put such a function. Twos complement can work if you specify a fixed precision, but that sure complicates what this started out as.

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Logged In: YES user_id=973611

    I think 2's complement makes good sense for arbitrary precision longs. This is how OpenSSL and GMP handle them.
    It's also how the ASN.1 BER/DER encodings handle integers: these encodings just prepend tag and length fields to the big- endian 2's complement value. I.e.: If you want to extract RSA public values from an X.509 certificate, they'll be in 2's complement (well, they'll always be positive... but they'll have an extra zero byte if necessary).

    Since the functionality for 2's complement is already in the C code it's easy to expose through a patch. So I'm still in favor of presenting it.

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Logged In: YES user_id=973611

    My last comment was wrong: GMP's raw input/output format uses big-endian positive values, with the sign bit stored separately.

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Logged In: YES user_id=973611

    My last comment was wrong: GMP's raw input/output format uses big-endian positive values, with the sign bit stored separately.

    ef6b2a61-f027-4805-a66a-cde4eee277c3 commented 20 years ago

    Logged In: YES user_id=341410

    I'm curious to know if anyone would object to optional minimum or maximum or both arguments, or even some additional methods that would result in a potentially constrained string output from long.tostring()?

    If I were to split the functionality into three methods, they would be as follows...

    def atleast(long, atl):
        if atl < 0:
            raise TypeError("atleast requires a positive integer
    for a minimum length")
        a = long.tostring()
        la = len(a)
        return (atl-la)*'\o' + a
    
    def atmost(long, atm):
        if atm < 0:
            raise TypeError("atleast requires a positive integer
    for a minimum length")
        a = long.tostring()
        la = len(a)
        return a[:atm]
    
    def constrained(long, atl, atm):
        if atm < atl:
            raise TypeError("constrained requires that the
    maximum length be larger than the minimum length")
        if atl < 0 or atm < 0:
            raise TypeError("constrained requires that both
    arguments are positive")
        a = long.tostring()
        la = len(a)
        return ((atl-la)*'\o' + a)[:atm]

    I personally would find use for the above, would anyone else have use for it?

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Logged In: YES user_id=973611

    Uploading a new patch (base256.diff). This implements only the string-> long (or int) conversion. It adds support for radix 256 (unsigned) or -256 (2's-complement signed) to the int() and long() built-ins: int("\xFF\xFF\xFF", 256) -> 0xFFFFFF int("\xFF\xFF\xFF", -256) -> -1 long(os.urandom(128), 256) -> 1024-bit integer

    I left out the long -> string conversion. If python adds a bytes() type, then that conversion could be done as bytes(long). This patch has docs and tests.

    ef6b2a61-f027-4805-a66a-cde4eee277c3 commented 20 years ago

    Logged In: YES user_id=341410

    As an aside, I've requested similar functionality be added to the struct module, which handles signed, unsigned, big-endian, little-endian, and integers of arbitrary length. The request is available here: https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1023290&group_id=5470

    Would you prefer two functions that live as methods of ints and longs, or would you prefer the functionality be placed inside of the struct module (which already does numeric packing and upacking), and had its own type code?

    bbfdc1bb-cfb1-4936-9d28-559fa9fd3b52 commented 20 years ago

    Logged In: YES user_id=973611

    Thanks for the pointer. I'm actually not smitten with any of the approaches for long\<->byte-string conversion (mine included). Now that I think about it, the best interface might be if Python had a 'bytes' type, then both conversions could be done by constructors: long(bytes) bytes(long)

    If Python is going to grow such a type in the future, maybe we can defer this question, and live with the hexlifying and whatnot, till then.

    91e69f45-91d9-4b12-87db-a02908296c81 commented 20 years ago

    Logged In: YES user_id=72053

    There shouldn't be a bytes type. Those are just strings. Instead I think it's better to add a conversion function to the strings module or the binascii module. The hex thing is a crock and should be exterminated.

    ef6b2a61-f027-4805-a66a-cde4eee277c3 commented 20 years ago

    Logged In: YES user_id=341410

    Paul, it seems as though there exists a Python function in pickle that does conversions: encode_long() and decode_long(), though depending on the size of your longs, a custom version may be faster (I wrote one that doesn't use hex, and manages to be 20% faster on integers less than 32 bits).

    It also appears that Raymond is going to be placing such a pair of functions in binascii.

    I'm curious as to whether you prefer a variable-width return as provided by pickle.encode_long, or a fixed-width return as provided by the struct method I propose.

    tiran commented 16 years ago

    Here is another integer related optimization patch from Trev. I've no opinion on the matter but you might be interested in it as well.

    pitrou commented 16 years ago

    Unless I'm mistaken, the patch provides only half of the solution: the stringToLong part, but not longToString. I agree this feature is interesting, not for optimization but becomes it avoids using clunky ways (long -> hex -> bin) to implement something simple. However, perhaps making it part of either the binascii or the struct module would allow more formatting choices (e.g. big-endian or little-endian, fixed-length or not). For example in cryptography you would want a byte string of a fixed size even if your 512-bit number happens to have its 8 most significant bits set to zero.

    mdickinson commented 16 years ago

    It seems to me that this issue is almost entirely subsumed by issue bpo-1023290. I'm quite tempted to close this and direct further discussion there---personally, I'd support Josiah's proposed struct addition.

    Paul: if you're still listening after all this time, is it true that between them, Josiah's proposal and pickle.{encode_long, decode_long} would cover your needs?

    mdickinson commented 16 years ago

    Closing this due to little-or-no activity for over three years. Antoine's use case seems to me to be covered by issue bpo-1023290.

    Trevor, please speak up if you want to keep this alive.

    mdickinson commented 16 years ago

    pending -> closed.