nskinkel / oppy

A Tor client (onion proxy) implementation in Python
BSD 3-Clause "New" or "Revised" License
24 stars 3 forks source link

Clean up cell code #51

Closed nskinkel closed 9 years ago

nskinkel commented 9 years ago

Code for cells is getting unwieldy and confusing. We should really clean this up soon as it will lead to lots of errors down the road if we're doing subtle things wrong and/or it's hard to use. Additionally, representing cells should be one of the easiest things we're doing, so this should be pretty simple.

dwtj commented 9 years ago

As of commit 63251f1, the cell code is now much better, however it still has its warts. The two main improvements are

  1. the use of namedtuples, and
  2. the simplified cell helper functions in cell.util.

Also, I don't think that our experience demonstrates that cells are not necessarily "easy" or "simple". Representing them some way isn't the problem. Rather, the processes correctly building and parsing them and putting them to use are what is/was a challenge.

I think that this issue no longer requires "bug" status. However, I think that we can keep it open as an "enhancement" for ideas about how to make further improvements.

dwtj commented 9 years ago

After working with the code over the weekend, here are some of my thoughts on ways to improve the representation in the long term (i.e. over christmas break).

It was a mistake to make cells represented by a single "flat" namedtuple rather than a namedtuple that points to a few namedtuples and maybe some other items. Currently, all of the cells are basically a single flat struct. For example, a RelayExtend2Cell has the following cell fields.

In [1]: import oppy.cell.relayextend2cell as re2cell
In [2]: re2cell.RELAY_EXTEND2_CELL_FIELDS
Out[2]:
('circuit_id',
 'cmd',
 'payload_len',
 'payload',
 'relay_cmd',
 'recognized',
 'stream_id',
 'digest',
 'rpayload_len',
 'rpayload',
 'lspecs',
 'htype',
 'hdata')

This can really cause problems if we find that we need to change the representation of a cell header or a cell relay header. One such change that is on the (after-semester) horizon is the addition of a field to indicated the size of the cell's circuit_id (i.e. 2 bytes or 4 bytes).

After working with them, this representation is really rather brittle and it causes a lot of code duplication. I propose that this information be revised into a two-level heirarchy. Here is what one of the most complicated examples, RelayExtend2Cell might look like:

- my_relay_extend2_cell

    - bytes  # The bytes representation of the whole cell.

    - header
        - bytes    # The bytes/str representation of the header.
        - circ_id
        - circ_id_size
        - cmd
        - payload_len  # To indicate that this is a var-length cell this could
                       #   either be set to `None` (like now) or just not be there
                       #   at all, and a helper function can check `hasattr` to
                       #   find out whether a function has this field or not.
    - payload
        - bytes        # The bytes/str representation of just the payload.

    - rheader
        - bytes        # The bytes/str representation of just the `rheader`.
        - relay_cmd
        - recognized
        - stream_id
        - digest
        - rpayload_len

    - rpayload
        - bytes   # The bytes/str representation of just the `rpayload`
        - lspecs  # These are cell-specific fields. They should be *useful*
        - htype   #   things which have been parsed, like python lists of
        - hdata   #   `ipaddress` objects.

It may be that we don't need to have the duplicate bytes representations. Having the bytes wouldn't be so bad if the sub-field bytes were actually references to slices of the whole cell's bytes (i.e. no data duplication, just different interfaces/view of the data).

A much simpler example such as NetInfoCell would look like:

- my_net_info_cell
    - header
        - bytes
        - circ_id
        - circ_id_size
        - cmd
        - payload_len
    - payload
        - bytes
        - dest_addr
        - src_addrs
        - timestamp

I think that making this change will make the code

dwtj commented 9 years ago

I've found a couple of tools that we might use to easily implement packing/unpacking code declaratively:

I haven't looked into the specifics of either of these yet, but it seems like Google's Protocol Buffer would be the better choice because of its track-record of robustness and performance. Furthermore, if we implement the cells in the .proto format, other projects (i.e. non-python projects) can use them. In fact, the .proto files might already have been implemented by someone else.

Construct2 on the other hand might be slightly more pythonic and maybe more flexible. However, I expect that the performance won't be as good as Protocol Buffers.

dwtj commented 9 years ago

I suggest that we start the potential cell-revision process by looking at Protocol Buffers to see if they will suit our needs. (Fingers crossed, the revision might be easy and remove a ton of our project's complexity.)

nskinkel commented 9 years ago

done as of merge 35ecfcc6bb