kbandla / dpkt

fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
Other
1.09k stars 270 forks source link

Review relative imports #501

Open brifordwylie opened 3 years ago

brifordwylie commented 3 years ago

The current dpkt package seems oddly setup IMO. Issues I see/struggle with...

1) The __init__.py pulls in the universe, I like the simplicity of having an empty __init__.py that does nothiing Note: After folks have reviewed/talked about it looks like for a couple reasons we'll just leave the __init__.py as is.

2) The dpkt.py file thats in dpkt/dpkt/dpkt.py means there's a naming conflict when trying to import the dpkt package 3) The code is full of relative imports which means you can't just run a file/test with $ python <blah.py>

Would be nice if @kbandla, @obormot and I could maybe meet up and discuss how to go forward here, this feels like a fairly big issue IMO but we'd need to talk/agree/team up to make the changes.

Resources:

kbandla commented 3 years ago

Regarding # 1 (init pulling in the universe)

To test without imports in __init__:

Results

Make sure dpkt/__pycache__ is deleted before each test run. Then I ran:

time python print_packets.py

Time with imports (first run):

real    0m0.422s
user    0m0.000s
sys     0m0.000s

Time with imports (__pycache__):

real    0m0.244s
user    0m0.000s
sys     0m0.015s

Time without the imports (first run):

real    0m0.195s
user    0m0.000s
sys     0m0.015s

Time without imports (__pycache__):

real    0m0.196s
user    0m0.000s
sys     0m0.000s

I should probably run them a 100 times and then avg, but i did it a few times and the results were ~ the same.

IMO - There is a little speed up, that might be significant for someone processing pcaps all day, everyday. But the change would break existing scripts because statements like: import dpkt would have to be updated to explicitly import all submodules the scripts needs (Ex: import dpkt.ethernet etc)

brifordwylie commented 3 years ago

@kbandla agreed the changes (depending on what we decide to do) might break existing scripts, so we'll have to proceed with caution. So, for me, 2) and 3) are the most annoying.. so perhaps we can leave 1) to avoid script breakage. Although as part of this process we should definitely review exactly what __init__.py is doing.. for instance.. I have no idea what this is doing :p

# Note: list() is used to get a copy of the dict in order to avoid
# "RuntimeError: dictionary changed size during iteration"
# exception in Python 3 caused by _mod_init() funcs that load another modules
for name, mod in list(sys.modules.items()):
    if name.startswith('dpkt.') and hasattr(mod, '_mod_init'):
        mod._mod_init()

@obormot you know what this is doing?

obormot commented 3 years ago

@obormot you know what this is doing?

Yeah _mod_init()s get called after all dpkt protocol modules have initialized (ie. got imported into the global namespace I guess is another way to put it). Then things like Ethernet._typesw can be propagated so automatic decoding works with lower network layers being aware of higher layers and vice versa.

brifordwylie commented 3 years ago

Ah, okay.. so that's another good reason to leave __init__.py as it is... okay.. so for 1) we'll just skip. That leave 2) and 3)... I'll try to play around with those a bit and get a sense for what changes those might look like.

obormot commented 3 years ago

So just to add some context, while I don't know the most kosher way of dealing with relative imports, I've tried applying the following trick to several modules:

if __package__:
    from . import dpkt
    from .compat import iteritems
else:
    import dpkt
    from compat import iteritems

This allows running module-level unit tests individually, like

$ python stp.py
Tests Successful...

However, tests that depend on the other module initializations (discussed above) do fail in this case.

$ python ethernet.py
Traceback (most recent call last):
  File "ethernet.py", line 823, in <module>
    test_eth()
  File "ethernet.py", line 426, in test_eth
    assert isinstance(eth.data, ip6.IP6)
AssertionError

Perhaps this could be fixed by adding something like from dpkt.all import * that will run all inits (within the unit tests that require the full stack initialization), where all.py will replace the current dpkt/__init__.py. Anyway, these are just some ideas.

crocogorical commented 3 years ago

I think the _mod_init functions result in an overly complex initialization chain for dpkt.

If we moved to static mapping dictionaries between protocol numbers and their handler classes, there would be no need for the module initialization. This would involve hard-coding the mappings, but I think may lead to more easily understandable code.

Once python2 is dropped (#551), we can have a much cleaner implementation of how all of this works. I've made a quick first pass in https://github.com/kbandla/dpkt/commits/feature/absolute_imports. If python 2.7 is dropped from the test environments, all tests pass.

Combining these two things would mean we could have a completely empty __init__.py.

One area where speeding up dpkt init is as part of use in a serverless environment (like an AWS Lambda).

brifordwylie commented 3 years ago

Looping in (@obormot @kbandla )

This is one of those areas that was setup before I became involved, and I never really understood the current setup.

Also, I have to admit, my knowledge/experience with 'meta programming' is basically nothing. The Packet(_MetaPacket) code in dpkt.py kinda blows my mind. It's obviously super interesting/powerful, but I would have to stare at it for a LONG time to really understand what it's doing....

@crocogorical what's your understanding of Python Meta Programming? Do you feel like you have a good understanding/knowledge of the Packet() and _MetaPacket() classes?

Also, just as a logistical concern, lets get a few more short term items knocked off the list (I'll do the 1.9.5 release).. and put this ticket/issue perhaps as a goal for 2.0 release.

crocogorical commented 3 years ago

I've been looking at the MetaPacket setup, and it took me a while to understand it, but I think I'm there. There are some improvements that we can make for efficiency too.

My understanding is that when a class is initialized (the class, not an instance of it, so during import of dpkt), say Ethernet, we follow the inheritance chain to Packet, which then does some stuff with the _MetaPacket. This becomes confusing because _MetaPacket is never initialized.

_MetaPacket receives 3x arguments: clsname (Ethernet), clsbases (the classes that this new class will inherit from, dpkt.Packet), and clsdict (all of the existing class attributes which we have defined). I've attached a file from some of my debugging efforts which inserts a print statement just after the __new__ definition in _MetaPacket showing the creation of Ethernet with clsname, clsbases, clsdict. It's from a branch, so has a couple more attributes in it) - file.txt

During the execution of __new__ in _MetaPacket, there are two important lines: Line 1 and Line 2. Both lines appear to do the same thing. This is the call to the stdlib function which actually performs the class creation. Between the two lines, we update clsdict if there is a __hdr__ element, and then init again. This code appears to be an attempt to extend the use of __slots__ down to the objects (which would be a good thing). Unfortunately this implementation does not appear to be complete. For __slots__ to truly work, you must maintain an unbroken chain of classes which implement __slots__, all the way from object down to the final class. If a class implements __slots__, but its superclass does not, then it does nothing.

I had a play with trying to extend this further, with some success, but the __slots__ definitions in the subclasses become very long for classes like Ethernet, which names attributes based on the next layer this line, where if the next layer is IP, we set the .ip attribute, as well as .data. I've never used those attributes, and all of the example code I have seen appears to use .data. Perhaps they could be removed? This would obviously be a breaking change for anyone making use of them, so we would probably want to introduce it in a major version release (like 2.0) if going that way.

With _MetaPacket, we produce (in my opinion), a slightly confusing hierarchy, as _MetaPacket is never actually initialized. Instead the inheritance chain is from the arguments that Packet passes, which "Temp" is the named class. _MetaPacket is a class factory. This results in a method resolution chain of:

>>> from dpkt.ethernet import Ethernet
>>> Ethernet.mro()
[<class 'dpkt.ethernet.Ethernet'>, <class 'dpkt.dpkt.Packet'>, <class 'dpkt.dpkt.Temp'>, <class 'object'>]

We should probably change Temp to something like BasePacket just for readability. I have a theory that _MetaPacket could probably be incorporated into Packet, but haven't explored that further.