Open brifordwylie opened 3 years ago
To test without imports in __init__
:
print_packets.py
script with two modifications to this line:
import dpkt.ethernet, dpkt.ip, dpkt.pcap
Make sure dpkt/__pycache__
is deleted before each test run. Then I ran:
time python print_packets.py
Time with imports (first run):
real 0m0.422s
user 0m0.000s
sys 0m0.000s
Time with imports (__pycache__
):
real 0m0.244s
user 0m0.000s
sys 0m0.015s
Time without the imports (first run):
real 0m0.195s
user 0m0.000s
sys 0m0.015s
Time without imports (__pycache__
):
real 0m0.196s
user 0m0.000s
sys 0m0.000s
I should probably run them a 100 times and then avg, but i did it a few times and the results were ~ the same.
IMO - There is a little speed up, that might be significant for someone processing pcaps all day, everyday.
But the change would break existing scripts because statements like:
import dpkt
would have to be updated to explicitly import all submodules the scripts needs (Ex: import dpkt.ethernet
etc)
@kbandla agreed the changes (depending on what we decide to do) might break existing scripts, so we'll have to proceed with caution. So, for me, 2) and 3) are the most annoying.. so perhaps we can leave 1) to avoid script breakage. Although as part of this process we should definitely review exactly what __init__.py
is doing.. for instance.. I have no idea what this is doing :p
# Note: list() is used to get a copy of the dict in order to avoid
# "RuntimeError: dictionary changed size during iteration"
# exception in Python 3 caused by _mod_init() funcs that load another modules
for name, mod in list(sys.modules.items()):
if name.startswith('dpkt.') and hasattr(mod, '_mod_init'):
mod._mod_init()
@obormot you know what this is doing?
@obormot you know what this is doing?
Yeah _mod_init()
s get called after all dpkt protocol modules have initialized (ie. got imported into the global namespace I guess is another way to put it). Then things like Ethernet._typesw
can be propagated so automatic decoding works with lower network layers being aware of higher layers and vice versa.
Ah, okay.. so that's another good reason to leave __init__.py
as it is... okay.. so for 1) we'll just skip. That leave 2) and 3)... I'll try to play around with those a bit and get a sense for what changes those might look like.
So just to add some context, while I don't know the most kosher way of dealing with relative imports, I've tried applying the following trick to several modules:
if __package__:
from . import dpkt
from .compat import iteritems
else:
import dpkt
from compat import iteritems
This allows running module-level unit tests individually, like
$ python stp.py
Tests Successful...
However, tests that depend on the other module initializations (discussed above) do fail in this case.
$ python ethernet.py
Traceback (most recent call last):
File "ethernet.py", line 823, in <module>
test_eth()
File "ethernet.py", line 426, in test_eth
assert isinstance(eth.data, ip6.IP6)
AssertionError
Perhaps this could be fixed by adding something like from dpkt.all import *
that will run all inits (within the unit tests that require the full stack initialization), where all.py
will replace the current dpkt/__init__.py
. Anyway, these are just some ideas.
I think the _mod_init
functions result in an overly complex initialization chain for dpkt.
If we moved to static mapping dictionaries between protocol numbers and their handler classes, there would be no need for the module initialization. This would involve hard-coding the mappings, but I think may lead to more easily understandable code.
Once python2 is dropped (#551), we can have a much cleaner implementation of how all of this works. I've made a quick first pass in https://github.com/kbandla/dpkt/commits/feature/absolute_imports. If python 2.7
is dropped from the test environments, all tests pass.
Combining these two things would mean we could have a completely empty __init__.py
.
One area where speeding up dpkt init is as part of use in a serverless environment (like an AWS Lambda).
Looping in (@obormot @kbandla )
This is one of those areas that was setup before I became involved, and I never really understood the current setup.
Also, I have to admit, my knowledge/experience with 'meta programming' is basically nothing. The Packet(_MetaPacket) code in dpkt.py kinda blows my mind. It's obviously super interesting/powerful, but I would have to stare at it for a LONG time to really understand what it's doing....
@crocogorical what's your understanding of Python Meta Programming? Do you feel like you have a good understanding/knowledge of the Packet() and _MetaPacket() classes?
Also, just as a logistical concern, lets get a few more short term items knocked off the list (I'll do the 1.9.5 release).. and put this ticket/issue perhaps as a goal for 2.0 release.
I've been looking at the MetaPacket setup, and it took me a while to understand it, but I think I'm there. There are some improvements that we can make for efficiency too.
My understanding is that when a class is initialized (the class, not an instance of it, so during import of dpkt), say Ethernet, we follow the inheritance chain to Packet, which then does some stuff with the _MetaPacket. This becomes confusing because _MetaPacket is never initialized.
_MetaPacket receives 3x arguments: clsname (Ethernet
), clsbases (the classes that this new class will inherit from, dpkt.Packet), and clsdict (all of the existing class attributes which we have defined). I've attached a file from some of my debugging efforts which inserts a print statement just after the __new__
definition in _MetaPacket showing the creation of Ethernet with clsname, clsbases, clsdict. It's from a branch, so has a couple more attributes in it) - file.txt
During the execution of __new__
in _MetaPacket, there are two important lines: Line 1 and Line 2. Both lines appear to do the same thing. This is the call to the stdlib function which actually performs the class creation. Between the two lines, we update clsdict if there is a __hdr__
element, and then init again. This code appears to be an attempt to extend the use of __slots__
down to the objects (which would be a good thing). Unfortunately this implementation does not appear to be complete. For __slots__
to truly work, you must maintain an unbroken chain of classes which implement __slots__
, all the way from object
down to the final class. If a class implements __slots__
, but its superclass does not, then it does nothing.
I had a play with trying to extend this further, with some success, but the __slots__
definitions in the subclasses become very long for classes like Ethernet
, which names attributes based on the next layer this line, where if the next layer is IP
, we set the .ip
attribute, as well as .data
. I've never used those attributes, and all of the example code I have seen appears to use .data
. Perhaps they could be removed? This would obviously be a breaking change for anyone making use of them, so we would probably want to introduce it in a major version release (like 2.0) if going that way.
With _MetaPacket, we produce (in my opinion), a slightly confusing hierarchy, as _MetaPacket is never actually initialized. Instead the inheritance chain is from the arguments that Packet passes, which "Temp" is the named class. _MetaPacket is a class factory. This results in a method resolution chain of:
>>> from dpkt.ethernet import Ethernet
>>> Ethernet.mro()
[<class 'dpkt.ethernet.Ethernet'>, <class 'dpkt.dpkt.Packet'>, <class 'dpkt.dpkt.Temp'>, <class 'object'>]
We should probably change Temp to something like BasePacket just for readability. I have a theory that _MetaPacket could probably be incorporated into Packet, but haven't explored that further.
The current dpkt package seems oddly setup IMO. Issues I see/struggle with...
1) The __init__.py pulls in the universe, I like the simplicity of having an empty __init__.py that does nothiing Note: After folks have reviewed/talked about it looks like for a couple reasons we'll just leave the
__init__.py
as is.2) The dpkt.py file thats in dpkt/dpkt/dpkt.py means there's a naming conflict when trying to import the dpkt package 3) The code is full of relative imports which means you can't just run a file/test with
$ python <blah.py>
Would be nice if @kbandla, @obormot and I could maybe meet up and discuss how to go forward here, this feels like a fairly big issue IMO but we'd need to talk/agree/team up to make the changes.
Resources: