nhairs / parsedmarc-fork

Experimental fork of PraseDMARC
https://nhairs.github.io/parsedmarc-fork/
Apache License 2.0
0 stars 1 forks source link

Performance Testing #2

Open nhairs opened 3 weeks ago

nhairs commented 3 weeks ago

It would be good for some "heavy" users of parsedmarc to start testing the fork.

Starting this issue as a place to aggregate the data / find out a bit more of the use-cases.

nhairs commented 3 weeks ago

Hi @rodpayne,

Given your PR, I was wondering if you'd be interested in testing this fork as it has a new implementation that /might/ improve efficiency with your use case (but also might not).

I'll note that this fork is probably best considered as alpha so if you'd like to wait till there's some more documentation that's also fine.

The one-day sample has 428 mail messages with 513,330 reports from 4/10/2024.

I was also wondering if you'd be willing to share some information about how you end up with such a low message count but very high report count.

rodpayne commented 2 weeks ago

I gave it a try and haven't yet figured out how to build parsedmarc. My first try, I got this:

 => ERROR [parsedmarc 3/8] COPY parsedmarc/ parsedmarc/                                                                                   0.0s
------
 > [parsedmarc 3/8] COPY parsedmarc/ parsedmarc/:
------
failed to solve: failed to compute cache key: failed to calculate checksum of ref 8097a7e0-404a-4949-80d6-becf106fb1f6::s7nxl91kini1d904q5umf84dg: "/parsedmarc": not found

I updated the Dockerfile to have COPY src/parsedmarc/ parsedmarc/ instead of COPY parsedmarc/ parsedmarc/, getting a little further, and then got:

 > [parsedmarc 7/8] RUN hatch build:
0.495 ──────────────────────────────────── sdist ─────────────────────────────────────
0.500 Setting up build environment
2.800 Traceback (most recent call last):
2.800   File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2.800     return _run_code(code, main_globals, None,
2.800   File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
2.800     exec(code, run_globals)
2.800   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/__main__.py", line 6, in <module>
2.800     sys.exit(hatchling())
2.800   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/cli/__init__.py", line 26, in hatchling
2.800     command(**kwargs)
2.800   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/cli/build/__init__.py", line 82, in build_impl
2.800     for artifact in builder.build(
2.800   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/builders/plugin/interface.py", line 90, in build
2.801     self.metadata.validate_fields()
2.801   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 265, in validate_fields
2.801     _ = self.version
2.801   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 149, in version
2.801     self._version = self._get_version()
2.801   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 248, in _get_version
2.801     version = self.hatch.version.cached
2.801   File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 1466, in cached
2.802     raise type(e)(message) from None
2.802 OSError: Error getting the version from source `regex`: file does not exist: src/parsedmarc/__init__.py
------
failed to solve: process "/bin/sh -c hatch build" did not complete successfully: exit code: 1
nhairs commented 2 weeks ago

As you've noticed the repository layout has changed parsedmarc -> src/parsedmarc, with the pyproject.toml updated to reflect the new structure.

I suspect that you should instead use this copy: COPY src/ src/

Though also I'll note, I've not been using docker/hatch to develop, I've just been using a virtual environment + pip install -e '.[dev]'

rodpayne commented 2 weeks ago

Thank you, that change worked. I am running a test with a day's worth of input emails.

While running for 11 hours so far, it has restarted from the beginning three times, with logs like the following:

2024-08-29 22:21:06    DEBUG:__init__.py:426:Processing message 56 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONSAAA=
2024-08-29 22:21:24    ERROR:cli.py:1022:Mailbox Error
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
2024-08-29 22:21:24     conn = connection.create_connection(
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
2024-08-29 22:21:24     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
2024-08-29 22:21:24     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
2024-08-29 22:21:24 socket.gaierror: [Errno -3] Temporary failure in name resolution
2024-08-29 22:21:24 
2024-08-29 22:21:24 During handling of the above exception, another exception occurred:
2024-08-29 22:21:24 
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 716, in urlopen
2024-08-29 22:21:24     httplib_response = self._make_request(
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
2024-08-29 22:21:24     self._validate_conn(conn)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1061, in _validate_conn
2024-08-29 22:21:24     conn.connect()
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 363, in connect
2024-08-29 22:21:24     self.sock = conn = self._new_conn()
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
2024-08-29 22:21:24     raise NewConnectionError(
2024-08-29 22:21:24 urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f4b2c0e1fa0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
2024-08-29 22:21:24 
2024-08-29 22:21:24 During handling of the above exception, another exception occurred:
2024-08-29 22:21:24 
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 667, in send
2024-08-29 22:21:24     resp = conn.urlopen(
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 802, in urlopen
2024-08-29 22:21:24     retries = retries.increment(
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 594, in increment
2024-08-29 22:21:24     raise MaxRetryError(_pool, url, error or ResponseError(cause))
2024-08-29 22:21:24 urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='graph.microsoft.com', port=443): Max retries exceeded with url: /v1.0/users/dmarc-rua@byu.edu/messages/AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONSAAA=/$value (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4b2c0e1fa0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
2024-08-29 22:21:24 
2024-08-29 22:21:24 During handling of the above exception, another exception occurred:
2024-08-29 22:21:24 
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/parsedmarc/cli.py", line 1005, in _main
2024-08-29 22:21:24     reports = get_dmarc_reports_from_mailbox(
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/parsedmarc/__init__.py", line 427, in get_dmarc_reports_from_mailbox
2024-08-29 22:21:24     msg_content = connection.fetch_message(msg_uid)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/parsedmarc/mail/graph.py", line 220, in fetch_message
2024-08-29 22:21:24     result = self._client.get(url)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/_graph_client.py", line 46, in wrapper
2024-08-29 22:21:24     return func(*args, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/_graph_client.py", line 93, in get
2024-08-29 22:21:24     return self.graph_session.get(self._graph_url(url), **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 602, in get
2024-08-29 22:21:24     return self.request("GET", url, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
2024-08-29 22:21:24     resp = self.send(prep, **send_kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
2024-08-29 22:21:24     r = adapter.send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 44, in send
2024-08-29 22:21:24     return self._first_middleware.send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/authorization.py", line 23, in send
2024-08-29 22:21:24     response = super().send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 66, in send
2024-08-29 22:21:24     return self.next.send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/retry.py", line 118, in send
2024-08-29 22:21:24     response = super().send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 66, in send
2024-08-29 22:21:24     return self.next.send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/telemetry.py", line 22, in send
2024-08-29 22:21:24     response = super().send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 65, in send
2024-08-29 22:21:24     return super().send(request, **kwargs)
2024-08-29 22:21:24   File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 700, in send
2024-08-29 22:21:24     raise ConnectionError(e, request=request)
2024-08-29 22:21:24 requests.exceptions.ConnectionError: HTTPSConnectionPool(host='graph.microsoft.com', port=443): Max retries exceeded with url: /v1.0/users/dmarc-rua@byu.edu/messages/AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONSAAA=/$value (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4b2c0e1fa0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
2024-08-29 22:21:25     INFO:cli.py:794:Starting parsedmarc

While troubleshooting, I found it interesting that there are at least a couple of messages that seem to have the same UID:

Restart 16:47:49

2024-08-29 16:50:04    DEBUG:__init__.py:426:Processing message 55 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONRAAA=

2024-08-29 16:57:24    DEBUG:__init__.py:426:Processing message 81 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONrAAA=

Restart 21:08:52

2024-08-29 21:28:51    DEBUG:__init__.py:426:Processing message 55 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONRAAA=

Restart 22:21:25

2024-08-29 22:47:36    DEBUG:__init__.py:426:Processing message 55 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONRAAA=

2024-08-29 23:52:35    DEBUG:__init__.py:426:Processing message 81 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONrAAA=
rodpayne commented 2 weeks ago

I haven't made it all of the way through the benchmarks yet, but I have identified a couple of fixes to make it run faster. I will open a pull request so that you can take a look.

Comparing partial runs, which could have some observation problems, 232 email messages took 11 hours and 29 minutes with the main branch, and 49 minutes with the rodpayne branch. It will be at least Tuesday before I am back into it.

nhairs commented 2 weeks ago

Ah, something I should add as well is the new application is parsedmarcd (note the d at the end).

The config format has changed as well, you'd want something like the following:

parser:
  nameservers:
    - 1.1.1.1
    - 8.8.8.8

sources:
  rods-email:
    class: .email:MicrosoftGraph
    auth_method: UsernamePassword
    client_id: rodsid
    clinet_secret: rodssecret

sinks:
  elasticsearch:
    class: .elasticsearch:Elasticsearch
    client:
      hosts: localhost:9200

parsedmarcd -c rodsconfig.yml -vv

rodpayne commented 1 week ago

I think that I am going to have to wait for some documentation on the new config file format. I've gotten lost in the weeds after switching from parsedmarc to parsedmarcd.

nhairs commented 1 week ago

I think that I am going to have to wait for some documentation on the new config file format. I've gotten lost in the weeds after switching from parsedmarc to parsedmarcd.

Yeah that's fair enough, I'll let you know when we've got better documentation up.

rodpayne commented 1 week ago

I was getting TypeError: unsupported operand type(s) for |: 'type' and 'NoneType' so I moved from python 3.9 to 3.10.

What is your preferred python version?

nhairs commented 1 week ago

That is somewhat surprising.

I've run it on a mix, though I can say the I have it reliably running on 3.11 (it is too late at night for me to look into my other devices).

That said, the tests are running in 3.8 so I would have though it got caught.

It's possible that it's from a file that is missing a from __future__ import annotations which should prevent most of the type compatibility issues.;

rodpayne commented 1 week ago

Here is some detail:

2024-09-06 15:27:39 [21:27:39][INFO]: parsedmarcd starting
2024-09-06 15:27:39 [21:27:39][DEBUG]: loading parser
2024-09-06 15:27:39 [21:27:39][DEBUG]: loading sources
2024-09-06 15:27:39 [21:27:39][CRITICAL]: Failed to load source oneday-email
2024-09-06 15:27:39 Traceback (most recent call last):
2024-09-06 15:27:39   File "/usr/local/lib/python3.9/site-packages/pydantic/_internal/_typing_extra.py", line 303, in _eval_type_backport
2024-09-06 15:27:39     return _eval_type(value, globalns, localns, type_params)
2024-09-06 15:27:39   File "/usr/local/lib/python3.9/site-packages/pydantic/_internal/_typing_extra.py", line 332, in _eval_type
2024-09-06 15:27:39     return typing._eval_type(  # type: ignore
2024-09-06 15:27:39   File "/usr/local/lib/python3.9/typing.py", line 292, in _eval_type
2024-09-06 15:27:39     return t._evaluate(globalns, localns, recursive_guard)
2024-09-06 15:27:39   File "/usr/local/lib/python3.9/typing.py", line 554, in _evaluate
2024-09-06 15:27:39     eval(self.__forward_code__, globalns, localns),
2024-09-06 15:27:39   File "<string>", line 1, in <module>
2024-09-06 15:27:39 TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'