Open nhairs opened 3 months ago
Hi @rodpayne,
Given your PR, I was wondering if you'd be interested in testing this fork as it has a new implementation that /might/ improve efficiency with your use case (but also might not).
I'll note that this fork is probably best considered as alpha so if you'd like to wait till there's some more documentation that's also fine.
The one-day sample has 428 mail messages with 513,330 reports from 4/10/2024.
I was also wondering if you'd be willing to share some information about how you end up with such a low message count but very high report count.
I gave it a try and haven't yet figured out how to build parsedmarc
. My first try, I got this:
=> ERROR [parsedmarc 3/8] COPY parsedmarc/ parsedmarc/ 0.0s
------
> [parsedmarc 3/8] COPY parsedmarc/ parsedmarc/:
------
failed to solve: failed to compute cache key: failed to calculate checksum of ref 8097a7e0-404a-4949-80d6-becf106fb1f6::s7nxl91kini1d904q5umf84dg: "/parsedmarc": not found
I updated the Dockerfile
to have COPY src/parsedmarc/ parsedmarc/
instead of COPY parsedmarc/ parsedmarc/
, getting a little further, and then got:
> [parsedmarc 7/8] RUN hatch build:
0.495 ──────────────────────────────────── sdist ─────────────────────────────────────
0.500 Setting up build environment
2.800 Traceback (most recent call last):
2.800 File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2.800 return _run_code(code, main_globals, None,
2.800 File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
2.800 exec(code, run_globals)
2.800 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/__main__.py", line 6, in <module>
2.800 sys.exit(hatchling())
2.800 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/cli/__init__.py", line 26, in hatchling
2.800 command(**kwargs)
2.800 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/cli/build/__init__.py", line 82, in build_impl
2.800 for artifact in builder.build(
2.800 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/builders/plugin/interface.py", line 90, in build
2.801 self.metadata.validate_fields()
2.801 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 265, in validate_fields
2.801 _ = self.version
2.801 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 149, in version
2.801 self._version = self._get_version()
2.801 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 248, in _get_version
2.801 version = self.hatch.version.cached
2.801 File "/root/.local/share/hatch/env/virtual/parsedmarc/9TtSrW0h/parsedmarc-build/lib/python3.9/site-packages/hatchling/metadata/core.py", line 1466, in cached
2.802 raise type(e)(message) from None
2.802 OSError: Error getting the version from source `regex`: file does not exist: src/parsedmarc/__init__.py
------
failed to solve: process "/bin/sh -c hatch build" did not complete successfully: exit code: 1
As you've noticed the repository layout has changed parsedmarc -> src/parsedmarc
, with the pyproject.toml
updated to reflect the new structure.
I suspect that you should instead use this copy: COPY src/ src/
Though also I'll note, I've not been using docker/hatch to develop, I've just been using a virtual environment + pip install -e '.[dev]'
Thank you, that change worked. I am running a test with a day's worth of input emails.
While running for 11 hours so far, it has restarted from the beginning three times, with logs like the following:
2024-08-29 22:21:06 DEBUG:__init__.py:426:Processing message 56 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONSAAA=
2024-08-29 22:21:24 ERROR:cli.py:1022:Mailbox Error
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
2024-08-29 22:21:24 conn = connection.create_connection(
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
2024-08-29 22:21:24 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
2024-08-29 22:21:24 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
2024-08-29 22:21:24 socket.gaierror: [Errno -3] Temporary failure in name resolution
2024-08-29 22:21:24
2024-08-29 22:21:24 During handling of the above exception, another exception occurred:
2024-08-29 22:21:24
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 716, in urlopen
2024-08-29 22:21:24 httplib_response = self._make_request(
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
2024-08-29 22:21:24 self._validate_conn(conn)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1061, in _validate_conn
2024-08-29 22:21:24 conn.connect()
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 363, in connect
2024-08-29 22:21:24 self.sock = conn = self._new_conn()
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
2024-08-29 22:21:24 raise NewConnectionError(
2024-08-29 22:21:24 urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f4b2c0e1fa0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
2024-08-29 22:21:24
2024-08-29 22:21:24 During handling of the above exception, another exception occurred:
2024-08-29 22:21:24
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 667, in send
2024-08-29 22:21:24 resp = conn.urlopen(
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 802, in urlopen
2024-08-29 22:21:24 retries = retries.increment(
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 594, in increment
2024-08-29 22:21:24 raise MaxRetryError(_pool, url, error or ResponseError(cause))
2024-08-29 22:21:24 urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='graph.microsoft.com', port=443): Max retries exceeded with url: /v1.0/users/dmarc-rua@byu.edu/messages/AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONSAAA=/$value (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4b2c0e1fa0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
2024-08-29 22:21:24
2024-08-29 22:21:24 During handling of the above exception, another exception occurred:
2024-08-29 22:21:24
2024-08-29 22:21:24 Traceback (most recent call last):
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/parsedmarc/cli.py", line 1005, in _main
2024-08-29 22:21:24 reports = get_dmarc_reports_from_mailbox(
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/parsedmarc/__init__.py", line 427, in get_dmarc_reports_from_mailbox
2024-08-29 22:21:24 msg_content = connection.fetch_message(msg_uid)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/parsedmarc/mail/graph.py", line 220, in fetch_message
2024-08-29 22:21:24 result = self._client.get(url)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/_graph_client.py", line 46, in wrapper
2024-08-29 22:21:24 return func(*args, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/_graph_client.py", line 93, in get
2024-08-29 22:21:24 return self.graph_session.get(self._graph_url(url), **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 602, in get
2024-08-29 22:21:24 return self.request("GET", url, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
2024-08-29 22:21:24 resp = self.send(prep, **send_kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
2024-08-29 22:21:24 r = adapter.send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 44, in send
2024-08-29 22:21:24 return self._first_middleware.send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/authorization.py", line 23, in send
2024-08-29 22:21:24 response = super().send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 66, in send
2024-08-29 22:21:24 return self.next.send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/retry.py", line 118, in send
2024-08-29 22:21:24 response = super().send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 66, in send
2024-08-29 22:21:24 return self.next.send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/telemetry.py", line 22, in send
2024-08-29 22:21:24 response = super().send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/msgraph/core/middleware/middleware.py", line 65, in send
2024-08-29 22:21:24 return super().send(request, **kwargs)
2024-08-29 22:21:24 File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 700, in send
2024-08-29 22:21:24 raise ConnectionError(e, request=request)
2024-08-29 22:21:24 requests.exceptions.ConnectionError: HTTPSConnectionPool(host='graph.microsoft.com', port=443): Max retries exceeded with url: /v1.0/users/dmarc-rua@byu.edu/messages/AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONSAAA=/$value (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4b2c0e1fa0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
2024-08-29 22:21:25 INFO:cli.py:794:Starting parsedmarc
While troubleshooting, I found it interesting that there are at least a couple of messages that seem to have the same UID:
Restart 16:47:49
2024-08-29 16:50:04 DEBUG:__init__.py:426:Processing message 55 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONRAAA=
2024-08-29 16:57:24 DEBUG:__init__.py:426:Processing message 81 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONrAAA=
Restart 21:08:52
2024-08-29 21:28:51 DEBUG:__init__.py:426:Processing message 55 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONRAAA=
Restart 22:21:25
2024-08-29 22:47:36 DEBUG:__init__.py:426:Processing message 55 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONRAAA=
2024-08-29 23:52:35 DEBUG:__init__.py:426:Processing message 81 of 369: UID AAMkADBlNmE1ZGE4LTZiMmEtNGRjNy1hOTBjLTdlYzYyZDQ3NDk5MgBGAAAAAADuXNUN9RbcS54KBRQSesLiBwBH3iRJzGOnR4SOY_QCRBtfAAT50viWAABH3iRJzGOnR4SOY_QCRBtfAAVbWONrAAA=
I haven't made it all of the way through the benchmarks yet, but I have identified a couple of fixes to make it run faster. I will open a pull request so that you can take a look.
Comparing partial runs, which could have some observation problems, 232 email messages took 11 hours and 29 minutes with the main
branch, and 49 minutes with the rodpayne
branch. It will be at least Tuesday before I am back into it.
Ah, something I should add as well is the new application is parsedmarcd
(note the d
at the end).
The config format has changed as well, you'd want something like the following:
parser:
nameservers:
- 1.1.1.1
- 8.8.8.8
sources:
rods-email:
class: .email:MicrosoftGraph
auth_method: UsernamePassword
client_id: rodsid
clinet_secret: rodssecret
sinks:
elasticsearch:
class: .elasticsearch:Elasticsearch
client:
hosts: localhost:9200
parsedmarcd -c rodsconfig.yml -vv
I think that I am going to have to wait for some documentation on the new config file format. I've gotten lost in the weeds after switching from parsedmarc
to parsedmarcd
.
I think that I am going to have to wait for some documentation on the new config file format. I've gotten lost in the weeds after switching from
parsedmarc
toparsedmarcd
.
Yeah that's fair enough, I'll let you know when we've got better documentation up.
I was getting TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
so I moved from python 3.9 to 3.10.
What is your preferred python version?
That is somewhat surprising.
I've run it on a mix, though I can say the I have it reliably running on 3.11 (it is too late at night for me to look into my other devices).
That said, the tests are running in 3.8 so I would have though it got caught.
It's possible that it's from a file that is missing a from __future__ import annotations
which should prevent most of the type compatibility issues.;
Here is some detail:
2024-09-06 15:27:39 [21:27:39][INFO]: parsedmarcd starting
2024-09-06 15:27:39 [21:27:39][DEBUG]: loading parser
2024-09-06 15:27:39 [21:27:39][DEBUG]: loading sources
2024-09-06 15:27:39 [21:27:39][CRITICAL]: Failed to load source oneday-email
2024-09-06 15:27:39 Traceback (most recent call last):
2024-09-06 15:27:39 File "/usr/local/lib/python3.9/site-packages/pydantic/_internal/_typing_extra.py", line 303, in _eval_type_backport
2024-09-06 15:27:39 return _eval_type(value, globalns, localns, type_params)
2024-09-06 15:27:39 File "/usr/local/lib/python3.9/site-packages/pydantic/_internal/_typing_extra.py", line 332, in _eval_type
2024-09-06 15:27:39 return typing._eval_type( # type: ignore
2024-09-06 15:27:39 File "/usr/local/lib/python3.9/typing.py", line 292, in _eval_type
2024-09-06 15:27:39 return t._evaluate(globalns, localns, recursive_guard)
2024-09-06 15:27:39 File "/usr/local/lib/python3.9/typing.py", line 554, in _evaluate
2024-09-06 15:27:39 eval(self.__forward_code__, globalns, localns),
2024-09-06 15:27:39 File "<string>", line 1, in <module>
2024-09-06 15:27:39 TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
To close the loop on the above thread of comments - this was fixed in https://github.com/nhairs/parsedmarc-fork/pull/30
It would be good for some "heavy" users of parsedmarc to start testing the fork.
Starting this issue as a place to aggregate the data / find out a bit more of the use-cases.