psss / did

What did you do last week, month, year?
https://did.readthedocs.io/
GNU General Public License v2.0
245 stars 103 forks source link

Create a Public Inbox Plugin #329

Closed mripard closed 2 months ago

mripard commented 9 months ago

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.

This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:

Thanks!

sandrobonazzola commented 9 months ago

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

That's nice

This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.

This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?

I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.

  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Makes sense to me using the markdown format for adding links.

psss commented 8 months ago

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

That's nice

Yeah, thanks for working on this improvement!

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?

So far it seems no did plugin uses caching. I'd say we could start simple: Either create the first proof-of-concept without caching at all or implement simple cache in memory (just store and index all threads which have been already fetched before).

  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?

I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.

Agreed: Choosing a narrow since/until interval on a public public-inbox instance should be completely sufficient I'd say.

  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Makes sense to me using the markdown format for adding links.

Sounds good to me to provide the rich output including links in the markdown format and use something concise (just the subject?) for the text output.

mripard commented 6 months ago

I've just updated the branch taking your comments into account:

mripard commented 6 months ago

Thanks much for the changes! Looks very good. I'm proposing just a couple of rather minor adjustments in 339d788. Could you please have a look if the changes are ok? Included is the --verbose mode for showing the full url of the message. I would like to keep this formatting (e.g. one-line per item by default) consistent across all plugins.

It looks great to me, thanks!

stefano-garzarella commented 5 months ago

@mripard thanks for this PR. Really useful!

I used this plugin for a while without issues, but I just hit one today while running did --since 2024-01-08 --until 2024-01-14.

My configuration is:

[general]
email = "Stefano Garzarella" <sgarzare@redhat.com>
width = 160
plugins = ~/.did/plugins

[ml]
type = public_inbox
url = https://lore.kernel.org

...

For now I have public_inbox.py as an external plugin in ~/.did/plugins, but it is a link to the file contained in this PR.

This is the error:

$ ~/repos/did/bin/did --since 2024-01-08 --until 2024-01-14 > W02.md
Traceback (most recent call last):
  File "/home/stefano/.did/plugins/public_inbox.py", line 110, in __get_mbox_from_content
    content = gzip.decompress(content)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/gzip.py", line 627, in decompress
    if _read_gzip_header(fp) is None:
       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/gzip.py", line 456, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'No')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stefano/repos/did/bin/did", line 42, in <module>
    did.cli.main()
  File "/usr/lib/python3.12/site-packages/did/cli.py", line 238, in main
    user_stats.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
    self.fetch()
  File "/home/stefano/.did/plugins/public_inbox.py", line 225, in fetch
    self.stats = [
                 ^
  File "/home/stefano/.did/plugins/public_inbox.py", line 190, in get_all_threads
    self.threads_cache[(since, until)] = self.__fetch_all_threads(since, until)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefano/.did/plugins/public_inbox.py", line 185, in __fetch_all_threads
    mbox = self.__get_mbox_from_content(resp.content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefano/.did/plugins/public_inbox.py", line 111, in __get_mbox_from_content
    except BadGzipFile:
           ^^^^^^^^^^^
NameError: name 'BadGzipFile' is not defined

As rapid fix, I applied the following patch, but I'm not sure if it is the right thing to do:

diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index ec8e333..d3df993 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -106,7 +106,10 @@ class PublicInbox(object):
                 item(self._get_message_url(msg), level=2, options=opt)

     def __get_mbox_from_content(self, content: bytes) -> mailbox.mbox:
-        content = gzip.decompress(content)
+        try:
+            content = gzip.decompress(content)
+        except gzip.BadGzipFile:
+            return {}

         with tempfile.NamedTemporaryFile() as tmp:
             tmp.write(content)
mripard commented 5 months ago

@stefano-garzarella Hi, I'm glad it's helpful. I've pushed a new version that should fix it, plus a unit test to cover the failing case (we were ignoring the HTTP request status code and assuming the result was a GZIP archive).

I've also squashed the patch @psss did earlier.

lukaszachy commented 5 months ago

/packit test

mripard commented 4 months ago

Hi, is there anything I can help with to get this merged?

psss commented 4 months ago

Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:

    def __get_thread_root(self, msg: Message) -> Message:
        log.debug("Looking for thread root of message %s" % msg.id())
        if msg.is_thread_root():
            log.debug("Message is thread root already. Returning.")
            return msg

        parent_id = msg.parent_id()
        if parent_id not in self.messages_cache:
            root = self.__fetch_thread_root(msg)
            log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
            return root

        while True:
            log.debug("Parent is %s" % parent_id)
>           assert parent_id in self.messages_cache
E           AssertionError

../../did/plugins/public-inbox.py:158: AssertionError

Could you please have a look at it?

Also, I'm not sure it's a good idea to call the plugin file public-inbox.py as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?

mripard commented 4 months ago

Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:

    def __get_thread_root(self, msg: Message) -> Message:
        log.debug("Looking for thread root of message %s" % msg.id())
        if msg.is_thread_root():
            log.debug("Message is thread root already. Returning.")
            return msg

        parent_id = msg.parent_id()
        if parent_id not in self.messages_cache:
            root = self.__fetch_thread_root(msg)
            log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
            return root

        while True:
            log.debug("Parent is %s" % parent_id)
>           assert parent_id in self.messages_cache
E           AssertionError

../../did/plugins/public-inbox.py:158: AssertionError

Could you please have a look at it?

I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)

Also, I'm not sure it's a good idea to call the plugin file public-inbox.py as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?

I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?

psss commented 4 months ago

I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)

:)

I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?

It would be just slightly modifying the way how plugins are imported. The relevant code should be around the load_components() function:

https://github.com/psss/did/blob/070a535d9250200ddfb8c43f9ae81f999fe51492/did/utils.py#L115

mripard commented 4 months ago

I just pushed a new version with public_inbox.py, and the bug you mentioned fixed with a unit test to cover the failing condition

mripard commented 2 months ago

Is there anything else I can do to help get this merged?

mripard commented 2 months ago

Thanks for the suggestions, I've added your suggestions or reworked the code to address your concerns.

psss commented 2 months ago

Thanks for addressing the comments! Now looks good. I've just modified the first commit summary and removed the prefix, see the recommendations for some more context.

stefano-garzarella commented 1 month ago

@mripard I just received this error today:

did --format markdown --since 2024-05-06 --until 2024-06-12
...
Traceback (most recent call last):
  File "/usr/bin/did", line 42, in <module>
    did.cli.main()
  File "/usr/lib/python3.12/site-packages/did/cli.py", line 239, in main
    user_stats.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
    self.fetch()
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 231, in fetch
    self.stats = [
                 ^
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 207, in get_all_threads
    root = self.__get_thread_root(msg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 153, in __get_thread_root
    log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
                                                        ^^^^^^^
AttributeError: 'NoneType' object has no attribute 'id'

Any idea what is wrong?

mripard commented 1 month ago

Sorry for the late answer.

I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?

stefano-garzarella commented 1 month ago

Sorry for the late answer.

I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?

Sure, my config contains:

[general]
email = "Stefano Garzarella" <sgarzare@redhat.com>
width = 160

[ml]
type = public-inbox
url = https://lore.kernel.org
...

I had the issue running did --format markdown --since 2024-05-06 --until 2024-06-12.

For now, I'm using the following workaround:

diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index 888f41c..70e1233 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -150,6 +150,9 @@ class PublicInbox(object):
         parent_id = msg.parent_id()
         if parent_id not in self.messages_cache:
             root = self.__fetch_thread_root(msg)
+            if root is None:
+                return None
+
             log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
             return root

@@ -164,6 +167,9 @@ class PublicInbox(object):
             parent_id = parent.parent_id()
             if parent_id not in self.messages_cache:
                 root = self.__fetch_thread_root(msg)
+                if root is None:
+                    return None
+
                 log.debug(
                     "Found root message %s for message %s" %
                     (root.id(), msg.id()))
@@ -205,6 +211,9 @@ class PublicInbox(object):

             if not msg.is_thread_root():
                 root = self.__get_thread_root(msg)
+                if root is None:
+                    continue
+
                 root_id = root.id()
                 if root_id in found:
                     log.debug("Root message already encountered... Skip.")