jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

Correspondent strings too long when fetching from email sender name #82

Closed vdcloudcraft closed 3 years ago

vdcloudcraft commented 3 years ago

Hi, I've set up paperless-ng with docker-compose and Postgres. While tinkering with some mail settings, I came across the following traceback in the admin logs:

Rule Consume-GMail: Error while processing mail 174 of account gmail : Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 573, in get_or_create
    return self.get(**kwargs), False
  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 431, in get
    self.model._meta.object_name
documents.models.Correspondent.DoesNotExist: Correspondent matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.StringDataRightTruncation: value too long for type character varying(50)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_mail/mail.py", line 201, in handle_mail_account
    processed_files = self.handle_message(message, rule)
  File "/usr/src/paperless/src/paperless_mail/mail.py", line 245, in handle_message
    correspondent = get_correspondent(message, rule)
  File "/usr/src/paperless/src/paperless_mail/mail.py", line 119, in get_correspondent
    "slug": slugify(correspondent_name)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 576, in get_or_create
    return self._create_object_from_params(kwargs, params)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 610, in _create_object_from_params
    obj = self.create(**params)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 447, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/src/paperless/src/documents/models.py", line 71, in save
    models.Model.save(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 754, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 792, in save_base
    force_update, using, update_fields,
  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 895, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/base.py", line 935, in _do_insert
    using=using, raw=raw,
  File "/usr/local/lib/python3.7/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 1254, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1397, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DataError: value too long for type character varying(50)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/paperless_mail/tasks.py", line 11, in process_mail_accounts
    account)
  File "/usr/src/paperless/src/paperless_mail/mail.py", line 204, in handle_mail_account
    f"Rule {rule.name}: Error while processing mail "
paperless_mail.mail.MailError: Rule Consume-GMail: Error while processing mail 174 of account gmail

I've set up some mail processing rules (which btw is an excellent feature that I plan on using extensively), and this one is configured to assign the correspondent from the sender name. I've located the offending mail: it comes from an Amazon Marketplace vendor that crammed a stupid amount of marketing claims into its name, which then shows up as the sender name of that mail.

So, seeing as I can't change that vendor's name, would it be possible to either increase the character limit in the db for that field (50 seems somewhat low, seeing how long those names can get when generated the way Amazon does) or to just truncate the name to 50 chars? I don't really see myself needing more than 50 chars after sanitizing those names, but the data needs to get into the system so I can actually do that.

Another effect that comes from this is that the consumption process entirely fails after this error. This doesn't seem like it should break the consumption of other mails, so maybe catching that error and continuing would be more intuitive. Right now one mail with an unexpected format is preventing all other mails in that IMAP folder from being consumed.

I'd love to provide a PR for this, but unfortunately my Python/Django-Fu isn't advanced enough for this yet.

jonaswinkler commented 3 years ago

Glad you like the feature :)

OH. Well, this is the stuff you usually don't think too much about, and I wasn't even aware of that character limit. Regarding extending the character limit: sure, we can do that. It's supposed to be 128 characters, but due to some weird legacy code that's still in there from OG paperless, it was limited to 50.

About truncating names: I'm not so sure. If we truncate correspondent names, Paperless would create a new correspondent object for the truncated name, possibly creating unwanted correspondents that you have to edit later. New mails from the same correspondent with the long name would then get truncated again, not match the edited correspondent, and yet another correspondent would be created. Not ideal.

I'd rather not set a correspondent at all in that case.
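A minimal sketch of that idea, not the actual paperless-ng code: the helper `usable_correspondent_name` and the constant `MAX_NAME_LENGTH` are hypothetical names, and the limit of 128 is simply the figure mentioned above. The function returns `None` when the sender name would not fit, so the caller can leave the correspondent unset instead of truncating.

```python
from typing import Optional

# Assumed limit, taken from the 128 characters mentioned in this thread.
MAX_NAME_LENGTH = 128


def usable_correspondent_name(
    sender_name: Optional[str], max_length: int = MAX_NAME_LENGTH
) -> Optional[str]:
    """Return the sender name if it fits the limit, otherwise None.

    Hypothetical helper: a None result means "do not set a correspondent".
    """
    if not sender_name:
        return None
    name = sender_name.strip()
    # Empty or over-long names are rejected rather than truncated.
    if not name or len(name) > max_length:
        return None
    return name
```

A caller would only look up or create a correspondent when the result is not `None`; the document itself would still be consumed either way.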

> Another effect that comes from this is that the consumption process entirely fails after this error. This doesn't seem like it should break the consumption of other mails, so maybe catching that error and continuing would be more intuitive. Right now one mail with an unexpected format is preventing all other mails in that IMAP folder from being consumed.

Sure, I'll put that on the list, seems reasonable.
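The "catch the error and continue" behaviour requested above could look roughly like the following. This is a standalone sketch under stated assumptions, not the real `handle_mail_account` implementation: `process_all` and `demo_handler` are hypothetical names, and messages are plain strings here purely for illustration.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("mail_sketch")  # hypothetical logger name


def process_all(
    messages: Iterable[str], handle_message: Callable[[str], int]
) -> Tuple[int, List[str]]:
    """Handle every message; collect failures instead of aborting the run."""
    processed = 0
    failed: List[str] = []
    for message in messages:
        try:
            processed += handle_message(message)
        except Exception:
            # Log the traceback, remember the message, and move on to the next one.
            logger.exception("Error while processing mail %r", message)
            failed.append(message)
    return processed, failed


def demo_handler(message: str) -> int:
    """Stand-in handler that fails on over-long input, like the mail in this report."""
    if len(message) > 50:
        raise ValueError("value too long")
    return 1
```

With this shape, one bad mail ends up in the `failed` list and in the logs, while the remaining mails in the folder are still consumed.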