Open clhunsen opened 9 years ago
I had another rough thought on this: what about making the patterns that Codeface supports explicit in the source code?
To be more specific, I thought about having one explicit pattern definition (i.e., a regex) and one transformation pattern (i.e., the regex replacement or rewriting) for each pattern Codeface supports, so that the various patterns are all transformed into the one we want to have (Hans Huber <huber@hubercorp.com>).
Plus a routine for mis-shaped strings, where we need to handle missing e-mail addresses, missing names, or similar.
This way, we only need to take care of one pattern for extraction, and the different patterns would be explicit and transparent in the source code, too.
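To make the idea concrete, here is a minimal sketch in Python of such a pattern table. It is only an illustration under assumptions, not Codeface's actual code: the names `PATTERNS`, `normalize`, `handle_misshaped`, and the placeholder `UNKNOWN_EMAIL` are hypothetical, and the two input shapes listed are examples rather than the full set Codeface supports.

```python
import re

# Address part shared by all patterns; accepts "@" or the obfuscated " at ".
ADDR = r'(?P<local>[^\s<>@]+)(?:@| at )(?P<domain>[^\s<>@]+)'

# The one target shape we want: Hans Huber <huber@hubercorp.com>
CANONICAL = r'\g<name> <\g<local>@\g<domain>>'

# Hypothetical placeholder for strings that carry no address at all.
UNKNOWN_EMAIL = 'unknown@invalid'

# One explicit (pattern, transformation) pair per supported input shape.
# The shapes listed here are illustrative assumptions.
PATTERNS = [
    # Name <address>, with optional quotes around the name
    (re.compile(r'^"?(?P<name>[^"<>]+?)"?\s*<' + ADDR + r'>$'), CANONICAL),
    # address (Name)
    (re.compile(r'^' + ADDR + r'\s*\((?P<name>[^()]+)\)$'), CANONICAL),
]


def handle_misshaped(raw):
    """Fallback for strings that match none of the explicit patterns."""
    match = re.search(ADDR, raw)
    if match:
        # Missing name: fall back to the local part (an assumption).
        local, domain = match.group('local'), match.group('domain')
        return '{0} <{0}@{1}>'.format(local, domain)
    # Missing address: keep the name and attach the placeholder.
    return '{} <{}>'.format(raw or 'unknown', UNKNOWN_EMAIL)


def normalize(raw):
    """Rewrite a From value to the canonical 'Name <local@domain>' shape."""
    raw = raw.strip()
    for pattern, template in PATTERNS:
        match = pattern.match(raw)
        if match:
            return match.expand(template)
    return handle_misshaped(raw)
```

With this layout, `normalize("huber at hubercorp.com (Hans Huber)")` and `normalize("Hans Huber <huber at hubercorp.com>")` both yield the canonical form, and supporting a new shape means adding one entry to the table. Decoding of encoded names is deliberately left out of the sketch.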
Problem
When considering the following two From lines in mbox files, Codeface will run into problems right now:
While the string " at " is replaced by "@" in the first case, it is not in the second. In the first case, the name is not properly parsed (it is actually Zsbán Ambrus); the string is stored as is in the database.
Fix for case two
The following patch by @wolfgangmauerer (taken from the mailing list, tested by me) implements more robust handling of e-mail addresses and transforms the second case correctly.
After applying the patch, the database contains:
Things to do