Open sp451 opened 1 year ago
Just brainstorming here, I can think of a couple of different ways to do this.
1. Add a separate config entry alongside metadata_regex, like metadata_title_regex. This has the benefit of letting you specify a different regex for the title than for document content.
2. Apply the existing metadata_regex to the title as well. This has the benefit of being simpler, but it's also less powerful.

In either case, the question I'm struggling with is whether to apply the regex to the title before or after applying it to the document contents. This matters because if you try to extract the same metadata field from both and get conflicting results, the one that was extracted last is going to take precedence.
Unfortunately there's no elegant way to specify the order given the current YAML structure (since it's a dictionary and not a list) without adding another boolean config entry like process_title_first.
It's not pretty, but maybe having that boolean is the way to go...
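To make the precedence question concrete, here is a rough Python sketch of how the ordering could play out. It is not the actual implementation: the function extract_metadata and its signature are made up for illustration, and it assumes metadata fields are captured with named groups and that process_title_first would be a plain boolean.

```python
import re


def extract_metadata(title, content, metadata_regex,
                     metadata_title_regex=None, process_title_first=True):
    """Hypothetical helper: pull named groups out of the title and the content.

    Whichever source is processed last overwrites fields found earlier.
    """
    # Assumption: fall back to metadata_regex when no title-specific regex is set.
    title_pattern = metadata_title_regex or metadata_regex
    sources = [(title_pattern, title), (metadata_regex, content)]
    if not process_title_first:
        sources.reverse()

    metadata = {}
    for pattern, text in sources:
        match = re.search(pattern, text)
        if match:
            # Skip groups that did not participate in the match.
            found = {k: v for k, v in match.groupdict().items() if v is not None}
            metadata.update(found)  # later sources take precedence
    return metadata


date_regex = r"(?P<date>\d{4}-\d{2}-\d{2})"
title = "Invoice 2023-01-15"
content = "Statement date: 2023-02-01"

title_first = extract_metadata(title, content, date_regex, process_title_first=True)
content_first = extract_metadata(title, content, date_regex, process_title_first=False)
```

With the title processed first, the date from the content ends up in the result; flipping the boolean makes the title's date win instead. That is the conflict the extra config entry would have to resolve.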
Adding metadata_title_regex would be my preference. I agree that it's cleaner.
I see your point about the boolean but yeah, having a way to define the order seems necessary.
It'd be grand if the filename could be treated just like the document's content. This comes in handy if there is some sort of naming convention for the filenames, with things like doc type, ASN, correspondent and even tags stored in the filename, especially when coming from a legacy way of working without a DMS like paperless-ngx. You provided a workaround; the details are here: https://github.com/paperless-ngx/paperless-ngx/discussions/1935