Add capability to extract metadata directly from the filename

sp451 commented 1 year ago

It'd be grand if the filename could be treated just like the document's content. This comes in handy when filenames follow a naming convention and things like doc type, ASN, correspondent and even tags are stored in the filename, especially when coming from a legacy way of working without a DMS like paperless-ngx. You provided a workaround, described here: https://github.com/paperless-ngx/paperless-ngx/discussions/1935

jgillula commented 1 year ago

Just brainstorming here, I can think of a couple of different ways to do this.

1. Have a separate line in the yaml file in addition to metadata_regex, like metadata_title_regex. This has the benefit of letting you specify a different regex for the title than for document content.
2. Just apply metadata_regex to the title as well. This has the benefit of being simpler, but it's also less powerful.
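
To make the first option concrete, here is a rough Python sketch of what a separate title regex could look like. It is only an illustration: metadata_title_regex is the key proposed in this thread, while the patterns, the sample title and content, the group names and the extract helper are all made up for the example and are not taken from the postprocessor's code.

```python
import re

# Hypothetical rule set. "metadata_title_regex" is the key proposed above;
# both patterns and all group names are invented for illustration.
rules = {
    "metadata_regex": (
        r"Invoice date: (?P<created_year>\d{4})-"
        r"(?P<created_month>\d{2})-(?P<created_day>\d{2})"
    ),
    "metadata_title_regex": (
        r"^(?P<document_type>[a-z]+)_(?P<correspondent>[a-z]+)_(?P<asn>\d+)$"
    ),
}


def extract(pattern, text):
    """Return the named groups of the first match, or an empty dict."""
    match = re.search(pattern, text)
    return match.groupdict() if match else {}


# The title gets its own pattern; the content keeps the existing one.
title_fields = extract(rules["metadata_title_regex"], "invoice_acme_0042")
content_fields = extract(
    rules["metadata_regex"], "Invoice date: 2023-01-15\nTotal: 10.00"
)
```

With the second option there would be only one pattern, so it would have to match both the title convention and the document text, which is where the loss of power comes from.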

In either case, the question I'm struggling with is whether to apply the regex to the title before or after applying it to the document contents. This matters because if you try to extract the same metadata field from both and get conflicting results, the one that was extracted last is going to take precedence.

Unfortunately there's no elegant way to specify the order given the current yaml structure (since it's a dictionary and not a list) without adding another boolean config entry like process_title_first.

It's not pretty, but maybe having that boolean is the way to go...
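
For what it's worth, the precedence problem can be sketched in a few lines of Python. This is a hypothetical illustration, not the postprocessor's actual code: process_title_first is the boolean floated above, and merge_metadata, extract and the sample patterns are names invented for the example.

```python
import re


def extract(pattern, text):
    """Return the named groups of the first match, or an empty dict."""
    match = re.search(pattern, text)
    return match.groupdict() if match else {}


def merge_metadata(title, content, title_regex, content_regex,
                   process_title_first=True):
    """Merge fields from title and content; the source processed last wins."""
    title_fields = extract(title_regex, title)
    content_fields = extract(content_regex, content)
    if process_title_first:
        ordered = [title_fields, content_fields]
    else:
        ordered = [content_fields, title_fields]
    merged = {}
    for fields in ordered:
        merged.update(fields)  # later sources overwrite earlier ones
    return merged


# Both sources yield a "correspondent" field, and the values conflict.
title_regex = r"^(?P<correspondent>[a-z]+)_"
content_regex = r"Correspondent: (?P<correspondent>\w+)"

title_first = merge_metadata(
    "acme_2022", "Correspondent: globex", title_regex, content_regex,
    process_title_first=True,
)
content_first = merge_metadata(
    "acme_2022", "Correspondent: globex", title_regex, content_regex,
    process_title_first=False,
)
```

Here title_first ends up with the value from the content ("globex") and content_first with the value from the title ("acme"), so the flag really just decides which source gets the last word.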

sp451 commented 1 year ago

Adding metadata_title_regex would be my preference. I agree that it's cleaner. I see your point about the boolean but yeah, having a way to define the order seems necessary.

jgillula / paperless-ngx-postprocessor

Add capability to extract metadata directly from the filename #12