tfeldmann / organize

The file management automation tool.
http://organize.readthedocs.io
MIT License

filecontent: RegEx expression to find a string with multiple words not working #142

Open dbffm opened 3 years ago

dbffm commented 3 years ago

Hi all,

I don't know whether this is a bug or whether I am using it wrong. I am new to regular expressions, but I did a lot of research and some exercises on https://regex101.com/

I have a searchable PDF file (created with synoOCR) and I am searching for the string "Beispiel Lebensversicherung AG".

My config.yaml looks like this:

rules:
  - folders: /Documents/Scanner/action
    subfolders: true
    filters:
    - extension: pdf
    actions:
        - echo: "Found PDF!"

    #Beispiel Lebensversicherung
  - folders: /Documents/Scanner/action
    filters:
    - extension: pdf
    - filecontent:
      - \bBeispiel\s\bLebensversicherung\s\bAG
    actions:
    - copy: 'custom folder'

Unfortunately, the rule is not working. I found that it does work when I search for only one word of the string. For example:

    - filecontent:
      - \bLebensversicherung

works, but when I combine it with the two other words, the rule is not executed. I tested the expression on regex101: https://regex101.com/r/7bFIu8/1

Do I have to change the filecontent line in the config file? I am also wondering how to add regex options (flags) in the YAML file.

dbffm commented 3 years ago

Update: I did another test with

  - folders: /Documents/Scanner/action
    subfolders: true
    filters:
    - extension: pdf
    - filecontent: '(?P<all>.*)'
    actions:
    - echo: "{filecontent.all}"

Now I can see in the log file why the rule was not executed: there are two whitespace characters between each pair of words. When I manually copy and paste the text from my PDF reader (Foxit Reader on Windows), there is only one whitespace character between the words.

So changing the regex to

    - filecontent:
      - \bBeispiel\s+\bLebensversicherung\s+\bAG

is the solution for me.
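
The difference between the two patterns can be reproduced outside organize with Python's `re` module. This is only a minimal sketch: the sample string with two spaces is an assumption modeled on what the log output showed, not the actual text extracted from the PDF.

```python
import re

# Hypothetical sample: two spaces between the words, as seen in the log.
extracted = "Beispiel  Lebensversicherung  AG"

# Original pattern: exactly one whitespace character between words.
single = re.compile(r"\bBeispiel\s\bLebensversicherung\s\bAG")
# Adjusted pattern: one or more whitespace characters between words.
flexible = re.compile(r"\bBeispiel\s+\bLebensversicherung\s+\bAG")

single_match = single.search(extracted)
flexible_match = flexible.search(extracted)

print(single_match is not None)    # False: \s consumes only one of the two spaces
print(flexible_match is not None)  # True: \s+ consumes both spaces
```

The `\s+` version also still matches text with a single space between the words, so it covers both the copy-and-paste text and the extracted text.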

So I still have two open questions: Why does textract read two whitespace characters where my PDF reader shows one? And how do I add regex flags to the config.yaml file?
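
On the flags question, one possibility that needs no extra config option: if organize hands the expression to Python's `re` module (an assumption, not confirmed in this thread), flags can be written into the pattern itself with inline syntax such as `(?i)`. A sketch in plain Python, using a made-up sample string:

```python
import re

# (?i) at the start of the pattern enables case-insensitive matching,
# equivalent to passing re.IGNORECASE to re.compile().
pattern = re.compile(r"(?i)\bbeispiel\s+lebensversicherung\s+ag\b")

# Hypothetical sample text with two spaces between the words.
text = "Vertrag mit der Beispiel  Lebensversicherung  AG"

match = pattern.search(text)
found = match.group(0) if match else None
print(found)
```

If that assumption holds, the same pattern string could go into the filecontent line, ideally in single quotes so that YAML leaves the backslashes untouched.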