tenekev / immich-auto-stack

This is a simple yet highly configurable Python script, dressed as a Docker container, that stacks photos together based on certain criteria.

Matching Google Camera RAW + JPEG #1

ddbb07 opened this issue 1 month ago

ddbb07 commented 1 month ago

The current script doesn't work for Google Camera RAW + JPEG, since Google Camera uses the following naming scheme: "PXL_20240808_195214323.RAW-01.COVER.jpg" and "PXL_20240808_195214323.RAW-02.ORIGINAL.dng".

These can be matched using only the suffixes "*-01.COVER.jpg" and "*-02.ORIGINAL.dng". I don't know how you would want this to be implemented, since the current code only checks file extensions.

tenekev commented 1 month ago

I can implement an ignore_string argument. When you specify "-01.COVER" and "-02.ORIGINAL", it will filter them out of the filenames before matching duplicates.

It's really easy to do, but I wonder if that's the best approach.
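
Something like this, as a quick sketch of the idea (ignore_string is not implemented yet):

def strip_ignore_strings(filename, ignore_strings=("-01.COVER", "-02.ORIGINAL")):
  # Sketch only: drop the configured substrings before comparing names
  for s in ignore_strings:
    filename = filename.replace(s, "")
  return filename

# Both of your examples reduce to the same base name:
# "PXL_20240808_195214323.RAW.jpg" and "PXL_20240808_195214323.RAW.dng"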

m3brown commented 1 month ago

I started working on customizable criteria because I had a similar desire for more flexibility. I just opened #2, which is an approach to controlling the match logic with regex passed in as an env var.

I did some tweaking of my preferred regex to also suit your example strings:

r"([A-Z]+[-_]?[0-9]{4,}([-_][0-9]{4,})?)([\._-].*)?\.[\w]{3,4}$"

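To illustrate, a quick check with Python's re module (assuming the first capture group is what gets used as the stacking key):

import re

# m3brown's proposed pattern; group 1 would be the stacking key
pattern = r"([A-Z]+[-_]?[0-9]{4,}([-_][0-9]{4,})?)([\._-].*)?\.[\w]{3,4}$"

for name in ("PXL_20240808_195214323.RAW-01.COVER.jpg",
             "PXL_20240808_195214323.RAW-02.ORIGINAL.dng"):
  match = re.search(pattern, name)
  print(name, "->", match.group(1) if match else "no match")

# Both names yield "PXL_20240808_195214323", so the pair would stack together.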

I'm very open to feedback and ideas.

m3brown commented 1 month ago

@ddbb07 I believe the change above has been pushed to ghcr.io/tenekev/immich-auto-stack:latest, if you want to give it a whirl.

PythonNut commented 3 weeks ago

I made an effort to design my own regex, but I ran into another issue. Apparently Pixels record the JPEG and the RAW with different localDateTimes. This causes matching to fail when localDateTime is used as a criterion, which seems important to keep for, e.g., Sony ARWs. Fortunately, it's at least possible to work around this because:

  1. The Pixel does at least give the same base file name.
  2. The base name includes the localDateTime of the JPEG (in both cases) to millisecond precision.

Thus, for Pixel files specifically, we can safely ignore localDateTime and match based only on the file name. But of course we need localDateTime for other files, so there's no way to configure things correctly right now.

One way this could be addressed is to provide a list of possible criteria combinations and the program steps through the list and takes the first combination that doesn't fail (i.e. all regexes match). Then we could write a very narrow regex for Pixel files and fall back to a relaxed regex + timestamp for other files.
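
A minimal sketch of that fallback loop (hypothetical helper names, not the script's actual API):

import re

class MatchMiss(Exception):
  """Raised when a criterion's regex doesn't match (hypothetical)."""

def apply_criterion(asset, criterion):
  # asset: dict of asset fields; criterion: one entry of a criteria set,
  # mirroring the CRITERIA json objects used elsewhere in this thread
  value = asset[criterion["key"]]
  if "regex" in criterion:
    m = re.search(criterion["regex"], value)
    if not m:
      raise MatchMiss(value)
    value = m.group(criterion.get("index", 0))
  return value

def stacking_key(asset, criteria_sets):
  # Take the first criteria set whose regexes all match this asset
  for criteria in criteria_sets:
    try:
      return tuple(apply_criterion(asset, c) for c in criteria)
    except (MatchMiss, KeyError):
      continue
  return None  # no set matched; skip or report the asset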

Of course, creating a matching DSL of ever increasing complexity does seem like a slippery slope to madness so it's worth thinking about this carefully, I suppose.

tenekev commented 3 weeks ago

It's a very slippery slope, indeed.

I believe you can still use a part of the localDateTime parameter. I included it initially because most cameras use a 4-digit sequence counter that rolls over after 9999. If you've shot 40k photos, you can potentially encounter the same filename four times. localDateTime ensures they aren't stacked together.

If you split, for example, "localDateTime": "2024-07-22T15:04:28.000Z" on the "T", you can use just the date portion. Here is an example:

[
  {
    "key": "originalFileName",
    "regex": {
      "key": "YOUR REGEX",
      "index": ?
    }
  },
  {
    "key": "localDateTime",
    "split": {
      "key": "T",
      "index": 0
    }
  }
]
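
For illustration, that split amounts to:

"2024-07-22T15:04:28.000Z".split("T")[0]  # -> "2024-07-22"
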
tenekev commented 3 weeks ago

Maybe we can even default to the date portion because it's very unlikely for someone to generate 10000 photos in a single day. And I can't think of any other scenario where the time portion matters.

m3brown commented 3 weeks ago

This is fun! Thanks for sharing your scenario.

Thought 1: In my PR a few days ago, I added a flag to silently ignore files that do not match a regex, rather than halting the script. This would allow you to run the script twice with two different configs:

  1. A regex whose prefix only matches Pixel images (e.g. one that starts with PXL), with no datetime check
  2. The default (or similar) config to capture all the remaining files

It'd be cool to be able to define a config that could do the above in one pass (the fallback approach you suggested), but I think the two-pass approach is a reasonable workaround for now.

Thought 2: The value of the date check, in my opinion, is for file formats like IMG_1234.jpg that are common across cameras. I have probably 10 IMG_1234.jpg files from the various cameras and phones I've owned over the past 15 years.

The odds are very slim that more granularity is required than "date" (without time), so I support @tenekev's suggested default.

One situation where I could see this being a problem is if two family members bought new phones on the same day, which isn't terribly uncommon. In that case, their photo IDs may sync up occasionally over time.

PythonNut commented 3 weeks ago

I think it's potentially a bit risky to discard the time portion of the datetime. @m3brown suggested unwanted collisions could happen if the numbers ever sync up on the same day. Under certain circumstances, I think it could happen often enough to be a bit annoying. For example, if multiple people go on a trip and bring cameras with the same brand. With a dedicated camera, it's quite possible to shoot hundreds of photos in a day (even thousands, although I've never hit 10k in a single day) so the chance of collision is not negligible (it happens to me every once in a while). For this reason, I'd be wary of using the date (only) as a default.

For now, I've settled on having multiple copies of the immich-auto-stack container running at the same time using SKIP_MATCH_MISS: True. It's not elegant, but it's effectively very similar to the fallback list I suggested and it does not require any manual intervention, which is nice.

tenekev commented 3 weeks ago

One situation I could see this being a problem is if two family members bought new phones on the same day...

I have not thought of multi-user instances, but I guess we can solve this by adding ownerId as the primary criterion.

ownerId > originalFileName.split('.')[0] > localDateTime.split('T')[0]

This should ensure that assets from different users aren't mixed, although I find it highly unlikely that two users will generate the same filename on the same day.

Another approach is to use regex on the localDateTime string instead of a simple split method. If we can capture everything but the seconds, yyyy-MM-ddTHH:mm, the chances of two assets with the same name within a minute are extremely small.
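
Something like this, sketched in Python:

import re

# Capture yyyy-MM-ddTHH:mm from the ISO timestamp and use that as the
# criterion value, so a collision needs the same filename within one minute
m = re.match(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})", "2024-07-22T15:04:28.000Z")
print(m.group(1))  # -> 2024-07-22T15:04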

As for the SKIP_MATCH_MISS functionality, I think it needs some more explaining. It's a good idea, but I wasn't entirely sure how to explain it in the readme.

We can probably eliminate the split method entirely and rely on regex for everything. originalFileName.split('.')[0] doesn't always work as intended - @ddbb07's sample filenames contain more than one dot, producing results that just happen to work. Relying on regex only would streamline the process.
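
For example:

# The first dot in those names sits right after the timestamp, so split('.')
# happens to return a usable key - but only by accident:
"PXL_20240808_195214323.RAW-01.COVER.jpg".split('.')[0]
# -> 'PXL_20240808_195214323'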

It might be too ambitious, but building on that we could even have multiple sets of criteria for different filename schemas. Each schema would either match a given filename or fail silently to the next.

PythonNut commented 2 weeks ago

I think multiple sets of criteria do make sense. It would break backward compatibility, but you could pull the structure into the YAML itself so it's not so hard to edit. E.g., if you wanted to merge photos from Sony cameras and Pixel phones separately, you could potentially write something like:

CRITERIA:
  - - key: originalFileName
      regex: "DSC..."
      index: 1
    - key: localDateTime

  - - key: originalFileName
      regex: "PXL..."
      index: 1

In this case, SKIP_MATCH_MISS would presumably come into play after all supplied patterns fail to match.

tenekev commented 1 week ago

I really wanted to contain the whole service and its config in a single docker-compose block, and I really don't want to move away from this idea. I've always preferred portable services.

Meeting you halfway, we can do something like:

environment:
  CRITERIA_0_key: originalFileName
  CRITERIA_0_regex: "DSC..."
  CRITERIA_0_index: 1
  CRITERIA_1_key: originalFileName
  CRITERIA_1_regex: "PXL..."
  CRITERIA_1_index: 1
PythonNut commented 1 week ago

Ah right, I forgot we have to pack it all into environment variables at the end of the day, which your format addresses, of course. The one missing thing is that it only has one dimension of indexing. I'm not sure if the index you wrote enumerates separate criteria sets or the fields of a single criterion, but I think both levels are important?

m3brown commented 1 week ago

While dynamic environment variables could work, it sounds messy. Something like a while loop that expects some or all of key, regex, and index to be defined. It'd be up to the user to ensure the numbers are sequential (i.e. they didn't skip from 2 to 4, missing 3).

The current criteria parser expects JSON as a single environment variable, and docker-compose supports multi-line JSON. I imagine a rejiggering of the expected CRITERIA JSON schema would be equivalent to the YAML proposal.

environment:
  CRITERIA: > 
    [
      [
        {
          "key": "originalFileName",
          "regex": "DSC...",
          "index": 1
        },
        {
          "key": "localDateTime"
        }
      ],
      [
        {
          "key": "originalFileName",
          "regex": "PXL...",
          "index": 1
        }
      ]
    ]

The nested structure gets a little gnarly to navigate. Nested arrays seem a little intense when most people will only have one criteria block. But I think it is an accurate description!

Another idea to make the structure a little flatter is to define the priority of each block. Below, two blocks have priority 1, so they join to make a set.

environment:
  CRITERIA: >
    [
      {
        "key": "originalFileName",
        "regex": "DSC...",
        "regex_index": 1,
        "priority": 1
      },
      {
        "key": "localDateTime",
        "priority": 1
      },
      {
        "key": "originalFileName",
        "regex": "PXL...",
        "regex_index": 1,
        "priority": 2
      }
    ]
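
For what it's worth, regrouping that flat, priority-tagged list into nested sets is a small transform. A sketch (my own, not the current parser):

import json
from itertools import groupby

flat = json.loads("""
[
  {"key": "originalFileName", "regex": "DSC...", "regex_index": 1, "priority": 1},
  {"key": "localDateTime", "priority": 1},
  {"key": "originalFileName", "regex": "PXL...", "regex_index": 1, "priority": 2}
]
""")

# Sort by priority, then group adjacent blocks that share a priority
flat.sort(key=lambda c: c["priority"])
criteria_sets = [list(group) for _, group in groupby(flat, key=lambda c: c["priority"])]
# -> two sets: [DSC... + localDateTime] and [PXL...], the same nesting as above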

I'm sure we can come up with more ideas for the ideal schema. My biggest point is that I think JSON could work well for the need.

tenekev commented 1 week ago

While dynamic environment variables could work, it sounds messy.

I thought so too, but once you parse the variables with the CRITERIA_ prefix into a JSON structure, you can validate every object against 2-3 rules and spit out any inconsistencies or ignore them altogether. They don't even have to be sequentially indexed. As long as x and y in CRITERIA_x_y_key are integers, they can be ordered.

I even wrote a quick and dirty PoC but I've been too busy to do anything this week.

import os, re, json
from jsonschema import validate, ValidationError

# A criterion requires at least a key; specifying a regex also requires an index
criteria_schema = {
  "type": "object",
  "properties": {
    "key": {"type": "string"},
    "regex": {"type": "string"},
    "index": {"type": "integer"}
  },
  "required": ["key"],
  "dependencies": {
    "regex": ["index"]
  }
}

def extract_criteria():
  # Collect CRITERIA_<priority>_<criteria>_<field> env vars into a nested
  # dict: {priority: {criteria_index: {field: value}}}
  criteria_pattern = re.compile(r'^CRITERIA_(\d{1,3})_(\d{1,3})_(\w+)$')
  criteria_dict = {}

  for env_var, value in os.environ.items():
    match = criteria_pattern.match(env_var)

    if match:
      priority, index, field = match.groups()

      if priority not in criteria_dict:
        criteria_dict[priority] = {}

      if index not in criteria_dict[priority]:
        criteria_dict[priority][index] = {}

      criteria_dict[priority][index][field] = value

  return criteria_dict

def validate_criteria(criteria_dict):
  # The regex in extract_criteria allows priority indices up to 999
  valid_criteria = [None for x in range(1000)]

  for priority_index, priority in criteria_dict.items():
    valid_criteria[int(priority_index)] = []

    # Sort numerically so criteria keep their intended order within a priority
    for criteria_index in sorted(priority, key=int):
      criteria = priority[criteria_index]

      # An index only matters (and is only required) alongside a regex
      if 'regex' in criteria:
        try:
          criteria['index'] = int(criteria.get('index'))

        except (ValueError, TypeError):
          print(f"Invalid index for CRITERIA_{priority_index}_{criteria_index}: {criteria.get('index')}")
          continue

      try:
        validate(instance=criteria, schema=criteria_schema)
        valid_criteria[int(priority_index)].append(criteria)

      except ValidationError as e:
        print(f"Validation error for CRITERIA_{priority_index}_{criteria_index}: {e.message}")

    if valid_criteria[int(priority_index)] == []:
      valid_criteria[int(priority_index)] = None

  valid_criteria = [x for x in valid_criteria if x is not None]

  return valid_criteria

if __name__ == "__main__":

  os.environ["CRITERIA_0_0_key"] = "originalFileName"
  os.environ["CRITERIA_0_0_regex"] = "DSC.*"
  os.environ["CRITERIA_0_0_index"] = "1"
  os.environ["CRITERIA_0_3_key"] = "localDateTime"
  os.environ["CRITERIA_0_3_regex"] = "(\d\d\d\d-\d\d-\d\dT\d\d\:\d\d)\:\d\d\..+"
  os.environ["CRITERIA_0_3_index"] = "1"
  os.environ["CRITERIA_1_0_key"] = "originalFilerName"
  os.environ["CRITERIA_1_0_regex"] = "PXrL.*"
  os.environ["CRITERIA_1_0_index"] = "0"

  # This criteria is missing a required key
  os.environ["CRITERIA_5_0_regex"] = "PXrL.*"
  os.environ["CRITERIA_5_0_index"] = "0"

  # This criteria is missing a required index for its RegEx
  os.environ["CRITERIA_17_3_key"] = "localDateTime"
  os.environ["CRITERIA_17_3_regex"] = "(\d\d\d\d-\d\d-\d\dT\d\d\:\d\d)\:\d\d\..+"

  # -----

  extracted_criteria = extract_criteria()
  valid_criteria = validate_criteria(extracted_criteria)

  print(json.dumps(valid_criteria, indent=2))

Returns this:

Validation error for CRITERIA_5_0: 'key' is a required property
Invalid index for CRITERIA_17_3: None
[
  // Priority 0
  [ 
    // Criteria 0
    {
      "key": "originalFileName",
      "regex": "DSC.*",
      "index": 1
    },
    // Criteria 1 (but was named 3 in the env var)
    {
      "key": "localDateTime",
      "regex": "(\\d\\d\\d\\d-\\d\\d-\\d\\dT\\d\\d\\:\\d\\d)\\:\\d\\d\\..+",
      "index": 1
    }
  ],
  // Priority 1
  [
    // Criteria 0
    {
      "key": "originalFilerName",
      "regex": "PXrL.*",
      "index": 0
    }
  ]
]

This is how priority can be denoted:

environment:
  CRITERIA_0_0_key: originalFileName
#     ^    ^ ^  ^          ^
#     |    | |  |          Value
#     |    | |  Key 
#     |    | Criteria Index
#     |    Priority Index
#     Keyword that triggers the parser

Does this get way too complicated? The main reason behind this effort is my concern about mixing YAML and JSON with multiline strings, which I've always found a bit crummy.