mapillary / mapillary_tools

Command line tools for processing and uploading Mapillary imagery
BSD 2-Clause "Simplified" License
267 stars 138 forks source link

Reduce processing overhead for large amount of imagery data #524

Open ptpt opened 2 years ago

ptpt commented 2 years ago

Although processing images (reading geotags) are fast, for large amount of data it's still slow. We should avoid processing the same images that has been uploaded or processed. See https://forum.mapillary.com/t/mapillary-tools-0-8-no-status-kept/6197/16

richlv commented 2 years ago

Heya, any luck with this one? Ended up reprocessing and EXIF-testing a lot of data today, would have benefited from the suggest functionality a lot :)

ptpt commented 2 years ago

@richlv maybe this isn't an issue if you process your images once, and then keep uploading.

Instead of keeping processing and uploading:

mapillary_tools process_and_upload path/to/images

use this workflow:

# process once (no need to process again unless you have new images added)
mapillary_tools process path/to/images

# then run upload (can be run multiple times due to interruptions)
mapillary_tools upload path/to/images
richlv commented 1 year ago

maybe this isn't an issue if you process your images once, and then keep uploading

Thank you for the workaround - that might work, but it adds extra complexity in the workflow. It already is some effort to capture and upload the images, would be great if the tools made this easier :)

Same as with the upload status tracking, could the processing status be tracked?

ptpt commented 1 year ago

It's a tradeoff here. It could be added for process in process_and_upload but it adds the complexity: we can't write status in the processed folders because they could be read-only, also users get annoyed by writing these temporary files. Then we need a centralized database to track the last time the folder was changed (mtime), and it is hard to do it right https://apenwarr.ca/log/20181113

Thank you for the workaround - that might work, but it adds extra complexity in the workflow.

Simply replace process_and_upload with upload. process_and_upload is an alias of process and upload. See: https://github.com/mapillary/mapillary_tools#process_and_upload

richlv commented 1 year ago

Thank you so much for looking into this. Can we please reopen this issue? Details on the reasoning follow.

As far as I understand, mapillary_tools already uses TMPDIR to store the archive files for upload. Thus status could be stored in this same directory, or a new environment variable like MLY_STATUS_DIR could be used.

Not quite sure why mtime would need to be tracked - just storing information of a particular file being processed seems sufficient. If I recall correctly, that is how the previous processing status tracking worked. If mtime would be desirable for cases where file path and name are unchanged, but contents change, that seems like an exception and could be left as something to be documented. Such an implementation would likely be sufficient for most if not all users/situations.

The replacing of process_and_upload with upload wouldn't work - that basically requires me to manually track processing status anyway, or script that logic myself. And it's so much better for such an improvement to be available to all users :)

To clarify on the need, a frequent usecase, especially when travelling:

To do so, I have a simple script that loops over these directories and runs process_and_upload. Directory status varies a lot - some directories are processed. Some are uploaded, but due to varying internet access quality, they are not always uploaded sequentially (some might fail in the middle), some can be uploaded partially. Usually only a portion of directories gets uploaded on any given day, and new directories are added on a daily basis.

Thus the only thing the script really needs to do is loop over these directories, and process & upload. But without the status tracking for processing in mapillary_tools, it's either a lot of re-processing, or I need to implement status tracking in my script. This is very tricky to do if I'd like to do it per image. Per directory could result in incorrect status if processing is interrupted. Any such implementation outside of mapillary_tools would also be fragile, as tool changes could break it, and cause data loss.

ptpt commented 1 year ago

Thanks for the context. Would you mind sharing the relevant part of your script?

Here is a simple idea that avoids reprocess by checking the existence of each image directory's mapillary_image_description.json

#!/bin/sh
set -e

# loop through all image directories
for image_dir in "$@"; do
    if [ -d "$image_dir" ]; then
        desc_path="${image_dir}/mapillary_image_description.json"
        # check if the desc file exists
        if [ -f "$desc_path" ]; then
            echo processed already "$image_dir"
        else
            mapillary_tools process "$image_dir" --desc_path="$desc_path"
        fi
        mapillary_tools upload "$image_dir" --desc_path="$desc_path"
    fi
done

Usage:

sh loop.sh ~/daily_captures/*
richlv commented 1 year ago

The script, if we exclude the part that does some GPX trickery for edge cases, is trivial. $1 is path (passed from a for loop in the command), $gt usually is empty.

export TMPDIR='/somewhere/tmp'
...
echo "y" | mapillary_tools process_and_upload --user_name richlv --interpolate_directions "$1" $gt --skip_process_errors

Thank you for the script idea, greatly appreciated.

If checking for "mapillary_image_description.json" is a good approach, what would be the reason not to do so in mapillary_tools right away, and skip processing if found? That seems simpler than users scripting their own hacks, and there could be a status message when the file is found, thus making the behaviour very obvious.

ptpt commented 1 year ago

If checking for "mapillary_image_description.json" is a good approach, what would be the reason not to do so in mapillary_tools right away, and skip processing if found?

Think of mapillary_tools process image_dir as exiftool image_dir whose job is to extract metadata from images in the folder and write them out. Then checking this output file might be confusing to users.

I will keep the issue open until a good solution. For now, I would suggest script it:

image_dir="$1"
desc_path="${image_dir}/mapillary_image_description.json"
# check if the desc file exists
if ! [ -f "$desc_path" ]; then
    mapillary_tools process "$image_dir" $gt --desc_path="$desc_path" --interpolate_directions --skip_process_errors
fi
mapillary_tools upload "$image_dir" --desc_path="$desc_path" --user_name richlv

to replace

echo "y" | mapillary_tools process_and_upload --user_name richlv --interpolate_directions "$1" $gt --skip_process_errors
richlv commented 1 year ago

Have we observed specific scenarios where it would be confusing? Wondering whether the current overhead isn't causing more confusion :) A long time ago I added status checks in mapillary_tools and did a pull request, but that was before the full rewrite or two, won't be able to do anything merge-worthy now.

ptpt commented 1 year ago

@richlv could you try out https://github.com/mapillary/mapillary_tools/releases/tag/v0.10.2? The image processing performance is improved significantly in this version (700 images per second on my laptop MBP M1).

richlv commented 1 year ago

Thanks, that indeed is faster - greatly appreciated :) It would still be great to automatically avoid re-processing. Just this week I was several times restarting uploads with 10+ directories, 2GB of images per directory.

Is the potential for confusion easy to avoid, if it is still a concern?