mapillary / mapillary_tools

Command line tools for processing and uploading Mapillary imagery
BSD 2-Clause "Simplified" License
267 stars 138 forks source link

Huge temp zipfiles with v4 upload #405

Closed blaueente closed 3 years ago

blaueente commented 3 years ago

Basic information

Problem

During process_and_upload, a huge temporary zip file is generated in a place below the images. That means, image data is copied around uselessly with high I/O load which is slow for HDDs and wears out SSDs, and both problems apply for microSD cards.

Mitigation

Offer the location of the temp file to be selected as an option.

Solution

Use streaming, i.e., generate the file on-the-fly as it is uploaded. e.g. as demonstrated here (from a quick web search, I have not checked those, maybe there are other, better solutions): https://stackoverflow.com/questions/10405210/create-and-stream-a-large-archive-without-storing-it-in-memory-or-on-disk https://github.com/kbbdy/zipstream

ptpt commented 3 years ago

Thanks. Make total sense.

How large is the generated zip file usually? The streaming was the my first trial but I found the upload API doesn't support streaming (it asks for content length specified).

blaueente commented 3 years ago

I personally had ~2GB zip files. Sequence is was exactly 500 Pictures, and that is enforced by the upload script. So I guess overall size depends on how large the image files are; I can imagine that images from newer phones with 64 Megapixels might also be more like 20 MB, making the zip size more like 10GB

In theory, if compression is switched off for all pictures, the size should be deterministic, and one should be able to calculate the overall zip size beforehand using all the file sizes. Switching off compression completely would be no loss, as .jpg photos are almost never compressible. None was compressed in my mapillary upload zip yesterday.

ptpt commented 3 years ago

Hey @blaueente it's not reliable to predicate zip size even switch off compression, because it also writes other metadata to the ip file.

I ended up with the tempfile solution. Could you test it out?

blaueente commented 3 years ago

Great, thanks! I'll test it next time I'll have pictures to upload.

I just looked at the ZIP spec again, and I understand the problem is an unpredictable number of "extra field"s of unpredictable lenghts added by the python zip library. However! Using a pure python implementation such as in https://github.com/kbbdy/zipstream/blob/master/zipstream/zipstream.py, the number of "extra field"s is under control (they just don't use any). It seems to be straightforward to exactly predict the size of a zip file produced by this library.

The only complication is that this particular library does not (yet) support zip64, limiting the zip size to 4GB overall. However, it might be possible to add zip64 support, as - assuming individual jpg files are < 4GB - it is likely to only require some additional zip64 record for each file at the end directory.

What do you think of this approach? There is no license in github, but according to https://github.com/kbbdy/zipstream/blob/master/setup.py it is BSD licensed.

GPSMapper commented 3 years ago

I would also like to echo here.

  1. Upload logic has changed drastically, APIV3 upload (0.6.0 and earlier) was supporting streaming upload, so every file was uploaded in a queue (default 5 streams) individually without need to be packaged. If there is a limitation with API V4 not supporting streaming - this should be addressed in first place.
  2. Upload logging is now a complete nonsense - dumping huge amount of meaningless data into the console and not showing real useful details:

pre 0.7.0 upload console output was neat and clean, allowing user to see: a. number of sequences being uploaded b. number of images per each sequence uploaded c. individual sequence upload progress d. number of images skipped from upload (flagged earlier as duplicates)

p.s. - what is the process to get LATEST code with all merged fixes? pip install --upgrade git+https://github.com/mapillary/mapillary_tools does nothing (says 0.7.0 already installed and not fetching any pre-release hotfixes)

blaueente commented 3 years ago

I concur. Specifically, I liked the possibility to abort the upload at any point in time and resume it later. This should be addressed with the v4 API, because it might not be possible to upload 10GB (if you have 20MB pics) in one go.

I noticed that upload rate was not limited anymore compared to v3, which seems to be one major advantage of uploading huge files to facebook servers.

The re-upload could, of course, be addressed by using Content-Range, but I think this might not be standard for HTTP Post and needs to be specifically implemented or enabled on the facebook server. This of course also requires a predictable zip, so either a tempfile, or, the list of files must be stored to enable a stable zip generation as before. If some image files have been deleted in the meantime, they need to be included as, e.g. blank files, in order to ensure that the zip stays as is.

blaueente commented 3 years ago

I ended up with the tempfile solution. Could you test it out?

It seems to work. However, I am not sure if I like the approach with tempfile.TemporaryFile too much. Maybe tempfile.NamedTemporaryFile would be nicer? or even tempfile.mkstemp and then remove it only if the upload is really completed, to (later) allow for interruptions, such as mentioned in https://github.com/mapillary/mapillary_tools/issues/405#issuecomment-863423175 ?

ptpt commented 3 years ago

Hey @blaueente @GPSMapper could you try the PR above?

pip install --upgrade git+https://github.com/mapillary/mapillary_tools@resumable-upload

should upgrade

GPSMapper commented 3 years ago

should upgrade

does not work for me - itr seems tools are not updated. I am running on Windows 10 machine. Maybe there is a different path to force upgrade? All files timestamps remain as June 11 (last update to 0.7.0)

Run log:

C:\Users\waww>pip install --upgrade git+https://github.com/mapillary/mapillary_tools#resumable-upload
Collecting git+https://github.com/mapillary/mapillary_tools#resumable-upload
  Cloning https://github.com/mapillary/mapillary_tools to c:\users\waww\appdata\local\temp\pip-req-build-rgk92ll8
  Running command git clone -q https://github.com/mapillary/mapillary_tools 'C:\Users\waww\AppData\Local\Temp\pip-req-build-rgk92ll8'
Collecting Piexif@ git+https://github.com/mapillary/Piexif
  Cloning https://github.com/mapillary/Piexif to c:\users\waww\appdata\local\temp\pip-install-pcl0rv_8\piexif_9e61ec6dc1834c1c9772fe86ace3f652
  Running command git clone -q https://github.com/mapillary/Piexif 'C:\Users\waww\AppData\Local\Temp\pip-install-pcl0rv_8\piexif_9e61ec6dc1834c1c9772fe86ace3f652'
Requirement already satisfied: exifread==2.1.2 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.1.2)
Requirement already satisfied: gpxpy==0.9.8 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (0.9.8)
Requirement already satisfied: pymp4==1.1.0 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (1.1.0)
Requirement already satisfied: pynmea2==1.12.0 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (1.12.0)
Requirement already satisfied: python-dateutil==2.7.3 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.7.3)
Requirement already satisfied: pytz in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2021.1)
Requirement already satisfied: requests==2.20.0 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.20.0)
Requirement already satisfied: tqdm==2.2.4 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.2.4)
Requirement already satisfied: tzwhere in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (3.0.3)
Requirement already satisfied: construct==2.8.8 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from pymp4==1.1.0->mapillary-tools==0.7.0) (2.8.8)
Requirement already satisfied: six>=1.5 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil==2.7.3->mapillary-tools==0.7.0) (1.15.0)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (1.24.3)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (2020.12.5)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (2.7)
Requirement already satisfied: shapely in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from tzwhere->mapillary-tools==0.7.0) (1.7.1)
ptpt commented 3 years ago

Thanks for the fast reply! Could you try:

pip install --upgrade --force-reinstall git+https://github.com/mapillary/mapillary_tools@resumable-upload
GPSMapper commented 3 years ago

Could you try:

Yes, this worked (forced reinstall from specific branch). Files are are now updated, but I don't have any unprocessed files at the moment, so will do an upload test tomorrow.