Closed blaueente closed 3 years ago
Thanks. Make total sense.
How large is the generated zip file usually? The streaming was the my first trial but I found the upload API doesn't support streaming (it asks for content length specified).
I personally had ~2GB zip files. Sequence is was exactly 500 Pictures, and that is enforced by the upload script. So I guess overall size depends on how large the image files are; I can imagine that images from newer phones with 64 Megapixels might also be more like 20 MB, making the zip size more like 10GB
In theory, if compression is switched off for all pictures, the size should be deterministic, and one should be able to calculate the overall zip size beforehand using all the file sizes. Switching off compression completely would be no loss, as .jpg photos are almost never compressible. None was compressed in my mapillary upload zip yesterday.
Hey @blaueente it's not reliable to predicate zip size even switch off compression, because it also writes other metadata to the ip file.
I ended up with the tempfile solution. Could you test it out?
Great, thanks! I'll test it next time I'll have pictures to upload.
I just looked at the ZIP spec again, and I understand the problem is an unpredictable number of "extra field"s of unpredictable lenghts added by the python zip library. However! Using a pure python implementation such as in https://github.com/kbbdy/zipstream/blob/master/zipstream/zipstream.py, the number of "extra field"s is under control (they just don't use any). It seems to be straightforward to exactly predict the size of a zip file produced by this library.
The only complication is that this particular library does not (yet) support zip64, limiting the zip size to 4GB overall. However, it might be possible to add zip64 support, as - assuming individual jpg files are < 4GB - it is likely to only require some additional zip64 record for each file at the end directory.
What do you think of this approach? There is no license in github, but according to https://github.com/kbbdy/zipstream/blob/master/setup.py it is BSD licensed.
I would also like to echo here.
pre 0.7.0 upload console output was neat and clean, allowing user to see: a. number of sequences being uploaded b. number of images per each sequence uploaded c. individual sequence upload progress d. number of images skipped from upload (flagged earlier as duplicates)
p.s. - what is the process to get LATEST code with all merged fixes?
pip install --upgrade git+https://github.com/mapillary/mapillary_tools
does nothing (says 0.7.0 already installed and not fetching any pre-release hotfixes)
I concur. Specifically, I liked the possibility to abort the upload at any point in time and resume it later. This should be addressed with the v4 API, because it might not be possible to upload 10GB (if you have 20MB pics) in one go.
I noticed that upload rate was not limited anymore compared to v3, which seems to be one major advantage of uploading huge files to facebook servers.
The re-upload could, of course, be addressed by using Content-Range, but I think this might not be standard for HTTP Post and needs to be specifically implemented or enabled on the facebook server. This of course also requires a predictable zip, so either a tempfile, or, the list of files must be stored to enable a stable zip generation as before. If some image files have been deleted in the meantime, they need to be included as, e.g. blank files, in order to ensure that the zip stays as is.
I ended up with the tempfile solution. Could you test it out?
It seems to work. However, I am not sure if I like the approach with tempfile.TemporaryFile too much. Maybe tempfile.NamedTemporaryFile would be nicer? or even tempfile.mkstemp and then remove it only if the upload is really completed, to (later) allow for interruptions, such as mentioned in https://github.com/mapillary/mapillary_tools/issues/405#issuecomment-863423175 ?
Hey @blaueente @GPSMapper could you try the PR above?
pip install --upgrade git+https://github.com/mapillary/mapillary_tools@resumable-upload
should upgrade
should upgrade
does not work for me - itr seems tools are not updated. I am running on Windows 10 machine. Maybe there is a different path to force upgrade? All files timestamps remain as June 11 (last update to 0.7.0)
Run log:
C:\Users\waww>pip install --upgrade git+https://github.com/mapillary/mapillary_tools#resumable-upload
Collecting git+https://github.com/mapillary/mapillary_tools#resumable-upload
Cloning https://github.com/mapillary/mapillary_tools to c:\users\waww\appdata\local\temp\pip-req-build-rgk92ll8
Running command git clone -q https://github.com/mapillary/mapillary_tools 'C:\Users\waww\AppData\Local\Temp\pip-req-build-rgk92ll8'
Collecting Piexif@ git+https://github.com/mapillary/Piexif
Cloning https://github.com/mapillary/Piexif to c:\users\waww\appdata\local\temp\pip-install-pcl0rv_8\piexif_9e61ec6dc1834c1c9772fe86ace3f652
Running command git clone -q https://github.com/mapillary/Piexif 'C:\Users\waww\AppData\Local\Temp\pip-install-pcl0rv_8\piexif_9e61ec6dc1834c1c9772fe86ace3f652'
Requirement already satisfied: exifread==2.1.2 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.1.2)
Requirement already satisfied: gpxpy==0.9.8 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (0.9.8)
Requirement already satisfied: pymp4==1.1.0 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (1.1.0)
Requirement already satisfied: pynmea2==1.12.0 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (1.12.0)
Requirement already satisfied: python-dateutil==2.7.3 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.7.3)
Requirement already satisfied: pytz in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2021.1)
Requirement already satisfied: requests==2.20.0 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.20.0)
Requirement already satisfied: tqdm==2.2.4 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (2.2.4)
Requirement already satisfied: tzwhere in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from mapillary-tools==0.7.0) (3.0.3)
Requirement already satisfied: construct==2.8.8 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from pymp4==1.1.0->mapillary-tools==0.7.0) (2.8.8)
Requirement already satisfied: six>=1.5 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil==2.7.3->mapillary-tools==0.7.0) (1.15.0)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (1.24.3)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (2020.12.5)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from requests==2.20.0->mapillary-tools==0.7.0) (2.7)
Requirement already satisfied: shapely in c:\users\waww\appdata\local\programs\python\python39\lib\site-packages (from tzwhere->mapillary-tools==0.7.0) (1.7.1)
Thanks for the fast reply! Could you try:
pip install --upgrade --force-reinstall git+https://github.com/mapillary/mapillary_tools@resumable-upload
Could you try:
Yes, this worked (forced reinstall from specific branch). Files are are now updated, but I don't have any unprocessed files at the moment, so will do an upload test tomorrow.
Basic information
0.7.0
Linux
Problem
During process_and_upload, a huge temporary zip file is generated in a place below the images. That means, image data is copied around uselessly with high I/O load which is slow for HDDs and wears out SSDs, and both problems apply for microSD cards.
Mitigation
Offer the location of the temp file to be selected as an option.
Solution
Use streaming, i.e., generate the file on-the-fly as it is uploaded. e.g. as demonstrated here (from a quick web search, I have not checked those, maybe there are other, better solutions): https://stackoverflow.com/questions/10405210/create-and-stream-a-large-archive-without-storing-it-in-memory-or-on-disk https://github.com/kbbdy/zipstream