maximlamare / S3_extract

Extract the outputs from the S3 OLCI processor for a given number of S3 files at given coordinates.
MIT License

Memory cache #10

Open widaro opened 4 years ago

widaro commented 4 years ago

Problem discovered using an ESA virtual machine (Ubuntu 18.04.2 LTS, 64-bit).

Snappy does not clear its memory cache when 's3_extract_snow_product.py' is used to extract data from multiple '.SEN3' files. After processing a few '.SEN3' files (how many depends on your available memory), the program terminates with an out-of-memory error. The problem is also described in this thread: https://forum.step.esa.int/t/temporary-fix-for-snappy-memory-issues/9772. I have found two possible workarounds, but neither is optimal:
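For reference, this is a minimal sketch of the pattern that triggers the leak (paths are placeholders, and the extraction step is elided); even disposing each product does not return all of the JVM-side memory:

    from pathlib import Path
    from snappy import ProductIO

    for scene in Path("path/to/sen3/folder").glob("*.SEN3"):
        product = ProductIO.readProduct(str(scene))
        # ... extract the pixel values at the given coordinates here ...
        product.dispose()  # frees the product, but snappy's cache keeps growing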

  1. Use the Python subprocess module as described in this temporary fix: https://forum.step.esa.int/t/temporary-fix-for-snappy-memory-issues/9772. I have only briefly explored this method, as I noticed that the imported modules are NOT inherited by the script called in the subprocess (the main drawback of re-invoking a script is that the modules have to be reloaded on each call, which takes a substantial amount of time). I think the second workaround below is preferable, as it allows more concurrency and only requires small changes to the existing code. A sketch of this approach is given right after this list.

  2. Make 's3_extract_snow_product.py' accept a '.txt' file containing the paths of a number of '.SEN3' files as the '-i' parameter (instead of '-i' being the satellite image repository). To process many '.SEN3' files at once, multiple '.txt' files can then be fed to 's3_extract_snow_product.py' through the terminal, with the concurrent runs handled by GNU parallel. The optimal number of '.SEN3' paths per '.txt' file depends on the number of concurrent runs, the available memory, and how well the '.SEN3' files are sorted (the percentage of '.SEN3' files with successful extraction). With (#cores*4) GB of RAM, I have found 10 '.SEN3' files per '.txt' input file to be a good choice. A sketch of the required input-handling change follows the example file listings below.
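Here is a rough sketch of the subprocess approach from workaround 1. 'extract_single.py' is a hypothetical one-scene wrapper, not a script in this repository; the point is that the child process releases all of snappy's memory when it exits, but pays the import cost again on every call:

    import subprocess
    import sys

    def extract_one(scene_path, coords_path, out_path):
        # Run the snappy work in a child interpreter so its memory is
        # released when the child exits; snappy is re-imported each time.
        result = subprocess.run(
            [sys.executable, "extract_single.py",
             "-i", scene_path, "-c", coords_path, "-o", out_path],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            print(f"extraction failed for {scene_path}: {result.stderr}")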

The above explanation might be a bit hard to keep track of, so here is an example of how the second workaround can be used to process multiple '.SEN3' files with GNU parallel. 'in0001.txt':

S3B_OL_1_EFR____20200411T140804_20200411T141104_20200411T160403_0179_037_324_1800_LN1_O_NR_002.SEN3
   ...
   ...
S3A_OL_1_EFR____20180601T014326_20180601T014626_20180602T052411_0180_032_003_1440_LN1_O_NT_002.SEN3

'in0002.txt':

S3A_OL_1_EFR____20180601T014926_20180601T015226_20180602T052455_0179_032_003_1800_LN1_O_NT_002.SEN3
   ...
   ...
S3A_OL_1_EFR____20180601T013946_20180601T014026_20180601T041111_0039_032_003_1080_SVL_O_NR_002.SEN3

Both 'in0001.txt' and 'in0002.txt' are located in 'path/to/txt/files'.
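The change to the script's input handling could look roughly like this (the argument names are assumptions based on the description above, not the script's actual interface):

    import argparse
    from pathlib import Path

    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", required=True,
                        help="'.txt' file with one '.SEN3' path per line")
    parser.add_argument("-c", "--coords", required=True)
    parser.add_argument("-o", "--output", required=True)
    args = parser.parse_args()

    # Each short-lived run only ever opens the ~10 scenes listed in the
    # file, so snappy's cache never grows past a handful of products.
    scenes = [line.strip() for line in Path(args.input).read_text().splitlines()
              if line.strip()]
    for scene in scenes:
        ...  # the existing snappy extraction code, unchanged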

To run with GNU parallel (the exact command below has not been tested, but something very similar worked):

find path/to/txt/files -type f -name "*.txt" | parallel python s3_extract_snow_product.py -c coords/path -o output/path -i

GNU parallel appends each piped file name as the last argument, i.e. after '-i'. Maybe using --tmux could make the program more robust in case of an error, as it then uses a new terminal for each run.

This will output two '.csv' files. They could be named after the input files (so in this example the outputs would be named something like 'out0001.csv' and 'out0002.csv').
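Deriving the output name from the input list name could be as simple as this sketch (the naming convention is an assumption, not what the script currently does):

    from pathlib import Path

    def output_name(txt_path):
        # 'in0001.txt' -> 'out0001.csv' (hypothetical naming scheme)
        stem = Path(txt_path).stem
        return stem.replace("in", "out", 1) + ".csv"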

Note: for this to work, issue #9 ("Hard drive partition running out of space") must be solved first.