inbo / reporting-rshiny-grofwildjacht

Rshiny app for grofwildjacht
https://grofwildjacht.inbo.be/
MIT License

Clean files on S3 buckets #388

Closed mvarewyck closed 1 year ago

mvarewyck commented 1 year ago

I checked the files on the S3 bucket inbo-wbe-uat-data on February 20th, 2023.

(1) Some files were copied from the inst/extdata folder but are currently not used by the app or have become redundant. To keep the buckets clean, I suggest we remove them (temporarily).

(2) The file rshiny_reporting_data_ecology.csv takes about 15 seconds to load locally. This is due to its size, but also because we do some data processing in loadRawData(). Ideally, we create a processed file (like we do for waarnemingen_2022.csv) to speed this up. @SanderDevisscher Should we implement a cleaning function on our side, and can you incorporate it in the script on your side?
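
A minimal sketch of the idea (hypothetical: data.table and the _processed filename are assumptions, and the cleaning step stands in for whatever loadRawData() currently does):

library(data.table)

# One-off preprocessing: pay the cost once, at upload time
eco <- fread("rshiny_reporting_data_ecology.csv")   # slow: large raw file
# ... apply the cleaning currently done in loadRawData() here ...
fwrite(eco, "rshiny_reporting_data_ecology_processed.csv")

# The app then only reads the processed file and skips the cleaning
ecoProcessed <- fread("rshiny_reporting_data_ecology_processed.csv")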

SanderDevisscher commented 1 year ago

> I checked the files on the S3 bucket inbo-wbe-uat-data on February 20th, 2023.
>
> (1) Some files were copied from the inst/extdata folder but are currently not used by the app or have become redundant. To keep the buckets clean, I suggest we remove them (temporarily).

  • "FaunabeheerDeelzones_0000_2018_habitats.csv" -> currently not used
  • "FaunabeheerDeelzones_2019_9999_habitats.csv" -> currently not used
  • "Toekenningen_ree.csv" -> not needed(?) we use 'Verwezenlijkt_categorie_per_afschotplan.csv'
  • "fbz_gemeentes_habitats.csv" -> currently not used
  • "waarnemingen_2022.csv" -> no longer needed. we use 'waarnemingen_wild_zwijn_processed.csv'

With the exception of "fbz_gemeentes_habitats.csv", I can't imagine needing these files in the future.

> (2) The file rshiny_reporting_data_ecology.csv takes about 15 seconds to load locally. This is due to its size, but also because we do some data processing in loadRawData(). Ideally, we create a processed file (like we do for waarnemingen_2022.csv) to speed this up. @SanderDevisscher Should we implement a cleaning function on our side, and can you incorporate it in the script on your side?

I think it is a good idea to move most, if not all, preprocessing of the data to the backoffice. Following this logic, it would be nice to have a function that processes the ecology (and geography?) data before putting it on the bucket. If it's provided, I'll incorporate it into our upload data script.
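
A hedged sketch of how that could look in the upload script (processEcoData() is a hypothetical placeholder for the function to be provided; put_object() is from the aws.s3 package already used for this bucket):

library(aws.s3)

# Hypothetical cleaning step, to be provided by the app side
ecoProcessed <- processEcoData("rshiny_reporting_data_ecology.csv")

# Write the cleaned table to a temporary file and upload it to the UAT bucket
tmpFile <- tempfile(fileext = ".csv")
write.csv(ecoProcessed, tmpFile, row.names = FALSE)
put_object(file = tmpFile, object = "rshiny_reporting_data_ecology_processed.csv",
  bucket = "inbo-wbe-uat-data")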

mvarewyck commented 1 year ago

I would prefer the following approach for migrating from the unprocessed to the processed files on S3.

SanderDevisscher commented 1 year ago

@mvarewyck I accidentally uploaded "FaunabeheerDeelzones_0000_2018_habitats.csv", "FaunabeheerDeelzones_2019_9999_habitats.csv" & "Toekenningen_ree.csv" to the UAT bucket. Can you remove them?

mvarewyck commented 1 year ago

> @mvarewyck I accidentally uploaded "FaunabeheerDeelzones_0000_2018_habitats.csv", "FaunabeheerDeelzones_2019_9999_habitats.csv" & "Toekenningen_ree.csv" to the UAT bucket. Can you remove them?

All done. E.g.:

aws.s3::delete_object(object = "Toekenningen_ree.csv", bucket = "inbo-wbe-uat-data")
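
The same works for all three files in one loop (a small sketch; the filenames and the aws.s3 call are taken from this thread):

library(aws.s3)

# Remove all three accidentally uploaded files from the UAT bucket
accidental <- c("FaunabeheerDeelzones_0000_2018_habitats.csv",
  "FaunabeheerDeelzones_2019_9999_habitats.csv",
  "Toekenningen_ree.csv")
for (f in accidental) {
  delete_object(object = f, bucket = "inbo-wbe-uat-data")
}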

mvarewyck commented 1 year ago

Creating the preprocessed data can be done by calling this function iteratively.

@SanderDevisscher To be included in the INBO script:

for (iType in c("eco", "geo", "wildschade", "kbo_wbe", "waarnemingen")) {
  createRawData(dataDir = "~/git/reporting-rshiny-grofwildjacht/dataS3",
    bucket = "inbo-wbe-uat-data", type = iType)
}

Notes:

mvarewyck commented 1 year ago

When processing the files, I normally append _processed to the filename. However, for waarnemingen this is confusing, as the input file is already called waarnemingen_wild_zwijn_processed.csv, so the output file gets the same name.

@SanderDevisscher (1) Can the input file for waarnemingen be named without the _processed suffix? Or (2) should I change the suffix for all cleaned files to something like _clean?

SanderDevisscher commented 1 year ago

For the sake of consistency, I would say we go with option 1 and drop the _processed suffix.

mvarewyck commented 1 year ago

I accidentally created an incorrect file on S3, but I cannot delete it. This used to work.

@SanderDevisscher Can you delete this file? If it works for you, I will ask Bert to adjust my permissions.

aws.s3::delete_object(object = "waarnemingen_wild_zwijn_processedcsv", bucket = "inbo-wbe-uat-data")
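
To verify the result afterwards, something like this should work (object_exists() is another helper from the same aws.s3 package; this check is a suggestion, not part of the original exchange):

library(aws.s3)

# Returns FALSE once the stray file has been deleted
object_exists(object = "waarnemingen_wild_zwijn_processedcsv", bucket = "inbo-wbe-uat-data")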

SanderDevisscher commented 1 year ago

done

mvarewyck commented 1 year ago

  • expected filenames (as input) per type:

    eco = "rshiny_reporting_data_ecology.csv",
    geo = "rshiny_reporting_data_geography.csv",
    wildschade = "WildSchade_georef.csv",
    kbo_wbe = "Data_Partij_Cleaned.csv",
    waarnemingen = "waarnemingen_wild_zwijn.csv"
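
The same mapping as an R named vector, in case it helps to wire it into the createRawData() loop above (a sketch; only the type names and filenames come from this thread):

# Expected input filename per data type
inputFiles <- c(
  eco = "rshiny_reporting_data_ecology.csv",
  geo = "rshiny_reporting_data_geography.csv",
  wildschade = "WildSchade_georef.csv",
  kbo_wbe = "Data_Partij_Cleaned.csv",
  waarnemingen = "waarnemingen_wild_zwijn.csv"
)
inputFiles[["eco"]]   # "rshiny_reporting_data_ecology.csv"
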
SanderDevisscher commented 1 year ago

@mvarewyck I could not find the function readShapeData() after using

devtools::install_github("inbo/reporting-rshiny-grofwildjacht@318-dashboard-figuren-code",
  subdir = "reporting-grofwild", force = TRUE)

Rebase?

mvarewyck commented 1 year ago

> @mvarewyck I could not find the function readShapeData() after using

@SanderDevisscher readShapeData() is now called createShapeData(), in line with the other create* functions.
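
To confirm the rename after reinstalling (getAnywhere() and exists() are base R; this check is a suggestion, not from the original thread):

# Locate the renamed function across the installed packages
utils::getAnywhere("createShapeData")

# The old name should no longer be found
exists("readShapeData")   # FALSE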

SanderDevisscher commented 1 year ago

I've implemented the logic needed to preprocess the eco, geo, wildschade, kbo_wbe and waarnemingen files. I'm waiting to test the changes in the UAT environment, but it is currently down (504 gateway timeout).
In Docker, however, the data checks pass without a flaw.