Open peterdesmet opened 8 years ago
file | old name | new name |
---|---|---|
VR2C69_450114_20150728_1.csv | B05 Belwind | bpns-B05BELWIND |
VR2W_110779_20150626_1.csv | VG-2 | bpns-VG2 |
VR2W_110783_20150626_1.csv | S4 | bpns-S4 |
VR2W_110784_20150626_1.csv | WK12 | bpns-WK12 |
VR2W_112295_20150615_1.csv | s-8-1 | NOT FOUND |
VR2W_113521_20150615_1.csv | s-4-1 | NOT FOUND |
VR2W_113528_20150615_1.csv | s-9-1 | NOT FOUND |
VR2W_115428_20150625_1.csv | O 6 | NOT FOUND |
VR2W_115428_20150625_2.csv | O 6 | NOT FOUND |
VR2W_115430_20150615_1.csv | s-5-1 | NOT FOUND |
VR2W_115441_20150615_1.csv | s-6-1 | NOT FOUND |
VR2W_115442_20150615_1.csv | s-7-1 | NOT FOUND |
VR2W_119047_20150609_1.csv | ma-8 | |
VR2W_119048_20150609_1.csv | ma-6 | |
VR2W_119049_20150609_1.csv | ma-9 | |
VR2W_119052_20150609_1.csv | ma-7 | |
VR2W_119056_20150609_1.csv | ma-5 | |
VR2W_119057_20150609_1.csv | ma-1 | |
VR2W_120092_20150609_1.csv | 120092 | ma-4 |
VR2W_120092_20150609_2.csv | 120092 | ma-4 |
VR2W_120092_20150609_3.csv | 120092 | ma-4 |
VR2W_120095_20150609_1.csv | 120095 | ma-2 |
VR2W_120873_20150901_1.csv | ws-pvtss | ws-PVTSS |
VR2W_122325_20150615_1.csv | S-3-1 | NOT FOUND |
VR2W_122339_20150616_1.csv | s-4c-1 | NOT FOUND |
VR2W_122363_20150615_1.csv | S-4A-1 | NOT FOUND |
VR2W_122367_20150616_1.csv | s-4b-1 | NOT FOUND |
VR2W_123823_20150626_1.csv | WK14 | bpns-WK14 |
VR2W_123824_20150626_1.csv | W1 | bpns-W1 |
VR2W_123826_20150626_1.csv | WZ | bpns-WZ |
VR2W_123829_20150626_1.csv | S7 | bpns-S7 |
VR2W_126194_20150520_1.csv | 126194 | ak-42 |
VR2W_126194_20150824_1.csv | 126194 | ak-42 |
VR2W_126195_20150520_1.csv | 126195 | ak-41 |
VR2W_126196_20150824_1.csv | 126196 | ak-44 |
VR2W_126197_20150824_1.csv | 126197 | ak-45 |
Files with NOT FOUND = couldn't find station name directly. Will look those up with the receiver code.
I noticed the NOT FOUND entries are dependent on date, @PieterjanVerhelst, so maybe it is better if you map those.
I filled in the new names (last column):

file | old name | new name |
---|---|---|
VR2W_112295_20150615_1.csv | s-8-1 | s-8 |
VR2W_113521_20150615_1.csv | s-4-1 | s-4 |
VR2W_113528_20150615_1.csv | s-9-1 | s-9 |
VR2W_115428_20150625_1.csv | O 6 | bpns-OH6 |
VR2W_115428_20150625_2.csv | O 6 | bpns-OH6 |
VR2W_115430_20150615_1.csv | s-5-1 | s-5 |
VR2W_115441_20150615_1.csv | s-6-1 | s-6 |
VR2W_115442_20150615_1.csv | s-7-1 | s-7 |
VR2W_122325_20150615_1.csv | S-3-1 | s-3 |
VR2W_122339_20150616_1.csv | s-4c-1 | s-4c |
VR2W_122363_20150615_1.csv | S-4A-1 | s-4a |
VR2W_122367_20150616_1.csv | s-4b-1 | s-4b |
I have some code to read raw input files. It detects the format based on the headers, so it knows in which columns the values need to be changed. I also have a command line script that does the aggregation. So I would suggest updating these scripts to have them change the station names too and write the resulting files to the verified folder. I'll leave the files in the Raw folder, so after manual validation, you can remove them.
`ma-8`, `ma-6`, `ma-9` etc. What is the meaning of those?

@bartaelterman: let me know when the raw files have been verified. Afterwards I will check them in the raw folder and delete them. Regarding the Meuse data (ma-8, ma-6 etc.), probably no station name was given to the receiver when deployed (I don't have that data on my PC, but I'll check on Monday at INBO). Is it possible to add the new station name based on receiver ID for those files?
Yes @PieterjanVerhelst, I can add the new station name based on receiver id. The receiver id is:
take the file name
split on "_"
take the first two fields
join them with a "-"
?
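The steps above can be sketched in Python as follows. This is a minimal sketch, assuming the raw file names always follow the `<model>_<serial>_<date>_<n>.csv` pattern seen in the table at the top of this issue; the function name `receiver_id_from_filename` is hypothetical and not taken from the actual scripts.

```python
def receiver_id_from_filename(file_name: str) -> str:
    """Derive a receiver id such as 'VR2W-119047' from a raw file name."""
    fields = file_name.split("_")      # split on "_"
    if len(fields) < 2:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return "-".join(fields[:2])        # join the first two fields with "-"


print(receiver_id_from_filename("VR2W_119047_20150609_1.csv"))  # VR2W-119047
```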
There are some faults in the metadata about the Meuse receivers (ma-x); I am trying to correct them by the end of the week. I am reconsidering my recommendation to add a new station name based on receiver id. As the station_name reflects a location, I think it would be better to add the new station name based on the coordinates (some coordinates in the metadata are also wrong; as soon as they are corrected, I'll post them here).
Here are the receiver_id's with the matching station names. Some receivers were removed and changed by another receiver for the same location and got deploy number '2'.

receiver_id | is_active | station_name | deploy_number |
---|---|---|---|
VR2W-119047 | FALSE | ma-8 | 1 |
VR2W-119048 | FALSE | ma-6 | 1 |
VR2W-119049 | FALSE | ma-9 | 1 |
VR2W-119052 | FALSE | ma-7 | 1 |
VR2W-119056 | FALSE | ma-5 | 1 |
VR2W-119057 | FALSE | ma-1 | 1 |
VR2W-120092 | FALSE | ma-4 | 1 |
VR2W-120095 | FALSE | ma-2 | 1 |
VR2W-122324 | FALSE | ma-3 | 1 |
VR2W-124065 | TRUE | ma-5 | 2 |
VR2W-124066 | TRUE | ma-4 | 2 |
VR2W-124076 | TRUE | ma-2 | 2 |
VR2W-124078 | TRUE | ma-1 | 2 |
VR2W-122324 | TRUE | ma-3 | 2 |
> Some receivers were removed and changed by another receiver for the same location and got deploy number '2'

Did the opposite happen too? Was a receiver redeployed at a different location?
It was removed and deployed again, but still at the same location (ma-3)
Ok. So the station code stays the same then.
I'll write a script to substitute the station names and will send you a file this afternoon.
I am wondering...
Replacing the old station names by new ones is ok. After some iterations, no old station names will be found in the input anymore, and this step will eventually become obsolete.
Setting the station name based on the receiver id works differently since the receiver id always stays the same. So this action will always remain active. If at some point in the future we do move one receiver from one station to another, we are in trouble. Are we absolutely sure this is how we want to process the raw data?
In the future, the correct station name will come with the csv file, so this step will be unnecessary. I would not set the station name based on the receiver ID for the above mentioned reason: receivers will be relocated in the future (old projects end, new ones arise). The only exception is the Meuse receivers (see above): no station name was given to those receivers, so the only info we have in the csv file is the receiver ID.
That's exactly my point. I cannot implement this exceptional case in a script. Setting the station name of these receivers will need to be done manually.
I will change this in the csv files and drop them in verified folder. Afterwards, I will delete them from the Raw folder.
The station names for the Meuse receivers were added and the files are in the Raw folder (as well as the original files). If ok for you, I will delete the old files.
Where are we with this step? Anything I need to do?
I also noticed duplicates in the raw folder: the csv file and a Google spreadsheet of the same csv file. Can the Google spreadsheets be removed?
Indeed, the Google spreadsheets can be removed.
Now removed.
Almost there.
I have a couple of files that don't contain a station name, only a receiver id. For at least the following ids, I would need a new station name.
VR2W-119047
VR2W-119048
VR2W-119049
VR2W-119052
VR2W-119056
VR2W-119057
These can be added to the station names file in the `receiver_id` and `new_name` columns. @peterdesmet can you add these?
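A lookup along these lines could be sketched as follows. This is only an illustration: `receiver_id` and `new_name` come from the comment above, but the `old_name` column and both function names are assumptions, and the actual layout of station_names.csv may differ.

```python
import csv
import io


def load_station_names(csv_file):
    """Build two lookups from the mapping file: by old name and by receiver id."""
    by_old_name, by_receiver_id = {}, {}
    for row in csv.DictReader(csv_file):
        new_name = (row.get("new_name") or "").strip()
        old_name = (row.get("old_name") or "").strip()        # assumed column
        receiver_id = (row.get("receiver_id") or "").strip()
        if old_name:
            by_old_name[old_name] = new_name
        if receiver_id:
            by_receiver_id[receiver_id] = new_name
    return by_old_name, by_receiver_id


def new_station_name(old_name, receiver_id, by_old_name, by_receiver_id):
    """Prefer the old station name; fall back to the receiver id."""
    if old_name in by_old_name:
        return by_old_name[old_name]
    if receiver_id in by_receiver_id:
        return by_receiver_id[receiver_id]
    raise KeyError(f"no station name for {old_name!r} / {receiver_id!r}")


# Example mapping using values from this thread
example = io.StringIO(
    "old_name,receiver_id,new_name\n"
    "WK12,,bpns-WK12\n"
    ",VR2W-119047,ma-8\n"
)
by_old, by_rid = load_station_names(example)
print(new_station_name("WK12", "VR2W-110784", by_old, by_rid))  # bpns-WK12
print(new_station_name("", "VR2W-119047", by_old, by_rid))      # ma-8
```

Raising an error for unmapped files, rather than guessing, keeps the files that still need a manual decision visible.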
Done. See commit above.
With that, I can validate all data in the Raw folder. How shall we go from here:

`2. Verified` folder still needed?

`2a. Verified+Consolidated` and will give the file a name including the current timestamp. Is that ok?

`3. Aggregated`?

`tmp` folder? The file name you propose seems OK to me.

`tmp`
If @PieterjanVerhelst agrees with the above, I would just have:
raw
tmp
@bartaelterman: do you have the new station names for the above mentioned receivers, given that these are in the file station_names.csv?
@PieterjanVerhelst Yep, I have the station names for the above mentioned receivers.
`text-textOrNumbers` (or in regular expression syntax: `^[a-zA-Z]+-[0-9a-zA-Z]+$`). If something does not match those criteria, an error is raised. I can add in more checks if you like, if that could confidently get us to a situation where we don't have to touch files manually.

`1. Raw`, `2. Consolidated`, `3. Aggregated`. (I'm dropping `Verified`, since everything happens automatically, so there is no manual verification anymore in the process.)

Agree with @bartaelterman: let's try to cover everything with a script if we can.
Also agree to drop `Verified`, as it is the output of raw + script, so it can always be done again.
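The station name check mentioned above can be sketched in Python as follows. The regular expression is the one quoted in this thread; the function name `check_station_name` and the choice of `ValueError` are assumptions, not the actual script.

```python
import re

# Pattern quoted in the thread: text, a dash, then text or numbers
STATION_NAME_PATTERN = re.compile(r"^[a-zA-Z]+-[0-9a-zA-Z]+$")


def check_station_name(name: str) -> str:
    """Return the name unchanged if it is valid, otherwise raise an error."""
    if STATION_NAME_PATTERN.fullmatch(name) is None:
        raise ValueError(f"invalid station name: {name!r}")
    return name


# New-style names pass; old-style names from the table are rejected
for candidate in ["bpns-WK12", "s-4c", "s-8-1", "O 6", "120092"]:
    try:
        check_station_name(candidate)
        print(candidate, "ok")
    except ValueError as error:
        print(error)
```

With this pattern, old names such as `s-8-1` (second dash), `O 6` (space) and `120092` (no prefix) all fail, so files that were not renamed cannot slip through unnoticed.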
So @PieterjanVerhelst can you add the mapping of the missing receivers to station_names.csv?
To clarify, those that you mentioned in:
> Still some receivers do not have the correct station name applied to the receiver (those from the Meuse and Albert Channel).
`2. Consolidated` folder in the VLIZ database? Then they only have 1 format to worry about. All dates will be in the same format and all data is checked.

Consolidated would contain the separate csv files? In that case ok. I thought the concatenated file without aggregation would be dropped in that folder.
Consolidated would contain 1 file with all records in 1 format. Not aggregated, but not separate files.
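As a rough illustration of what writing one consolidated file could look like, here is a minimal sketch. It assumes the input files have already been brought to the same header (the format detection itself is not shown), and the function name, the `consolidated_` prefix and the timestamp format are all hypothetical.

```python
import csv
from datetime import datetime
from pathlib import Path


def consolidate(input_files, output_dir, now=None):
    """Concatenate csv files that share a header into one timestamped file."""
    now = now or datetime.now()
    output_path = Path(output_dir) / f"consolidated_{now:%Y%m%dT%H%M%S}.csv"
    writer = None
    with open(output_path, "w", newline="") as out:
        for input_file in input_files:
            with open(input_file, newline="") as src:
                reader = csv.DictReader(src)
                if writer is None:
                    # The first file defines the header of the output
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                elif reader.fieldnames != writer.fieldnames:
                    raise ValueError(f"{input_file} has a different header")
                writer.writerows(reader)  # records are copied, not aggregated
    return output_path
```

Because the output is derived only from the raw files, it can be regenerated at any time, which is the reason given above for dropping the Verified folder.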
Why is it important to have separate files?
I don't know. I will check this with VLIZ.
It would be important to have the separate verified files to check whether data is missing from the consolidated file. I just spoke to Robin, who is building the VLIZ database. This database is ready, so if the files in the Raw folder could be transformed into verified files, they can be dropped in the database as a test. Regarding the data flow, the final system would work as follows: 'raw' csv files are dropped on an online interface. Each file is then processed into a 'verified' file (by coupling data with metadata; no script needed), which is dropped in the database. As such, no consolidated file is needed. I think we should have a second meeting with Robin after he has tested the database with the verified files from the Drive.
`StationName` with the correct values

@bartaelterman, what would be the best approach to do step one with a script?