Add new station names to raw data files

peterdesmet commented 8 years ago

For all raw data files, populate StationName with the correct values
After manual verification, move the files to the verified folder.

@bartaelterman, what would be the best approach to do step one with a script?

peterdesmet commented 8 years ago

file	old name	new name
VR2C69_450114_20150728_1.csv	B05 Belwind	bpns-B05BELWIND
VR2W_110779_20150626_1.csv	VG-2	bpns-VG2
VR2W_110783_20150626_1.csv	S4	bpns-S4
VR2W_110784_20150626_1.csv	WK12	bpns-WK12
VR2W_112295_20150615_1.csv	s-8-1	NOT FOUND
VR2W_113521_20150615_1.csv	s-4-1	NOT FOUND
VR2W_113528_20150615_1.csv	s-9-1	NOT FOUND
VR2W_115428_20150625_1.csv	O 6	NOT FOUND
VR2W_115428_20150625_2.csv	O 6	NOT FOUND
VR2W_115430_20150615_1.csv	s-5-1	NOT FOUND
VR2W_115441_20150615_1.csv	s-6-1	NOT FOUND
VR2W_115442_20150615_1.csv	s-7-1	NOT FOUND
VR2W_119047_20150609_1.csv		ma-8
VR2W_119048_20150609_1.csv		ma-6
VR2W_119049_20150609_1.csv		ma-9
VR2W_119052_20150609_1.csv		ma-7
VR2W_119056_20150609_1.csv		ma-5
VR2W_119057_20150609_1.csv		ma-1
VR2W_120092_20150609_1.csv	120092	ma-4
VR2W_120092_20150609_2.csv	120092	ma-4
VR2W_120092_20150609_3.csv	120092	ma-4
VR2W_120095_20150609_1.csv	120095	ma-2
VR2W_120873_20150901_1.csv	ws-pvtss	ws-PVTSS
VR2W_122325_20150615_1.csv	S-3-1	NOT FOUND
VR2W_122339_20150616_1.csv	s-4c-1	NOT FOUND
VR2W_122363_20150615_1.csv	S-4A-1	NOT FOUND
VR2W_122367_20150616_1.csv	s-4b-1	NOT FOUND
VR2W_123823_20150626_1.csv	WK14	bpns-WK14
VR2W_123824_20150626_1.csv	W1	bpns-W1
VR2W_123826_20150626_1.csv	WZ	bpns-WZ
VR2W_123829_20150626_1.csv	S7	bpns-S7
VR2W_126194_20150520_1.csv	126194	ak-42
VR2W_126194_20150824_1.csv	126194	ak-42
VR2W_126195_20150520_1.csv	126195	ak-41
VR2W_126196_20150824_1.csv	126196	ak-44
VR2W_126197_20150824_1.csv	126197	ak-45

peterdesmet commented 8 years ago

Files with NOT FOUND = couldn't find station name directly. Will look those up with the receiver code.

peterdesmet commented 8 years ago

I noticed the NOT FOUND are dependant on date, @PieterjanVerhelst, maybe it is better if you map those.

PieterjanVerhelst commented 8 years ago

I filled in the new names (last column): VR2W_112295_20150615_1.csv s-8-1 s-8 VR2W_113521_20150615_1.csv s-4-1 s-4 VR2W_113528_20150615_1.csv s-9-1 s-9 VR2W_115428_20150625_1.csv O 6 bpns-OH6 VR2W_115428_20150625_2.csv O 6 bpns-OH6 VR2W_115430_20150615_1.csv s-5-1 s-5 VR2W_115441_20150615_1.csv s-6-1 s-6 VR2W_115442_20150615_1.csv s-7-1 s-7 VR2W_122325_20150615_1.csv S-3-1 s-3 VR2W_122339_20150616_1.csv s-4c-1 s-4c VR2W_122363_20150615_1.csv S-4A-1 s-4a VR2W_122367_20150616_1.csv s-4b-1 s-4b

bartaelterman commented 8 years ago

I have some code to read raw input files. It detects the format based on the headers, so it knows in which columns the values need to be changed. I also have a command line script that does the aggregation. So I would suggest to update these script and have them change the station names too and write these files to the verified folder. I'll leave the files in the Raw folder, so after manual validation, you can remove them.

bartaelterman commented 8 years ago

@peterdesmet what do you mean: they are dependant on the date?
I cannot map empty old names to new names ma-8, ma-6, ma-9 etc. What is the meaning of those?

PieterjanVerhelst commented 8 years ago

@bartaelterman : let me know when the raw files where verified. Afterwards I will check them in the raw folder and delete them. Considering the Meuse data (ma-8, ma-6 etc), probably no station name was given to the receiver when deployed (I don't have that data on my pc, but I'll check them monday at INBO). Is it possible to add the new station name based on receiver ID for those files?

bartaelterman commented 8 years ago

Yes @PieterjanVerhelst, I can add the new station name based on receiver id. The receiver id is:

take the file name
split on "_"
take the first two fields
join them with a "-"

?

PieterjanVerhelst commented 8 years ago

There are some faults in the metadata about the Meuse receivers (ma-x); I am trying to correct it by the end of the week. I reconsider my recommendation to add a new station name based on receiver id. As the station_name reflects a location, I think it would be better to add the new station name based on the coordinates (also here some wrong coordinates in the metadata; as soon as they are changed into the right ones, I'll post them here).

PieterjanVerhelst commented 8 years ago

Here are the receiver_id's with the matching station names. Some receivers were removed and changed by another receiver for the same location and got deploy number '2'.

receiver_id is_active station_name deploy_number VR2W-119047 FALSE ma-8 1 VR2W-119048 FALSE ma-6 1 VR2W-119049 FALSE ma-9 1 VR2W-119052 FALSE ma-7 1 VR2W-119056 FALSE ma-5 1 VR2W-119057 FALSE ma-1 1 VR2W-120092 FALSE ma-4 1 VR2W-120095 FALSE ma-2 1 VR2W-122324 FALSE ma-3 1 VR2W-124065 TRUE ma-5 2 VR2W-124066 TRUE ma-4 2 VR2W-124076 TRUE ma-2 2 VR2W-124078 TRUE ma-1 2 VR2W-122324 TRUE ma-3 2

bartaelterman commented 8 years ago

Some receivers were removed and changed by another receiever for the same location and got deploy number '2'

Did the opposite happen too? A receiver was redeployed on a different location?

PieterjanVerhelst commented 8 years ago

It was removed and deployed again, but still at the same location (ma-3)

bartaelterman commented 8 years ago

Ok. So the station code stays the same then.

I'll write a script to substitute the station names and will send you a file this afternoon.

bartaelterman commented 8 years ago

I am wondering...

Replacing the old station names by new ones is ok. After some iterations, no old station names will be found in the input anymore, and this step will eventually become obsolete.

Setting the station name based on the receiver id works differently since the receiver id always stays the same. So this action will always remain active. If at some point in the future we do move one receiver from one station to another, we are in trouble. Are we absolutely sure this is how we want to process the raw data?

PieterjanVerhelst commented 8 years ago

In the future, the correct station name will come with the csv file, so this step will be unnecessary. I would not set the station name based on the receiver ID for the above mentionned reason: receivers will be translocated in the future (old projects end, new ones arise). Only in the exceptional case of the Meuse receivers (see above), because no station name was given to the receivers, so the only info we have in the csv file is the receiver ID.

bartaelterman commented 8 years ago

That's exactly my point. I cannot implement this exceptional case in a script. Setting the station name of these receivers will need to be done manually.

PieterjanVerhelst commented 8 years ago

I will change this in the csv files and drop them in verified folder. Afterwards, I will delete them from the Raw folder.

PieterjanVerhelst commented 8 years ago

The station names for the Meuse receivers were added and the files are in the Raw folder (as well as the original files). If ok for you, I will delete the old files.

peterdesmet commented 8 years ago

Where are we with this step? Anything I need to do?

peterdesmet commented 8 years ago

I also noticed doubles in the raw folder: the csv file and a google spreadsheet of the same csv file. Can the google spreadsheets be removed?

PieterjanVerhelst commented 8 years ago

Indeed, the google spread sheets can be removed.

peterdesmet commented 8 years ago

Now removed.

bartaelterman commented 8 years ago

Almost there.

I have a couple of files that don't contain a station name, only a receiver id. For at least the following ids, I would need a new station name.

VR2W-119047
VR2W-119048
VR2W-119049
VR2W-119052
VR2W-119056
VR2W-119057

These can be added to the station names file in the receiver_id and new_name. @peterdesmet can you add these?

peterdesmet commented 8 years ago

Done. See commit above.

bartaelterman commented 8 years ago

With that, I can validate all data in the Raw folder. How shall we go from here:

Is the 2. Verified folder still needed?
@PieterjanVerhelst asked for a merged but not aggregated file. Shall we give these temporary dumps a permanent place? Where? I made a folder 2a. Verified+Consolidated and will give the file a name including the current timestamp. Is that ok?
Shall I run a new aggregation job and update the data in 3. Aggregated?

peterdesmet commented 8 years ago

No longer needed I think
I wouldn't give them a permanent place. Maybe a tmp folder? The file name you propose seems OK to me
Yep, you can update the aggregated data. I would also drop them in tmp

If @PieterjanVerhelst agrees with the above, I would just have:

raw
tmp

PieterjanVerhelst commented 8 years ago

@bartaelterman: you have got the new station names for the above mentionned receivers? As these are in the file station_names.csv ?

Still some receivers do not have the correct station name applied to the receiver (those from the Meuse and Albert Channel). Therefore, it would be better to keep the raw and verified folder.
Indeed, I would like to have a file with all detections. Maybe we can order maps in following way: 1) Raw 2) Verified 3) Consolidated 4) Aggregated I am thinking of doing the aggregation in a pre-analysis phase instead of post-processing (depending on the research question, aggregation can be perforrmed). So this map could be unnecessary in the future.
Ok

bartaelterman commented 8 years ago

@PieterjanVerhelst Jep, I have the station names for the above mentioned receivers.

@PieterjanVerhelst how would you suggest to do the mapping then? Would you manually edit certain files then? Because I would rather keep everything in the script. Then the cleaning happens all in one place and everything is reproducible. So can you exactly document how the mapping should be updated?
@peterdesmet you would not give these files a permanent place? So you disagree with @PieterjanVerhelst suggestion?
OK, we'll see where I drop them, depending on 2.

PieterjanVerhelst commented 8 years ago

This can be done manually, as it are only a few receivers. We could remove the 'Raw' folder, but the chance exists that a file without the correct station name and coordinates gets in the 'Verified' folder. However, if the script can give an error and indicates which file/receiver has an incorrect station name, the problem can be solved quickly.
I discussed this today with Jan Reubens and we both think that the aggregation would be a part of the pre-analysis of the data. However, until now we would like both consolidated and aggregated to find out what works best.

bartaelterman commented 8 years ago

If it can be done automatically (by the script) I'd rather do it that way. The script currently checks the station name. It needs to be filled in, and of the form text-textOrNumbers (or in regular expression syntax: ^[a-zA-Z]+-[0-9a-zA-Z]+$. If something does not match those criteria, an error is raised. I can add in more checks if you like, if that could confidently get us to a situation where we don't have to touch files manually.
Yes, it's no problem for me where the aggregation happens. Let's currently stick with the directories 1. Raw, 2. Consolidated, 3. Aggregated. (I'm dropping Verified, since everything happens automatically, so there is no manual verification more in the process)

peterdesmet commented 8 years ago

Agree with @bartaelterman: let's try to cover everything with a script if we can.

Also agree to drop Verified, as it is the output of raw + script, so it can always be done again.

bartaelterman commented 8 years ago

So @PieterjanVerhelst can you add the mapping of the missing receivers to station_names.csv?

peterdesmet commented 8 years ago

To clarify, those that you mentioned in:

Still some receivers do not have the correct station name applied to the receiver (those from the Meuse and Albert Channel).

PieterjanVerhelst commented 8 years ago

Agree, we can use the script. I added the receivers with the correct station name. So when the file will be uploaded, no station name nor coordinates come with the file; only the receiver ID. Therefore the mapping can only happen on receiver ID in these cases.
Considering the database of VLIZ, we would put the files located in the Verified folder in the VLIZ database. Maybe these can be moved to the raw folder, where all the separate csv files are present?

bartaelterman commented 8 years ago

Ok
Wouldn't it be more interesting to put the files coming from the 2. Consolidated folder in the VLIZ database? Then they only have 1 format to worry about. All dates will be in the same format and all data is checked.

PieterjanVerhelst commented 8 years ago

Consolidated would contain the separate csv files? In that case ok. I thought in that map the concatened file without aggregation would be dropped.

bartaelterman commented 8 years ago

Consolidated would contain 1 file with all records in 1 format. Not aggregated, but not separate files.

Why is it important to have separate files?

PieterjanVerhelst commented 8 years ago

I don't know. I will check this with VLIZ.

PieterjanVerhelst commented 8 years ago

It would be important to have the separate verified files to check the consolidated file if data is missing. I just spoke to Robin who is building the VLIZ database. This data base is ready, so if the files in the Raw folder could be transformed into verified, the files can be dropped in the database as a test. Considering dataflow, the final system would work as follows: 'raw' csv files are dropped on an online interface. Then, the file will be processed to a 'verified' file (by coupling data with metadata; no script needed), which will be dropped in the database. As such, no consolidated file is needed. I think we should have a second meeting with Robin after he tested the database with the verified files from the Drive.

inbo / fish-tracking

Add new station names to raw data files #43