Aggregating raw data - GitHub issues

PieterjanVerhelst commented 8 years ago

I downloaded the Raw folder from the Drive to check whether all my files are in the Raw folder (simple copy-paste and order by date). However, I have 476 csv files. These are mainly files that are part of the historical data file (the latter is a csv output from the Vemco VUE software program, since I had more vrl files than csv files for historical data). Since it is a lot of work to check which csv files are already present in the historical data and which are not, would it be possible to remove duplicate records in the aggregated file? Something like 'if date, time, receiver ID and tag ID are equal for 2 or more records, remove the first'.
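A minimal sketch of the rule quoted above, using only the Python standard library. The column names (`date`, `time`, `receiver`, `tag`) and the sample rows are made up for illustration and are not taken from the real csv header or from the aggregation script. Because the records are identical on the key, keeping the first occurrence gives the same result as removing it and keeping a later one.

```python
def drop_duplicate_detections(rows):
    """Keep one record per (date, time, receiver, tag) key.

    The column names are hypothetical; adapt them to the real csv header.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["date"], row["time"], row["receiver"], row["tag"])
        if key in seen:
            continue  # an identical detection was already kept
        seen.add(key)
        unique.append(row)
    return unique


# Made-up sample rows: the first two are identical on the key.
rows = [
    {"date": "2015-08-24", "time": "10:15", "receiver": "VR2W-126197", "tag": "tag-1"},
    {"date": "2015-08-24", "time": "10:15", "receiver": "VR2W-126197", "tag": "tag-1"},
    {"date": "2015-08-24", "time": "10:16", "receiver": "VR2W-126197", "tag": "tag-1"},
]
deduplicated = drop_duplicate_detections(rows)
```

Rows read with `csv.DictReader` could be passed in directly, as long as the key columns are renamed to match.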

peterdesmet commented 8 years ago

Removing them at the aggregation step (in @bartaelterman's script) is one option. Another would be for me to generate a summary file from the historical data, with:

receiver, date of first detection, date of last detection, number of detections.

Would that make it easier to assess which files are present in the historical data? If so, do you want to keep the "original" file or those detections in the historical data file?
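The summary proposed above (receiver, date of first detection, date of last detection, number of detections) could be computed roughly as follows. This is only a sketch: the column names `receiver` and `datetime` and the sample rows are assumptions, and timestamps are assumed to be ISO 8601 strings so that plain string comparison orders them chronologically.

```python
def summarize_by_receiver(rows):
    """Return {receiver: {"first": ..., "last": ..., "count": ...}}.

    Assumes hypothetical columns "receiver" and "datetime" (ISO 8601 text).
    """
    summary = {}
    for row in rows:
        receiver = row["receiver"]
        stamp = row["datetime"]
        if receiver not in summary:
            summary[receiver] = {"first": stamp, "last": stamp, "count": 0}
        entry = summary[receiver]
        entry["first"] = min(entry["first"], stamp)
        entry["last"] = max(entry["last"], stamp)
        entry["count"] += 1
    return summary


# Made-up sample rows for illustration only.
historical = [
    {"receiver": "VR2W-126197", "datetime": "2014-06-19 08:00"},
    {"receiver": "VR2W-126197", "datetime": "2014-03-02 14:30"},
    {"receiver": "VR2W-122322", "datetime": "2014-05-11 09:45"},
]
summary = summarize_by_receiver(historical)
```

Comparing the first and last dates per receiver against the date range encoded in each csv file name would then show which files are already covered.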

bartaelterman commented 8 years ago

Duplicate rows can be removed. However, they are not always caused by a duplicated file. There are examples where a transmitter is detected twice by the same receiver at the exact same time, so in that case the record is already duplicated within one file. I would have to search the data to find the example again, but this is something I flagged to Ans before. Ans mentioned that in certain situations the number of detections during a certain time interval is calculated, and that removing those detections would affect that calculation. We never really agreed on how to treat those records though.

PieterjanVerhelst commented 8 years ago

Bart, could you send me the example if you find it? I think he meant the file resulting from the residency search in VUE, not the raw data?

bartaelterman commented 8 years ago

For example in file VR2W_126197_20150824_1.csv, lines 345, 346 and 347 are identical.

bartaelterman commented 8 years ago

Note that these records have no seconds, so it is very well possible that the fish was detected three times. But still, the record is duplicated.

PieterjanVerhelst commented 8 years ago

I was not aware of this. In that case, we had better check the files at the import level. @peterdesmet I'll try to figure out which files were uploaded for the historical data by looking at the latest detection date of the historical data, so you don't need to make a summary.

robinvliz commented 8 years ago

One note: I have also come across examples of this, where the seconds aren't filled in. I would leave these as is, even though this is duplicate data. We are pretty sure that the fish has been seen 3 times within that minute (if the tag ping delays also fit in this time window). I would still allow this to be uploaded to the database, so you can filter it during the analysis (or not).
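One way to keep such records while still making them easy to filter later is to annotate each row with how many identical records exist, instead of deleting anything. The sketch below assumes hypothetical column names and made-up sample rows; the `n_identical` field is an illustrative name, not an existing database column.

```python
from collections import Counter


def flag_repeated_detections(rows):
    """Return copies of the rows with an added "n_identical" count.

    Nothing is removed; the analysis can decide whether to filter.
    Column names are hypothetical.
    """
    def key(row):
        return (row["datetime"], row["receiver"], row["tag"])

    counts = Counter(key(row) for row in rows)
    return [dict(row, n_identical=counts[key(row)]) for row in rows]


# Made-up rows: three identical minute-resolution records plus one other.
detections = [
    {"datetime": "2015-08-24 10:15", "receiver": "VR2W-126197", "tag": "tag-1"},
    {"datetime": "2015-08-24 10:15", "receiver": "VR2W-126197", "tag": "tag-1"},
    {"datetime": "2015-08-24 10:15", "receiver": "VR2W-126197", "tag": "tag-1"},
    {"datetime": "2015-08-24 10:20", "receiver": "VR2W-126197", "tag": "tag-1"},
]
flagged = flag_repeated_detections(detections)
```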

PieterjanVerhelst commented 8 years ago

I organised the detection files as follows: 1) 2014 and later: historical data (which also has detections in 2015) + 3 more 2014-files 2) 2015 and later: all detection files from Raw and Verified from 2015.

I made a copy of the files, so the originals are still in Verified and Raw, but they can be deleted if you wish.

@IPauwels can you check with Karen whether all files were uploaded to the Drive? Only 3 files from 2014 seems rather low. The historical data is based on data on my pc (marine, estuarine and Boekhoute; so no river lamprey or Demer).

PieterjanVerhelst commented 8 years ago

I checked the double detections: they occur because some tags emit multiple signals within 1 minute (min delay 15s - max delay 30s). In VUE these are shown as separate detections, distinguished at the seconds level. However, this is not the case in the csv files.

PieterjanVerhelst commented 8 years ago

UPDATE: I added the last INBO detection files to the 2014 and 2015 folders. Ine uploaded the river lamprey files to the River lamprey Ine 11 dec 2015 folder. I checked and completed the deployments. However, some fields are still missing. For example:

BPNS VR2W-450111 FALSE Boei bij Reefballs C-Power bpns-RBCPOWER

For this receiver it is not known when the receiver was removed, so the 'removed_at' field is left blank. Is it possible to fill in the date of the last detection of that receiver at that location with a script? Also, the latitude and longitude for the river lamprey metadata are missing, and it will take some time to figure them out. Would it be possible to start running the script without these data, so we can deliver the files to VLIZ this month?
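A rough sketch of how a blank 'removed_at' could be filled from the detections. Only the receiver id, the station name and the 'removed_at' field name come from this thread; the other column names and all dates are invented for illustration. Note that the last detection only shows the receiver was still in place at that time, so the value is a lower bound rather than the true removal date.

```python
def fill_removed_at(deployments, detections):
    """Fill empty "removed_at" with the last detection per receiver and station.

    Timestamps are assumed to be ISO 8601 strings, so string comparison
    orders them chronologically. Column names are hypothetical.
    """
    last_seen = {}
    for det in detections:
        key = (det["receiver"], det["station"])
        if key not in last_seen or det["datetime"] > last_seen[key]:
            last_seen[key] = det["datetime"]

    filled = []
    for dep in deployments:
        dep = dict(dep)  # copy, so the input is left untouched
        if not dep["removed_at"]:
            key = (dep["receiver"], dep["station"])
            dep["removed_at"] = last_seen.get(key, "")
        filled.append(dep)
    return filled


# Receiver and station are from the example above; the dates are made up.
deployments = [
    {"receiver": "VR2W-450111", "station": "bpns-RBCPOWER", "removed_at": ""},
]
detections = [
    {"receiver": "VR2W-450111", "station": "bpns-RBCPOWER", "datetime": "2015-03-01 12:00"},
    {"receiver": "VR2W-450111", "station": "bpns-RBCPOWER", "datetime": "2015-06-17 04:30"},
]
completed = fill_removed_at(deployments, detections)
```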

IPauwels commented 8 years ago

Just to inform: currently there are no detection files from INBO projects which are not uploaded on the drive. The first coming detection files will be those of the receivers in Albertkanaal.

PieterjanVerhelst commented 8 years ago

@IPauwels ok, can you inform Karen to put them in the '2015 and after' folder?

IPauwels commented 8 years ago

I will inform Karen. From now on and till the database is operating, we will upload new detection files in '2015 and after' on the drive.

bartaelterman commented 8 years ago

What are all the files that are in the "Raw Data" folder itself? (so not in "2014 and before" or "2015 and after") Do they have to be moved to one of these folders?

PieterjanVerhelst commented 8 years ago

I copied the files to the 2014 and 2015 folders instead of cut & paste (just in case). But they can be removed.

peterdesmet commented 8 years ago

Removed them all.

bartaelterman commented 8 years ago

My script is complaining about a missing station name for VR2W-122322 in the file VR2W_122322_20140619_1.csv. This file does not contain a correct station name, nor an old station name (in this case, the new station name should be de-17 and the old station name should be SO_SB_LOZB)

I'll add the receiver id to this line in station_names. This means I have to update the station name mapping a bit.

Previously, the script looked at the station_names mapping file and filled in empty old_name fields with the receiver_id. Next, it replaced old station names in the detections file using the old station names column in the station names mapping file. However, if the station name in the detection file was empty, the receiver name (or S/N) was used.

Now the mapping works slightly differently because, in this particular case, both the old_name and the receiver_id are known (so both fields are filled in in station_mapping, whereas previously either old_name or receiver_id was filled in). The script now tries to map on the old_name first, and on the receiver_id if no match was found using old_name.
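The lookup order described above (old_name first, receiver_id as fallback) can be sketched like this. It is not the actual script: the function name, the `new_name` and `station_name` fields and the dictionary layout are assumptions, and only the example values come from this thread.

```python
def map_station_name(detection, station_mapping):
    """Return the new station name for a detection, or None if unmapped.

    Tries the old station name first and falls back to the receiver id.
    Field names other than old_name and receiver_id are hypothetical.
    """
    by_old_name = {
        m["old_name"]: m["new_name"] for m in station_mapping if m["old_name"]
    }
    by_receiver = {
        m["receiver_id"]: m["new_name"] for m in station_mapping if m["receiver_id"]
    }
    old_name = detection.get("station_name", "")
    if old_name in by_old_name:
        return by_old_name[old_name]
    return by_receiver.get(detection["receiver_id"])


station_mapping = [
    {"old_name": "SO_SB_LOZB", "receiver_id": "VR2W-122322", "new_name": "de-17"},
]
# A detection without any station name falls back to the receiver id.
unnamed = {"station_name": "", "receiver_id": "VR2W-122322"}
named = {"station_name": "SO_SB_LOZB", "receiver_id": "VR2W-122322"}
unknown = {"station_name": "", "receiver_id": "VR2W-000000"}
results = [map_station_name(d, station_mapping) for d in (unnamed, named, unknown)]
```

In this sketch, if a receiver appears in more than one mapping row, the last row silently wins in the receiver lookup, so receivers that were deployed at several stations would need extra handling.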

bartaelterman commented 8 years ago

Whoops, I'll have to take that back. There are two records in our Fish tracking receivers file for this receiver: stations de-17 and s-2a. Based on the deployment dates I assume it is de-17, since I'm only using data from 2014 or before.

PieterjanVerhelst commented 8 years ago

19 files did not have a station name and could not be mapped, since the receivers could have 2 possible station names in the receiver metadata on the Drive. I added the correct station name manually in a copied file and placed the original file in the folder 'Files without station name' under the Raw folder.

bartaelterman commented 8 years ago

Yes. However, for the 19 receivers that caused the problem, there were also other files with this issue. We mapped everything to their "demer" station names, because all of them were deployed there during 2014 and removed only in 2015. However, this means that the station mapping file will not be valid for data from 2015.

inbo / fish-tracking

Aggregating raw data #49