pinskylab / OceanAdapt

Scripts and data for OceanAdapt website to visualize shifts in marine animal distributions
http://oceanadapt.rutgers.edu
MIT License

column classes in raw data affecting meta data #20

Open rBatt opened 9 years ago

rBatt commented 9 years ago

E.g., in the AI data set, the HAUL column holds integer values, but because of leading and trailing whitespace, those integers are interpreted as characters.

Similarly, a lot of these "integers" are really factors or characters. We should probably also address the fact that some of the units in the metadata don't have much meaning for columns like these.

@mpinsky To correct this we'd have to go into the code where we read the raw data and put it together (merge), and set the column classes there. But that's going to be hard to do until #1 is resolved.
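A minimal sketch of the cleanup described above, assuming a column that ended up as character because its values are padded with whitespace. The data frame `ai` and its values are invented for illustration; only the column name HAUL comes from this thread, and the real fix would live wherever the raw files are read and merged.

```r
# Hypothetical stand-in for the AI data; the values are invented.
ai <- data.frame(
  HAUL = c(" 12 ", "7  ", "  103"),
  stringsAsFactors = FALSE
)

class(ai$HAUL)  # "character": the values are padded strings

# Strip leading and trailing whitespace, then convert to integer.
ai$HAUL <- as.integer(trimws(ai$HAUL))

class(ai$HAUL)  # "integer"
```

Doing the conversion once, right after the merge, would keep the column classes consistent with whatever the metadata says about them.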

mpinsky commented 9 years ago

I think the key is that the metadata describes the data in the raw files accurately. We want a human to be able to read the metadata and understand what they're looking at in the files.

(Sent by email in reply to Ryan Batt's message of Tue, May 5, 2015 at 2:12 PM.)

rBatt commented 9 years ago

As it turns out, this is really important. I can't properly define the levels of a code (e.g., 4 = eastern time zone in gmex TIME_ZN) if the column is an integer instead of a factor. Similarly, I can't define the date format (e.g., 132 in gmex TIME_MIL is an HHMM format [or HMM??]) if it thinks the column is an integer.

Can't really solve this until #1
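To illustrate the TIME_MIL point, here is a rough sketch that assumes the column stores 24-hour clock times as integers, so that 132 means 01:32. That reading is an assumption (the HHMM vs. HMM question above is still open), and every value other than 132 is invented.

```r
# Assumption: TIME_MIL holds 24-hour clock times stored as integers.
# 132 comes from the comment above; the other values are invented.
time_mil <- c(132L, 905L, 1430L)

# Zero-pad to four digits so every value has an explicit HHMM layout.
hhmm <- sprintf("%04d", time_mil)

# Split the padded string into hours and minutes.
hours   <- as.integer(substr(hhmm, 1, 2))
minutes <- as.integer(substr(hhmm, 3, 4))

data.frame(time_mil, hhmm, hours, minutes)
```

Once the values are zero-padded character strings, a single format such as HHMM describes the whole column, which is not possible while it is stored as a bare integer.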

rBatt commented 9 years ago

@mpinsky So far I've just changed the column classes for the example data set to allow things to make sense.

A good example is what I do here for WCTRI:

    wctri.data[,"HAUL_TYPE"] <- as.character(wctri.data[,"HAUL_TYPE"])

Which is so that I can define the factor, as you can see here:

    "HAUL_TYPE" = c(
        "0" = "opportunistic",
        "1" = "off-bottom",
        "3" = "standard bottom sample",
        "4" = "fishing power comparative sample"
    ),
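As a sketch of how a code-to-label definition like this could be applied to a column, the snippet below builds a labeled factor from the same lookup. The names `haul_type_levels`, `haul_type`, and `haul_type_factor` are hypothetical, and the example values are invented rather than taken from the WCTRI data.

```r
# Code-to-label lookup, copied from the comment above.
haul_type_levels <- c(
  "0" = "opportunistic",
  "1" = "off-bottom",
  "3" = "standard bottom sample",
  "4" = "fishing power comparative sample"
)

# Invented example values standing in for wctri.data[, "HAUL_TYPE"].
haul_type <- c(3L, 3L, 0L, 4L, 1L)

# Convert the codes to character so they match the lookup names,
# then attach the descriptive labels as factor levels.
haul_type_factor <- factor(
  as.character(haul_type),
  levels = names(haul_type_levels),
  labels = unname(haul_type_levels)
)

levels(haul_type_factor)
as.character(haul_type_factor)
```

Keeping the lookup as a named character vector means the same object can feed both the metadata description and the factor conversion.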

mpinsky commented 9 years ago

@rBatt This is looking like more detail than we need (defining factor levels). Can you simply reference and include a document that has this information? Because, as you've pointed out, doing this for every column in our datasets will be a huge undertaking.