Closed rBatt closed 8 years ago
I just experimented briefly with subsetting columns. I think there might not be much benefit to dropping columns that are highly repeated, because the .RData compression (I'm using "xz") is probably able to compress most of this repeated information quite well.
Something to keep in mind.
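A rough way to check this is to save the same table with and without a highly repeated column and compare the file sizes. The sketch below uses made-up data and a hypothetical helper, `rdata_size()`; the column names are borrowed from the list in this Issue, but the values are invented, so the sizes it reports say nothing about the real data sets.

```r
# Minimal sketch with invented data: does dropping a repeated column
# shrink an xz-compressed .RData file by much?
set.seed(1)
n <- 1e5
haul <- rep(seq_len(n / 100), each = 100)  # 1000 hauls, 100 rows each

dat <- data.frame(
  haulid = haul,
  vessel = rep(c("A", "B"), length.out = n / 100)[haul],  # constant within a haul
  wtcpue = runif(n),
  stringsAsFactors = FALSE
)

# Hypothetical helper: size in bytes of an object saved as xz-compressed .RData
rdata_size <- function(x) {
  f <- tempfile(fileext = ".RData")
  on.exit(unlink(f))
  save(x, file = f, compress = "xz")
  file.size(f)
}

full <- rdata_size(dat)
slim <- rdata_size(dat[, c("haulid", "wtcpue")])

# Fraction of the compressed size attributable to the repeated column
(full - slim) / full
```

If the fraction is small for the real data, dropping that kind of column buys little.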
We are just going to move the data files outside of the package, so this no longer needs to be done.
Need to make the package more lightweight by reducing the size of the associated data sets.
One step here is to drop extra columns where possible.
There are 3 basic approaches I'm going to take:
`date` and `time` separate, and I can drop those 2 after I create `datetime`
`CATCHJOIN` or `ALTERATIONDESC`).
That 3rd category is tricky, because it represents a loss of information relative to what is provided in the raw data. That's what I want feedback on in this Issue: which columns from the raw data do I need to keep?
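For the `date`/`time` case mentioned above, the combine-then-drop step could look roughly like this. The table, its values, the timestamp format, and the UTC time zone are all assumptions for illustration; the real regional data may store these fields differently.

```r
# Minimal sketch with invented data: build `datetime`, then drop `date` and `time`
hauls <- data.frame(
  haulid = c("h1", "h2"),
  date   = c("2010-06-01", "2010-06-02"),
  time   = c("08:30:00", "14:05:00"),
  stringsAsFactors = FALSE
)

# Assumed format and time zone; adjust per region
hauls$datetime <- as.POSIXct(
  paste(hauls$date, hauls$time),
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)

# The two source columns are now redundant
hauls[c("date", "time")] <- NULL
```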
Below I'll make a list of columns, organized under a few categories. I'm open to any feedback as to which columns would be needed; I can add more options if something is suggested that I don't have, but I'll use checking a box as a way of indicating that I intend to keep the column. The goal is to have the package contain only 1 data set per region, and raw data available by download (possibly via a package function). In other words, if a column isn't included here, it won't be easily accessible elsewhere.
Most of the following columns will have the same name in all regions, or a close equivalent in the regions that have them. If you edit this list to add a column that only needs to be included for a particular region (and not for other regions, even if the column exists there), please specify which region.
Time and Location of Sample
reg
year
season
datetime
lon
lat
stratum (the region's definition, not my custom definition)
haulid
Species ID and Characteristics
spp
common
sex
taxLvl
trophicLevel
trophicLevel.se
Additional Method Metadata
station
cruise
vessel
towduration
towarea
gearsize
geartype
comments
survey (e.g., summer groundfish)
Environmental and Sample Data
effort
stratumarea
btemp
stemp
depth
bsalin
ssalin
bdo
sdo
wind
wave
pressure
Biological Measurements
cnt
weight
length
cntcpue
wtcpue
NUMLEN (neus only)
Other
keep.row
row_flag
Many of the columns don't have values that change from row to row. In particular, many of the "meta data" columns don't vary within a haul, and the species taxonomy columns don't change at all (across species or regions). Just like we save all the species taxonomy (etc.) information in the `spp.key` data.table, we could save much of the haul- or cruise-specific information in separate data.tables. In fact, many of the raw data sets arrive in such a format, where environmental, survey, and biological data are separated. While this makes it less convenient to access the data, it lets us provide more information while staying under CRAN size limits. So there is definitely room to compromise.
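As a rough illustration of that split, the sketch below finds the columns that are constant within a haul, moves them into a separate haul-level table, and merges them back on demand. It uses base R data.frames to stay dependency-free (the same steps should carry over to data.tables), and the table `catch`, its values, and the name `haul_key` are invented for the example.

```r
# Minimal sketch with invented data: split haul-level columns into their own table
catch <- data.frame(
  haulid = c("h1", "h1", "h2", "h2", "h2"),
  spp    = c("Gadus morhua", "Limanda ferruginea",
             "Gadus morhua", "Homarus americanus", "Limanda ferruginea"),
  wtcpue = c(1.2, 0.4, 2.5, 0.9, 0.1),
  btemp  = c(6.1, 6.1, 7.3, 7.3, 7.3),
  depth  = c(50, 50, 82, 82, 82),
  stringsAsFactors = FALSE
)

# A column is "haul-level" if it takes exactly one value within every haul
candidates <- setdiff(names(catch), "haulid")
is_haul_level <- vapply(candidates, function(col) {
  all(tapply(catch[[col]], catch$haulid, function(v) length(unique(v)) == 1L))
}, logical(1))
haul_cols <- names(which(is_haul_level))

# One row per haul, analogous in spirit to spp.key for taxonomy
haul_key <- unique(catch[, c("haulid", haul_cols)])
stopifnot(!anyDuplicated(haul_key$haulid))

# The main table keeps only the columns that vary within a haul
catch_slim <- catch[, setdiff(names(catch), haul_cols)]

# Users who want the full table back can merge on haulid
rebuilt <- merge(catch_slim, haul_key, by = "haulid")
```

The cost is the extra `merge()` for anyone who needs the environmental columns; the benefit is that each haul-level value is stored once rather than once per species row.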